Abstract
Tandem mass spectra capture fragmentation patterns that provide key structural information about molecules. Although mass spectrometry is applied in many areas, the vast majority of small molecules lack experimental reference spectra. For over 70 years, spectrum prediction has remained a key challenge in the field. Existing deep learning methods do not leverage global structure in the molecule, potentially resulting in difficulties when generalizing to new data. In this work we propose the MassFormer model for accurately predicting tandem mass spectra. MassFormer uses a graph transformer architecture to model long-distance relationships between atoms in the molecule. The transformer module is initialized with parameters obtained through a chemical pretraining task, then fine-tuned on spectral data. MassFormer outperforms competing approaches for spectrum prediction on multiple datasets and accurately models the effects of collision energy. Gradient-based attribution methods reveal that MassFormer can identify compositional relationships between peaks in the spectrum. When applied to spectrum identification problems, MassFormer generally surpasses the performance of existing prediction-based methods.
Main
Mass spectrometry1,2 (MS) is an analytical technique used for identifying and quantifying chemicals in a mixture. Molecules from the sample are ionized and then detected by a mass analyser, which records information about the mass-to-charge ratio (m/z) of each ion in the form of a mass spectrum. Tandem mass spectrometry (MS/MS) is a variant of MS that includes a fragmentation step to isolate and break down charged molecules (called precursors) into smaller fragments. These ions appear as peaks in the fragment spectrum, and their m/z positions and relative abundances can be used to make inferences about the molecular structure of the original precursor. When coupled with online liquid chromatography (LC), a technique for chemical separation, the combined LC–MS/MS workflow is a powerful tool for analysing aqueous solutions. Because of its versatility, LC–MS/MS is commonly employed in a variety of domains, including proteomics3, metabolomics4,5, forensics6,7 and environmental chemistry8.
For most molecules, it is not possible to accurately simulate the fragmentation that occurs in a mass spectrometer. In principle, theoretical physics provides the tools to understand this process; however, existing first principles simulations are too slow to be used in a high-throughput manner and rely on approximations that limit their accuracy. This presents a fundamental problem for the MS field. Improving spectrum simulation is essential for deepening our knowledge of MS and gleaning valuable insights from experimental data.
Compound identification from MS/MS data is an important application of mass spectrometry, particularly in metabolomics. To identify a spectrum, practitioners often rely on database searches with large reference libraries, using spectrum similarity functions9,10,11,12 and domain expertise to identify matches. However, these databases have relatively poor coverage, containing on the order of \(10^{6}\) spectra, which represent approximately \(10^{4}\) unique compounds. This fails to cover even the relatively small set of human metabolites, which contains at least on the order of \(10^{5}\) compounds, according to the Human Metabolome Database13 (HMDB). One strategy to overcome the incompleteness of spectral libraries is to augment them with in silico mass spectra for compounds from a chemical structure database13,14,15. This can dramatically increase coverage of the reference data and improve the chance of finding a match.
The increasing availability of public13,16,17,18,19 and commercial20,21,22,23 MS datasets makes data-driven solutions to spectrum prediction appealing. One of the most popular spectrum prediction methods, competitive fragmentation modelling24,25,26 (CFM), combines combinatorial fragmentation and data-driven probabilistic modelling to predict spectra. CFM has been shown to be effective at predicting and annotating spectra and can be used for compound identification. However, due to its combinatorial bond-breaking algorithm, CFM is slow and struggles with modelling larger compounds (particularly those with multiple rings).
More recently, deep learning approaches have been applied to spectrum prediction. Current models use fully connected neural networks based on molecular fingerprints27 or graph neural networks with molecular graph representation28,29,30. However, both strategies rely on local chemical structures and do not easily model global interactions between atoms distant from each other in the molecule. Molecular fingerprints are typically restricted to capturing subgraphs of a fixed size and cannot represent global structures. Graph neural networks, while more flexible, only model local interactions in a single layer and require increased depth for long-range interactions31. However, excessive depth can often result in over-smoothing32,33, presenting challenges for larger molecules. While fragmentation events may appear to be local, typically involving individual bond breakages or minor rearrangements, global properties of the molecule can affect their likelihood of occurrence. Thus, each event must be understood in the context of the entire molecule.
In this work we adapt a state-of-the-art neural network architecture, the graph transformer34, to tackle MS/MS spectrum prediction. Graph transformers model pairwise interactions between all nodes in the graph and use degree and shortest path information to capture topological properties. Our experiments demonstrate how adapting a pretrained graph transformer model to spectrum prediction can result in state-of-the-art performance. Through rigorous comparisons with strong baseline methods, we demonstrate that our model can more accurately predict spectra for held-out compounds on two different spectrum datasets. We further validate the quality of our model’s predictions by investigating the effects of collision energy on the simulated spectra and interpret MassFormer’s predictions with gradient-based attribution methods. Finally, we demonstrate a realistic application by applying the model to the spectrum identification problem.
Concurrently with this work, a number of new deep learning spectrum predictors have been proposed. Some of these models take into account three-dimensional structural information35, or frame the learning task as formula36,37 or subgraph38,39 prediction. A detailed discussion of these approaches and their advantages and limitations can be found in Supplementary Information Section 2.2.
Results
Overview
MassFormer uses a graph transformer architecture (‘MassFormer architecture’) to predict spectra from an input molecule, represented as a molecular graph. A visual summary of our method is presented in Fig. 1. The input graph is first preprocessed into node and edge embeddings (‘Input featurization’). The node information encodes chemical properties about each atom, such as the element, and centrality properties like degree. The edge information captures topological relationships between atoms in the molecule, such as the shortest path length and the bonds traversed along the path. After preprocessing, the embeddings are passed to a graph transformer, which iteratively applies multihead self-attention (MHA) and multilayer perceptrons (MLPs) to manipulate the data. The learned self-attention weights capture associations globally between all pairs of nodes and are directly influenced by edge embeddings at each iteration. After several rounds of processing, the final embeddings are summarized into a single embedding that represents the entire molecule. This chemical representation is combined with spectrum metadata and passed to an MLP that makes a prediction in the form of a sparse positive vector. Each dimension of the vector represents a binned peak location, with the magnitude corresponding to the peak’s intensity. The metadata describe important information about the precursor (such as the adduct formed during ionization, termed the precursor adduct) and the instrument (such as the collision energy), both of which influence the fragmentation process and resulting spectrum1. The learnable input embeddings and graph transformer parameters are initialized from a Graphormer model34 that is pretrained on a large chemical dataset (‘Pretraining and fine-tuning’), then jointly tuned with the spectrum predictor MLP on the MS/MS dataset.
Predicting spectra for unseen compounds
To quantitatively evaluate our model’s generalization performance, we measure average cosine similarity on different types of held-out spectrum data. MassFormer is compared with two deep learning methods, a fingerprint (FP) neural network model (adapted from ref. 27) and a Weisfeiler–Lehman (WLN) graph neural network model40 (adapted from ref. 28), as well as with CFM24, a widely used probabilistic method for spectrum prediction that does not use deep neural networks (see the section ‘Baseline models’ for more details). The three deep learning models are trained on a portion of the National Institute of Standards and Technology 2020 Tandem Mass Spectrometry dataset21 (NIST). However, to benchmark CFM, we use the most recent publicly available pretrained model26, since retraining CFM on our dataset would be non-trivial due to its design. Differences in modelling assumptions and training data introduce some ambiguity in comparisons between CFM and other models.
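The evaluation metric named above can be illustrated with a minimal sketch, assuming that spectra have already been binned into equal-length intensity vectors. The function names are hypothetical, the zero-norm convention is an assumption, and any intensity transformation or weighting applied before the comparison is not described in this section and is therefore omitted.

```python
import math


def cosine_similarity(pred, target):
    """Cosine similarity between two equal-length binned intensity vectors."""
    if len(pred) != len(target):
        raise ValueError("spectra must share the same binning")
    dot = sum(p * t for p, t in zip(pred, target))
    norm_pred = math.sqrt(sum(p * p for p in pred))
    norm_target = math.sqrt(sum(t * t for t in target))
    if norm_pred == 0.0 or norm_target == 0.0:
        return 0.0  # assumed convention for an empty spectrum
    return dot / (norm_pred * norm_target)


def average_cosine_similarity(pairs):
    """Mean similarity over (predicted, measured) spectrum pairs."""
    return sum(cosine_similarity(p, t) for p, t in pairs) / len(pairs)


# Two toy spectra with the same peak pattern but different absolute scale.
print(cosine_similarity([1.0, 0.0, 2.0], [2.0, 0.0, 4.0]))
```

Because the cosine is invariant to overall scale, the metric compares relative peak patterns rather than absolute intensities.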
Model evaluation is performed on a held-out portion of the NIST test set and an additional dataset from MassBank of North America19 (MoNA). We consider two strategies for splitting the data: a simple random split by compound, based on the International Chemical Identifier Key (InChIKey) and a more challenging split that stratifies compounds by their Murcko scaffold41,42 before splitting. For more details on the datasets and training splits, refer to the section ‘Datasets and training splits’. To allow for a fairer comparison with CFM, we focus on [M + H]+ spectra and remove compounds from the test set that overlap with CFM’s training set. In Extended Data Fig. 1, we demonstrate results for the deep learning models on the full range of supported precursor adducts and confirm that they are consistent with results on the [M + H]+ subset.
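The idea shared by both splitting strategies is that every spectrum belonging to one group (a compound, or a scaffold) lands in the same partition. The sketch below shows this under simplifying assumptions: `group_split`, the record layout and the test fraction are all hypothetical, and the exact stratification procedure used for the scaffold split is described in ‘Datasets and training splits’ rather than reproduced here.

```python
import random
from collections import defaultdict


def group_split(records, key, test_fraction=0.1, seed=0):
    """Split records so that no group key appears in both partitions."""
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec)
    keys = sorted(groups)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(keys)
    n_test = max(1, round(test_fraction * len(keys)))
    test_keys = set(keys[:n_test])
    train = [r for k in keys if k not in test_keys for r in groups[k]]
    test = [r for k in keys if k in test_keys for r in groups[k]]
    return train, test


# Hypothetical records: one compound measured at two collision energies.
spectra = [{"inchikey": k, "nce": e} for k in "ABCDE" for e in (20, 40)]
train, test = group_split(spectra, key=lambda r: r["inchikey"], test_fraction=0.2)
print(len(train), len(test))
```

Passing a scaffold identifier as `key` instead of the InChIKey would keep all compounds that share a scaffold on the same side of the split, which is what makes that setting harder.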
MassFormer reliably outperforms other models across different datasets and splitting criteria (Fig. 2c). The performance of the three deep learning models is correlated across data splits, reflecting the underlying differences in split difficulty. NIST-InChIKey is the easiest split, resulting in the best performance; NIST-Scaffold exhibits split-induced distributional shift; MoNA-InChIKey exhibits library-induced distributional shift; MoNA-Scaffold exhibits both of the aforementioned challenges. Ground truth spectra for compounds that exist in both NIST and MoNA tend to be highly similar: ostensibly identical spectra from NIST and MoNA (same compound and metadata) have an average cosine similarity of approximately 0.97, suggesting that batch effect is not a driver of library-induced distributional shift. Thus, the decrease in performance on MoNA splits is most likely due to differences in compound and metadata coverage. Altogether, the experiments demonstrate that model performance strongly depends on the data splitting technique, and that MassFormer is consistently superior across all splits.
We also perform a more granular investigation of MassFormer’s performance by comparing average similarity across the top ten most frequent chemical classes in the NIST dataset (Extended Data Fig. 2), as indicated by ClassyFire43, an automatic chemical ontology tool. While most classes have similar performance, the model seems to perform exceptionally well on ‘lipids and lipid-like molecules’. This is perhaps unsurprising, as some types of lipids have predictable fragmentation patterns, facilitating the success of rule-based fragmenters in lipid spectrum prediction25,44. Conversely, simulating spectra for lipid-like compounds can be challenging for combinatorial methods like CFM due to their large size25,26, so MassFormer’s success in this area is encouraging. MassFormer predictions for individual held-out compounds are presented in Fig. 2a,b, with additional examples included in Extended Data Fig. 6.
Modelling the effect of collision energy
Collision energy is a key experimental parameter that can strongly influence the observed spectrum. Typically, increasing the collision energy results in more intense fragmentation, producing spectra with a higher proportion of smaller fragments. Accurately modelling collision energy is a key desideratum of any spectrum predictor. The relationship between collision energy and fragmentation is well represented in the NIST dataset, since each precursor is measured at ~11 different collision energies on average (see Supplementary Table 1 for breakdown by precursor adduct). Figure 3a illustrates how collision energy typically affects fragmentation. The four spectra (and associated spectrum predictions) all correspond to the same molecule at varying normalized collision energies (NCEs), which are expressions of collision energy relative to precursor mass (equation (14)). As collision energy increases, the peak intensities in the real mass spectra shift to the left, with the model’s predictions closely following this pattern. Figure 3c shows this relationship more generally, by plotting the distribution of mean peak m/z in the spectrum, where each peak is weighted by its relative intensity. Qualitatively, the real and predicted distributions for the held-out spectra are difficult to distinguish.
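The summary statistic plotted in Fig. 3c, the intensity-weighted mean m/z of a spectrum, can be computed as in the following sketch. The function name and the toy peak lists are illustrative assumptions, not data from the paper.

```python
def weighted_mean_mz(peaks):
    """Mean m/z of a spectrum, weighting each peak by its intensity.

    peaks: iterable of (mz, intensity) pairs.
    """
    peaks = list(peaks)
    total = sum(intensity for _, intensity in peaks)
    if total <= 0:
        raise ValueError("spectrum has no positive intensity")
    return sum(mz * intensity for mz, intensity in peaks) / total


# Toy example: at higher energy the small fragment dominates.
low_energy = [(50.0, 10.0), (150.0, 90.0)]
high_energy = [(50.0, 90.0), (150.0, 10.0)]
print(weighted_mean_mz(low_energy), weighted_mean_mz(high_energy))
```

A leftward shift of this statistic as the collision energy rises corresponds to the increased proportion of small fragments described above.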
Explaining peak predictions with gradient-based attribution
While deep learning models are often viewed as black boxes, gradient-based attribution methods can provide some understanding of model behaviour. Broadly speaking, these methods work by taking the gradient of the model’s output with respect to the input features to identify parts of the data that the model is most sensitive to. Gradient × Input (GI) attribution45,46 is one of the simplest methods for gradient-based attribution and can be useful for analysing transformer models47. GI works by measuring the dot product between the input vector and its gradient: this can give information about the effects (positive or negative) of changing that feature on the model’s output (‘Gradient-based feature attribution’).
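As a rough illustration of GI attribution, the sketch below scores each atom by the dot product between its embedding and the gradient of a scalar model output with respect to that embedding. To stay dependency-free it estimates the gradient with central finite differences, which is an assumption made for this example; in practice an automatic differentiation framework would supply the gradient. The toy model and all names are hypothetical.

```python
def numerical_gradient(f, x, eps=1e-6):
    """Central finite-difference gradient of scalar function f at point x."""
    grad = []
    for k in range(len(x)):
        up = list(x)
        down = list(x)
        up[k] += eps
        down[k] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def gradient_x_input(f, embeddings):
    """One GI score per atom: dot(embedding, d f / d embedding)."""
    dim = len(embeddings[0])
    flat = [v for emb in embeddings for v in emb]
    grad = numerical_gradient(f, flat)
    return [
        sum(flat[a * dim + k] * grad[a * dim + k] for k in range(dim))
        for a in range(len(embeddings))
    ]


# Toy stand-in for "predicted intensity at one peak": a linear readout.
weights = [1.0, -2.0, 0.5, 0.0]
toy_model = lambda x: sum(w * v for w, v in zip(weights, x))
atom_embeddings = [[1.0, 1.0], [2.0, 3.0]]
print(gradient_x_input(toy_model, atom_embeddings))
```

The sign of each score indicates whether the atom's features push the chosen output up or down, and repeating the calculation for every peak yields one GI map per peak.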
Each peak in a mass spectrum corresponds to a fragment of the original precursor. In organic compounds, non-oxygen hetero-atoms are much less common than the core carbon/hydrogen/oxygen building blocks, so fragments that contain such hetero-atoms are in some sense related. It is therefore reasonable to expect the input dependencies for these fragments and their corresponding peaks to be correlated.
By applying GI attribution methods, we demonstrate that MassFormer can distinguish peaks based on hetero-atom composition. For each peak, a GI map is calculated using the gradient of the model’s output at the peak location with respect to the input atom embeddings. Because of their high complexity (and redundancy), the GI maps are projected to two dimensions using principal component analysis (PCA) to facilitate interpretation. The SIRIUS formula annotation tool48,49 is applied to identify which peaks contain a hetero-atom of interest. The linear separability of the hetero-atom labels in the projected GI space is used to quantify the peak differences. For each spectrum, we fit a logistic regression model in the two-dimensional projection space (using the hetero-atom peak labels as binary targets) and report its accuracy as a measurement of linear GI map separability.
The entire process is illustrated in Fig. 4. The GI maps for a particular spectrum (Fig. 4a) are computed, projected and plotted in two dimensions (Fig. 4b). In this example, it is clear that the nitrogen and non-nitrogen peaks are perfectly linearly separable. By repeating this analysis for many spectra from the dataset (Fig. 4c), we show that the nitrogen labelling in the projected GI map space results in markedly higher linear separability than random labelling. In Extended Data Fig. 3 we show that this pattern also holds for four other hetero-atoms (chlorine, phosphorus, sulfur and fluorine). All together, these data suggest that MassFormer learns implicit compositional relationships between predicted peaks in the spectrum and that these relationships can be revealed using gradient-based attribution methods.
Identifying spectra by ranking candidates
Spectrum identification is a major application for spectrum prediction models. MS-based compound identification tools (Supplementary Information Section 2.1) are often benchmarked on the Critical Assessment of Small Molecule Identification (CASMI) Contest50,51. This contest works by acquiring experimental spectra for a small set of compounds (called queries) that are not represented in existing public libraries and then scoring competing methods based on their ability to identify the correct structure. We compare MassFormer with the three baseline models (FP, WLN, CFM) on 124 [M + H]+ spectra from the CASMI 2016 Contest. Each of the three deep learning models is trained on a 90% partition of the NIST dataset, with the remaining 10% for validation. Additional filtering is applied to prevent leakage of query compounds into the training or validation data. Each query spectrum is associated with a list of candidate compounds, which the models rank based on the cosine similarity of their predicted spectra with the query. The ranking metrics are summarized in Table 1; additional metrics focusing on different subsets of the candidate sets can be found in Table 2, while plots of the rank distributions can be found in Extended Data Fig. 4. Overall, the deep learning models seem to consistently outperform CFM in all metrics. MassFormer is generally superior to other methods, except for CASMI 2016 top-1, where it is outperformed by the FP model.
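The candidate-ranking procedure can be sketched as follows, assuming that a predicted binned spectrum is already available for every candidate. The function names, the candidate identifiers and the tie-breaking behaviour (ties keep insertion order) are assumptions made for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two binned spectra of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def rank_of_true_candidate(query, predicted, true_id):
    """1-based rank of the true compound when candidates are sorted by
    similarity between their predicted spectrum and the query."""
    scores = {cid: cosine(query, spec) for cid, spec in predicted.items()}
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(true_id) + 1


def top_k_accuracy(ranks, k):
    """Fraction of queries whose true compound is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


query = [0.0, 1.0, 1.0]
candidates = {"cand_a": [1.0, 0.0, 0.0],
              "cand_b": [0.0, 1.0, 1.0],
              "cand_c": [0.0, 1.0, 0.0]}
print(rank_of_true_candidate(query, candidates, "cand_b"))
```

Since candidate sets differ in size between queries, raw ranks are not directly comparable across benchmarks, which is why normalized metrics are also reported.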
The CASMI 2016 Contest has a number of drawbacks as a benchmark dataset. Because of the limited coverage of precursor adducts (only [M + H]+ for positive mode) and collision energies (20, 35 and 50 NCE), it is unclear if the results will generalize to other common experimental configurations. Additionally, many of the compounds in the CASMI 2016 query set have a high similarity to existing compounds in modern spectral libraries. To address these concerns, we also evaluate the models on the more recent CASMI 2022 Contest52 and a novel CASMI-inspired spectrum identification task called NIST20 Outlier, which uses structural outliers from the NIST dataset as queries. Full details about the setup for these tasks can be found in the section ‘Spectrum identification task setup’, and information about the size and diversity of the candidate sets can be found in Extended Data Fig. 5. Overall, CASMI 2022 results are the poorest across the board, at least in part because of the larger number of candidates per query. In general, differences between CASMI 2022 and NIST20 Outlier are less pronounced when comparing normalized metrics, which account for variations in candidate counts. MassFormer outperforms other models across almost every experiment and metric. CASMI 2022 and NIST20 Outlier are particularly challenging for CFM because of the addition of spectra with collision energies and precursor adducts that the model does not support natively.
All together, these experiments demonstrate how gains in spectrum prediction will often, but not always, translate to improvements in spectrum identification29,37,39. MassFormer is the superior spectrum predictor, as indicated by its higher query spectrum similarity; this is generally reflected in its strong ranking metrics. However, the FP model performs surprisingly well in terms of top-k retrieval, even eclipsing MassFormer’s performance in top-1 CASMI 2016, despite its relatively poor accuracy as a spectrum predictor. Spectrum identification requires making predictions for a large number of candidate structures, many of which are out-of-distribution with respect to the training set. Performance on an independent and identically distributed held-out test set does not fully describe a model’s behaviour under distributional shift, which is generally more difficult to estimate. In fact, optimized models which perform similarly on an in-distribution evaluation may perform differently on out-of-distribution data53,54,55.
Discussion
In this work, we introduce MassFormer, a new method for predicting MS/MS spectra for small molecules using a graph transformer architecture. We validate our model’s performance on two independent MS datasets, NIST and MoNA, and show that it can produce realistic spectra. We verify that the model captures prior knowledge about the fragmentation process by investigating the effect of collision energy on spectrum predictions. Using gradient-based attributions, we explain model predictions by identifying peaks with similar element composition. We benchmark the model on three different spectrum identification tasks and show that it can be useful for inferring molecular structure. Our work represents one of the first transformer-based spectrum predictors for MS/MS data, with extensive benchmarking and implementations of other models from the literature.
Our method has a number of limitations. The current model is restricted to positive mode electrospray ionization Orbitrap spectra with specific precursor adducts. Extending the model to accommodate additional ionization methods (such as electron ionization and negative mode electrospray ionization), adduct types (such as [M − H]− and [M + K]+) and instrument types (such as quadrupole time-of-flight) would broaden MassFormer’s impact. By supporting additional mass spectrum modalities, the model would be able to leverage new sources of MS data and potentially transfer knowledge across modalities. Prediction resolution is another limitation: most experiments in this work use spectra binned to 1 Da, but many modern instruments provide much higher resolution. This can be useful for identification, as the increased peak resolution can distinguish ions with similar masses. Finally, while our model is explainable to some degree (through gradient-based attribution), it does not provide true peak annotations like some other methods26,36,37,39,49. Formula and fragment annotations are often useful for practitioners: they can improve confidence in the model’s predictions by allowing experienced users to manually validate predicted peak patterns against prior knowledge about fragmentation mechanisms. Additionally, when predicting spectra for identification, peak annotations can be helpful for inferring higher level properties of the compound, even when the spectrum on the whole contains considerable noise.
MassFormer has a number of exciting applications, largely focused on MS-based compound identification. We have already demonstrated how MassFormer can be used to identify a spectrum given a list of candidate structures. SIRIUS49 is a popular tool for identification that does not rely on spectrum prediction. It may be possible to combine MassFormer with SIRIUS (or another existing tool56) to improve structure identifications. For example, SIRIUS uses information in the query spectrum to predict chemical features (such as precursor formulae and fingerprints) that help with identification. These features could be used to refine a set of candidate compounds that would subsequently be ranked by predicted spectrum similarity. Alternatively, MassFormer could improve performance of a spectrum-to-structure generative model57,58,59 by predicting spectra for unlabelled compounds, similar to how CFM has been used to augment training data59. MassFormer could also be incorporated into a reinforcement learning framework, providing a spectral similarity signal that could help guide the structure prediction agent to the correct compound during training and inference60. In targeted metabolomics analysis, spectrum predictors like MassFormer could help identify collision energies that are most useful for distinguishing compounds of similar mass61 by measuring the pairwise similarities of their predicted spectra under different settings. Finally, MassFormer has potential for use in decoy generation62,63, which plays an important role in false discovery rate calibration in untargeted metabolomics experiments. Target-decoy methods work by introducing a number of pseudo-random data points (decoys) that are similar in distribution to real data, allowing for empirical estimation and tuning of confidence thresholds to meet a false discovery rate criterion. Applying MassFormer to predict noisy or adversarial spectra for compounds that are not present in the sample might be a viable strategy for generating realistic decoys and could reduce the chance of incorrect compound identifications.
Methods
Problem formulation
Spectrum prediction can be viewed as a supervised learning problem, with a dataset \({\{(\bf{{x}}^{i},\bf{{z}}^{i},\bf{{y}}^{i})\}}_{i = 1}^{n}\) where \({\bf{x}}^{i}\) is a molecule and \({\bf{y}}^{i}\) is its spectrum under experimental conditions \({\bf{z}}^{i}\). The goal is to learn the parameters θ of the prediction function \({f}_{\bf{\uptheta} }:{{{\mathcal{X}}}}\times {{{\mathcal{Z}}}}\to {{{\mathcal{Y}}}}\) that maps chemicals in \({{{\mathcal{X}}}}\) to spectra in \({{{\mathcal{Y}}}}\), conditioned on metadata in \({{{\mathcal{Z}}}}\). Mass spectra can be represented as a set of peaks, each of which has an m/z location and an intensity. By discretizing the peak locations into m fixed-width bins (similar to refs. 27,28,29,35), a mass spectrum can be represented as an m-dimensional sparse vector, where each peak at location j has intensity \({y}_{j}\ge 0\). The problem of spectrum prediction can thus be formulated as vector regression, with \({{{\mathcal{Y}}}}={{\mathbb{R}}}_{\succcurlyeq 0}^{m}\). Following ref. 28, the spectral metadata \(\bf{z} \in {{{\mathcal{Z}}}}\) can be provided as side information to the input molecule \(\bf{x} \in {{{\mathcal{X}}}}\). These metadata consist of instrument settings and other covariates that might affect the predicted spectrum.
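The discretization step can be illustrated with a short sketch that turns a peak list into a fixed-length non-negative vector. Several details are assumptions made for this example rather than statements about the authors' preprocessing: the function name, the left-closed bin edges, summing intensities that fall into the same bin, and silently dropping peaks beyond the last bin.

```python
def bin_spectrum(peaks, num_bins, bin_width=1.0):
    """Convert (mz, intensity) peaks into a fixed-length intensity vector.

    Bin j covers the interval [j * bin_width, (j + 1) * bin_width).
    """
    vector = [0.0] * num_bins
    for mz, intensity in peaks:
        j = int(mz // bin_width)
        if 0 <= j < num_bins and intensity > 0:
            vector[j] += intensity  # assumed: co-binned peaks are summed
    return vector


# Toy peak list; the last peak lies outside the 200-bin range.
peaks = [(41.04, 5.0), (41.90, 2.0), (105.07, 10.0), (2000.0, 1.0)]
binned = bin_spectrum(peaks, num_bins=200)
print(binned[41], binned[105])
```

With a bin width of 1 Da most entries of the vector are zero for a typical small molecule, which is the sparsity referred to above.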
Input featurization
The featurization of the input molecule x is critical, as it influences the structure of the spectrum prediction function \({f}_{\bf{\uptheta} }\) and can have an impact on downstream performance. Molecular fingerprints (also called molecular descriptors) represent molecules using expert-designed feature engineering. Common feature choices include presence of predefined substructures (used in the molecular access system (MACCS) fingerprint64) and hashed local substructure counts (used in the extended connectivity (Morgan) fingerprint65). Molecular graph representations capture the structure of a molecule by explicitly representing atoms as nodes and bonds as edges. The node features can encode various chemical properties associated with the atom (that is, element, formal charge, number of bonded hydrogens), while the edge features can encode bond information (that is, bond type, aromaticity). Such representations naturally lend themselves to graph neural networks and graph transformers (‘MassFormer architecture’) and can be more expressive than fingerprints. Our datasets do not include any conformer or three-dimensional coordinate information, although these properties can be estimated to varying degrees of accuracy42,66 and might be helpful for spectrum prediction35.
The features used for MassFormer are summarized in Supplementary Table 4. The node and edge features were chosen to be identical to the pretrained Graphormer model (‘Pretraining and fine-tuning’), to maintain compatibility. However, some of these features used in the pretrained model (formal charge, radical state and stereochemical information) were not applicable to our data and have been omitted for clarity.
MassFormer architecture
Transformers67 are a general family of neural networks that model relationships between elements of a set. Originally developed for neural machine translation68, transformer models have proven useful in a variety of domains69,70. A number of graph transformers have been proposed34,71,72,73,74,75, motivated by the ability to model pairwise global interactions between all nodes in the graph. Many graph neural networks, such as graph attention network76, are similar in structure to graph transformers but can only model local relations in a single layer and require large depth to model interactions over longer distances31.
Our approach adapts the Graphormer34 architecture, a recent graph transformer model that boasts impressive results on chemical property prediction tasks34,77. The distinguishing characteristic of the Graphormer architecture is its unique positional encoding scheme. It uses shortest path information between nodes and associated edge embeddings along that path as a form of relative positional encoding. The shortest path information is computed as a preprocessing step for each graph, using the Floyd–Warshall algorithm78.
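The shortest-path preprocessing can be sketched with a plain Floyd–Warshall implementation over an unweighted molecular graph. The function name and the bond-list input format are assumptions for illustration; the sketch returns path lengths only and does not record which bonds lie along each path, which the positional encoding also requires.

```python
def shortest_path_lengths(num_atoms, bonds):
    """All-pairs shortest path lengths (in bonds) via Floyd-Warshall.

    bonds: iterable of (atom_index_a, atom_index_b) pairs.
    Unreachable pairs keep the value float('inf').
    """
    inf = float("inf")
    dist = [[0 if i == j else inf for j in range(num_atoms)]
            for i in range(num_atoms)]
    for a, b in bonds:
        dist[a][b] = dist[b][a] = 1  # each bond has unit length
    for k in range(num_atoms):       # allow paths through intermediate atom k
        for i in range(num_atoms):
            for j in range(num_atoms):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


# Six-membered ring (heavy atoms of benzene), atoms indexed 0..5.
ring = [(i, (i + 1) % 6) for i in range(6)]
print(shortest_path_lengths(6, ring)[0])
```

The cubic cost in the number of atoms is paid once per molecule during preprocessing rather than at every training step.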
Like most transformers, MassFormer processes the input iteratively using MHA and MLP. Assume the model has L layers, each with M attention heads. Let \({{\bf{h}}}_{i}^{(l)}\in {{\mathbb{R}}}^{d\times 1}\) be the representation of node i at layer l, with d being the node embedding size and \({{{\bf{h}}}_{i}^{(0)}}\) consisting of the node features described in Supplementary Table 4. The MHA operation for attention head m in layer l is described by equation (1), where \({{{W}}}_{V}^{\,(m,l)}\in {{\mathbb{R}}}^{d/M\times d}\) is a learnable projection matrix. The MLP operation is described by equation (2), where \({f}_{\mathbf{\upphi} }^{\;(l)}\) is the multilayer perceptron at layer l of the transformer.
Note that the intermediate representations for the attention heads \({{\bf{h}}}_{i}^{\;(m,l\,)}\in {{\mathbb{R}}}^{d/M\times 1}\) are concatenated along the head dimension before being processed by the MLP \({f}_{\bf{\upphi} }^{\;(l)}\). For simplicity, dropout79 and layer normalization80 operations have been omitted.
The attention mechanism \({a}_{i,\;j}\) is described in equations (3) and (4), without layer and attention head indices (to improve clarity). \({{{W}}}_{K},{{{W}}}_{Q}\in {{\mathbb{R}}}^{d/M\times d}\) are the standard learnable key and query projection matrices, \({n}_{i,\;j}\in {\mathbb{N}}\) is the shortest path distance between nodes i and j, and \(b({n}_{i,\;j})\in {\mathbb{R}}\) is a learnable scalar indexed by \({n}_{i,\;j}\). The last variable \({c}_{i,\;j}\in {\mathbb{R}}\) is the edge embedding term described by equation (4), where \({{{\bf{e}}}_{i,\;j,p}\in {{\mathbb{R}}}^{d/M\times 1}}\) is the embedding corresponding to the pth edge in the shortest path between i and j, and \({ \bf{{{{w}}}}_{p} \in {{\mathbb{R}}}^{d/M\times 1}}\) is a learnable weight for that position.
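A minimal sketch of how these quantities might combine is given below. It assumes the form used in the original Graphormer formulation: a scaled query–key dot product, plus the distance-indexed scalar and the edge term, normalized with a softmax over j, where the edge term is the average over path positions of the dot product between the positional weight and the edge embedding. The scaling factor, the averaging and all function names are assumptions of this sketch and should be checked against equations (3) and (4).

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def edge_term(path_edge_embeddings, position_weights):
    """Assumed form of c_ij: mean over path positions p of w_p . e_ijp."""
    n = len(path_edge_embeddings)
    if n == 0:
        return 0.0  # no edges between a node and itself
    total = sum(dot(w, e) for w, e in zip(position_weights, path_edge_embeddings))
    return total / n


def attention_row(query_i, keys, spatial_bias, edge_bias):
    """Attention weights of node i over all nodes j.

    query_i and keys are the already-projected query and key vectors;
    spatial_bias[j] plays the role of b(n_ij), edge_bias[j] that of c_ij.
    """
    scale = math.sqrt(len(query_i))
    logits = [dot(query_i, k) / scale + b + c
              for k, b, c in zip(keys, spatial_bias, edge_bias)]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


# Two nodes with identical keys: only the bias terms separate them.
print(attention_row([1.0, 0.0], [[1.0, 0.0], [1.0, 0.0]], [0.0, 1.0], [0.0, 0.0]))
```

Because the bias terms are added to the logits before the softmax, graph topology can raise or lower the attention between any pair of atoms regardless of how many bonds separate them.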
Before transformer processing, a readout node (with index N + 1) is initialized with a unique embedding and connected to all other nodes in the graph with a special edge type. Similar to the CLS token in language transformers81, the readout node’s final embedding \({{{\bf{h}}}_{N+1}^{L}\in {{\mathbb{R}}}^{d\times 1}}\) is interpreted as a summarized representation of the input molecule. This chemical embedding is concatenated with the metadata embedding \({{\bf{z}}}\in {{\mathbb{R}}}^{{d}^{{\prime} }\times 1}\) and passed to a large MLP \({f}_{\bf{\uppsi} }\) (equation (5)) which outputs a vector \({{\bf{s}}}\in {{\mathbb{R}}}^{{d}^{{\prime}{\prime}}\times 1}\), where d″ is the hidden dimension.
Following ref. 27, we apply bidirectional prediction to get the resulting spectrum \({{{\hat{\bf{y}}}}\in {{\mathbb{R}}}^{q\times 1}}\), where q is the number of mass bins. This process works by calculating a forward spectrum \({{\hat{\bf{y}}}}_{\rm{F}}({{\bf{s}}})\) (equation (6)) and reverse spectrum \({{\hat{\bf{y}}}}_{\rm{R}}({{\bf{s}}})\) (equation (7)), then averaging them bin-wise using a gating mechanism \({{\hat{\bf{y}}}}_{\rm{G}}({{\bf{s}}})\) (equations (8) and (9)). The forward, reverse and gating functions are linear transformations of s parametrized by weight matrices \({{{W}}}_{\rm{F}},{{{W}}}_{\rm{R}},{{{W}}}_{\rm{G}}\in {{\mathbb{R}}}^{q\times {d}^{{\prime}{\prime}}}\) and bias vectors \({{\bf{b}}}_{\rm{F}},{{\bf{b}}}_{\rm{R}},{{\bf{b}}}_{\rm{G}}\in {{\mathbb{R}}}^{q\times 1}\) respectively.
Note that \({\mathbb{I}}[x]\in \{0,1\}\) is the indicator function, mp is the index of the bin corresponding to the precursor mass and \(\tau \in {\mathbb{N}}\) is a tolerance parameter. The final spectrum \({{\hat{\bf{y}}}}\) (equation (10)) is constrained to be non-negative by a rectified linear (ReLU) activation82 and is set to zero for all bin indices i > mp + τ, which prevents the model from predicting peaks at m/z values substantially above that of the precursor. The spectrum is then normalized according to context (L2 normalization for training, L1 normalization for inference).
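The gating and masking steps described above can be illustrated with a small sketch. It assumes zero-indexed bins, a sigmoid gate and a reverse head that is read out at index mp + τ − i (our reading of the convention in ref. 27); these choices, the function names and the toy weights are illustrative assumptions and not the exact form of equations (6)–(10).

```python
import math


def linear(W, b, s):
    """Return W s + b for a weight matrix W (list of rows) and a bias vector b."""
    return [sum(w * x for w, x in zip(row, s)) + b_i for row, b_i in zip(W, b)]


def bidirectional_spectrum(s, W_F, b_F, W_R, b_R, W_G, b_G, m_p, tau):
    """Gate a forward and a reverse prediction, then mask bins above m_p + tau."""
    q = len(b_F)
    forward = linear(W_F, b_F, s)
    reverse_raw = linear(W_R, b_R, s)
    # Assumption: the gate is squashed to (0, 1) with a sigmoid.
    gate = [1.0 / (1.0 + math.exp(-g)) for g in linear(W_G, b_G, s)]

    y_hat = []
    for i in range(q):
        if i > m_p + tau:
            # Bins above the precursor (plus tolerance) are forced to zero.
            y_hat.append(0.0)
            continue
        # Assumption: the reverse head is indexed relative to the precursor bin.
        j = m_p + tau - i
        reverse_i = reverse_raw[j] if j < q else 0.0
        mixed = gate[i] * forward[i] + (1.0 - gate[i]) * reverse_i
        y_hat.append(max(0.0, mixed))  # ReLU keeps intensities non-negative
    return y_hat


# Toy example: q = 6 bins, hidden size 2, precursor in bin 2, tolerance 1.
q, d = 6, 2
zeros = [[0.0] * d for _ in range(q)]
y_hat = bidirectional_spectrum(
    s=[1.0, -1.0],
    W_F=zeros, b_F=[1.0] * q,
    W_R=zeros, b_R=[0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
    W_G=zeros, b_G=[0.0] * q,
    m_p=2, tau=1,
)
```

In this toy configuration the gate equals 0.5 everywhere, so each unmasked bin is the plain average of the forward value and the index-reversed value, and bins 4 and 5 are zero.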
Pretraining and fine-tuning
Fine-tuning a pretrained model can offer improved performance over training a randomly initialized model from scratch, particularly when data scarcity is a concern. We initialize the parameters of our graph transformer module and trainable input node and edge embeddings with the corresponding parameters of a pretrained Graphormer model. This model was originally trained on the PCQM4Mv2 dataset83,84, a large dataset of approximately 4 million molecular graphs and associated quantum chemical properties simulated with density functional theory. The pretraining task is a supervised graph-level regression problem: predicting the energy gap between the highest occupied molecular orbital and the lowest unoccupied molecular orbital (the HOMO-LUMO gap). While this task is not directly related to MS, the roughly 100-fold larger compound coverage of the PCQM4Mv2 dataset provides an opportunity for the model to learn general chemical representations that transfer to the spectrometry task. In Supplementary Table 7, we perform model ablations to determine the relative contributions of different aspects of the fine-tuning process. Our experiments demonstrate that using pretrained weights is necessary to scale up the number of model parameters while maintaining training stability and performance. We also find random re-initialization of the layer normalization80 statistics to be helpful. The other key component of MassFormer, the spectrum prediction MLP, is always initialized randomly. Both modules are fine-tuned jointly for 20 epochs using a linearly decaying learning rate. For full details on the training procedure, please refer to the code repository (‘Code availability’).
Loss and similarity calculations
Since MS peak intensities are relative, it is advisable to use loss functions that are invariant to scaling. We choose cosine distance (equation (11)) as the loss function, where \({{\hat{\bf{y}}}}={f}_{\mathbf{\uptheta} }(\bf{x},{{\bf{z}}})\) is the predicted spectrum and y is the real spectrum. Cosine similarity is commonly used to compare spectra, so minimizing the cosine distance (thus maximizing similarity) is a natural choice and has been shown to work well in other prediction models27,28,29. During model training, a log transformation is applied to the intensities (similar to refs. 28,29) before calculating the loss, to increase the importance of low-intensity peaks in the objective function.
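To make the scale invariance and the effect of the log transformation concrete, the following minimal sketch computes cosine distance on raw and log-transformed intensities. The use of log(1 + x) as the transformation and the function names are assumptions made for illustration; the exact transformation used in training is defined in the code repository.

```python
import math


def cosine_distance(y_true, y_pred, eps=1e-12):
    """Return 1 - cosine similarity between two binned spectra."""
    dot = sum(a * b for a, b in zip(y_true, y_pred))
    norm_true = math.sqrt(sum(a * a for a in y_true))
    norm_pred = math.sqrt(sum(b * b for b in y_pred))
    return 1.0 - dot / max(norm_true * norm_pred, eps)


def log_transform(spectrum):
    """Assumed transformation: log(1 + intensity), applied bin-wise."""
    return [math.log1p(v) for v in spectrum]


# The prediction misses a low-intensity peak in the second bin.
target = [100.0, 1.0, 0.0]
prediction = [100.0, 0.0, 0.0]

raw_loss = cosine_distance(target, prediction)
log_loss = cosine_distance(log_transform(target), log_transform(prediction))
```

In this toy case the missed low-intensity peak barely changes the raw cosine distance but is penalized much more strongly after the log transformation, which is the motivation given above.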
As there are often multiple spectra with different collision energies that correspond to the same precursor, it can be useful to ‘merge’ target and predicted spectra across collision energies by averaging their intensities in binned spectrum space. Variations of collision energy merging are commonplace in the field49,51,56. Merging helps prevent distorted similarity scores resulting from uninformative spectra with very few peaks, which tends to happen when the collision energy is either too high or too low.
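Collision energy merging, as described above, amounts to a bin-wise average. A minimal sketch follows, assuming the spectra have already been binned to a common length; the function name is hypothetical.

```python
def merge_spectra(binned_spectra):
    """Average several binned spectra of the same precursor bin by bin."""
    if not binned_spectra:
        raise ValueError("at least one spectrum is required")
    q = len(binned_spectra[0])
    if any(len(s) != q for s in binned_spectra):
        raise ValueError("all spectra must share the same number of bins")
    n = len(binned_spectra)
    return [sum(s[i] for s in binned_spectra) / n for i in range(q)]


# Two spectra of one precursor acquired at different collision energies.
low_energy = [1.0, 0.0, 0.0]
high_energy = [0.0, 0.0, 3.0]
merged = merge_spectra([low_energy, high_energy])
```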
Another important consideration is the method of similarity score aggregation. The simplest approach is to average scores over spectra, denoted as ‘spectrum aggregation’. An alternate approach is to first average scores per molecule (that is, across precursor adducts), then average again across molecules. This approach, denoted as ‘molecule aggregation’, ensures that molecules with many precursor adducts are not over-represented in the final score.
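The difference between the two aggregation schemes can be seen in a short sketch. The function names and the toy scores are illustrative assumptions.

```python
from collections import defaultdict


def spectrum_aggregation(scores):
    """Average similarity scores directly over spectra."""
    return sum(scores) / len(scores)


def molecule_aggregation(scores, molecule_ids):
    """Average scores within each molecule first, then across molecules."""
    per_molecule = defaultdict(list)
    for score, mol in zip(scores, molecule_ids):
        per_molecule[mol].append(score)
    molecule_means = [sum(v) / len(v) for v in per_molecule.values()]
    return sum(molecule_means) / len(molecule_means)


# Molecule A has three spectra (adducts), molecule B has one.
scores = [1.0, 1.0, 1.0, 0.0]
molecule_ids = ["A", "A", "A", "B"]

by_spectrum = spectrum_aggregation(scores)
by_molecule = molecule_aggregation(scores, molecule_ids)
```

Here spectrum aggregation gives 0.75 because molecule A contributes three scores, whereas molecule aggregation gives 0.5 because each molecule counts once.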
Throughout the manuscript (Fig. 2 and Extended Data Figs. 1 and 2), spectral similarity is reported as average cosine similarity on untransformed merged spectra with molecule aggregation. The impact of spectrum intensity transformations (log transform, square root transform and precursor peak removal), similarity functions (cosine, Jensen–Shannon and Jaccard), spectrum merging and score aggregation methods is explored in detail in Supplementary Tables 9 and 10. While individual model scores vary depending on the exact configuration of the similarity calculation (Supplementary Table 10), the order of the models remains largely consistent, with MassFormer being ranked best for the vast majority of configurations (Supplementary Table 9).
Gradient-based feature attribution
GI45,46 is an attribution method that assigns importance scores to input variables based on the sensitivity of the model’s predictions to changes in those variables, estimated using gradients. It satisfies the conservation axiom, an underlying assumption for many explainable AI approaches, which posits that “scores assigned to input variables and forming the explanation must sum to the output of the network”47. If a model is sensitive to changes in certain parts of the input, these features are taken to be more important for making a correct prediction. GI methods compute the attribution score as the dot product of the gradient vector with the input vector. Inputs with scores close to 0 are unimportant, while those with large positive or negative scores are interpreted as contributing positively or negatively (respectively) to the prediction.
More formally, let \({{\bf{x}}}\in {{\mathbb{R}}}^{D}\) be an input vector with D dimensions, let \({{\bf{y}}}\in {{\mathbb{R}}}^{K}\) be an output value associated with x and \({f}_{\bf{\uptheta} }({{\bf{x}}}):{{\mathbb{R}}}^{D}\to {{\mathbb{R}}}^{K}\) be a neural network. Let \({{{\mathcal{L}}}}({{\bf{y}}},{{\hat{\bf{y}}}}):{{\mathbb{R}}}^{K}\times {{\mathbb{R}}}^{K}\to {\mathbb{R}}\) be a scalar loss function (for example, cosine distance). The GI score for the model on this input vector is defined precisely by equation (12):
While GI score calculation is generally applicable to any type of fully differentiable neural network, there are certain aspects of the transformer architecture (the self-attention mechanism67 and layer normalization module80) that violate the conservation axiom and, in practice, reduce the quality of model explanations47. To address this problem, we make slight modifications to these modules when calculating GI scores, as recommended in ref. 47.
In our experiments, the neural network fθ in equation (12) is MassFormer, which maps input molecules to K-dimensional binned spectrum vectors (typically K = 1,000). To compute the GI scores for a peak in bin k, we use a loss function that zeros out all other peaks in the spectrum. More precisely, we define the loss \({{{{\mathcal{L}}}}}_{k}\) for bin k using equation (13):
In the above formulation, ek is the kth standard basis vector for \({{\mathbb{R}}}^{K}\) (in other words, a one-hot vector where the kth entry is 1).
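As an illustration of the GI computation, the sketch below multiplies a numerically estimated gradient by the input for a toy bias-free linear model. Two simplifications are assumed: the scalar being differentiated is the model output in bin k (that is, the product of ek with the prediction) rather than the loss of equation (13), and gradients are estimated with central differences instead of automatic differentiation. All names are hypothetical.

```python
def numerical_gradient(fn, x, h=1e-6):
    """Estimate the gradient of a scalar function with central differences."""
    grad = []
    for d in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[d] += h
        x_minus[d] -= h
        grad.append((fn(x_plus) - fn(x_minus)) / (2.0 * h))
    return grad


def gradient_times_input(fn, x):
    """Per-dimension GI contributions; their sum is the gradient-input dot product."""
    grad = numerical_gradient(fn, x)
    return [g * x_d for g, x_d in zip(grad, x)]


# Toy stand-in for the model: a bias-free linear map from D = 3 inputs to K = 2 bins.
W = [[1.0, -2.0, 0.5],
     [0.0, 3.0, 1.0]]


def toy_model(x):
    return [sum(w * x_d for w, x_d in zip(row, x)) for row in W]


x = [2.0, 1.0, 4.0]
k = 0  # bin of interest
contributions = gradient_times_input(lambda v: toy_model(v)[k], x)
gi_score = sum(contributions)
```

For this bias-free linear model the contributions sum to the output in bin k, which is a simple instance of the conservation property discussed above; non-linear models do not satisfy it exactly in general.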
Note that MassFormer takes a molecular graph as input, which is subsequently preprocessed into different types of embeddings (‘Input featurization’). We compute attribution scores only with respect to the element embeddings, as our experiments in the section ‘Explaining peak predictions with gradient-based attribution’ involve discriminating peaks by element composition. To visualize the GI attribution maps (that is, Fig. 4b), we estimate GI scores for each atom and then L2 normalize across atoms, to remove information about gradient magnitude and focus only on direction. We employ a slight variation of the aforementioned GI computation (equation (12)) for the linear projections: instead of summing over the embedding dimension to produce a single scalar score per atom, we perform PCA on the unreduced GI maps. Summing over the dimensions results in more interpretable scores, but might remove variation that could inform the PCA projection.
Our GI analysis is limited to spectra from the NIST-Scaffold test set that satisfy the following criteria:
- Contain 10 or more peaks
- Contain at least two hetero-atom peaks and two non-hetero-atom peaks
- Can be accurately predicted by MassFormer (≥0.8 cosine similarity)
- Can be accurately annotated by SIRIUS (≥0.8 CSI:FingerID tree score48,49)
We apply these criteria to remove instances where the peaks are trivially separable and focus analysis on examples where the model’s predictions are reliable, since inaccurate predictions can be difficult to interpret.
Datasets and training splits
We use NIST for both training and evaluation; it is a commercial dataset notable for its large coverage (over one million tandem spectra in total), standardized spectrum acquisition protocol and a high degree of manual validation. We also use the publicly available MoNA as a held-out evaluation set. MoNA is an important dataset for the MS community, as it is one of the largest freely accessible sources of small molecule spectra. The repository contains data from a variety of other public sources, including GNPS17, HMDB13 and ReSpect18. For simplicity, we only consider spectra from Orbitrap instruments that use higher-energy collisional dissociation1, as this corresponds to the largest subset of data in NIST. Furthermore, we restrict the dataset to only include positive mode spectra corresponding to one of six highly occurring precursor adducts ([M + H]+, [M + H − H2O]+, [M + H − 2H2O]+, [M + 2H]2+, [M + H − NH3]+ and [M + Na]+). The dataset statistics are summarized in Supplementary Table 1. After filtering, NIST is a much larger dataset than MoNA, both in terms of number of spectra and compounds. For this reason, we rely on NIST data to train the models and use MoNA exclusively for evaluation.
Two kinds of data splitting techniques are employed in the section ‘Predicting spectra for unseen compounds’. Both methods involve splitting spectra based on compound identity, to avoid leakage of spectra that differ only in metadata (such as collision energy or precursor type) but not in structure. The ‘InChIKey’ split uses non-stereochemical InChIKey strings85, which are hashed chemical string representations that are commonly used as molecular identifiers. This approach is, in essence, a simple random split based on compound identity. In contrast, the ‘Scaffold’ split uses Murcko Scaffolds41 to coarsely cluster compounds before splitting in a manner such that all the compounds (and associated spectra) from one cluster end up in the same partition. Scaffold splitting introduces a distributional shift between training and test data and is commonly used to evaluate deep learning models in small molecule applications86. A breakdown of molecule counts for each splitting technique is included in Supplementary Table 2.
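Both splits share the same mechanism: spectra are grouped by a compound-level key and whole groups are assigned to a partition. A minimal sketch follows, assuming the key (a non-stereochemical InChIKey or a Murcko scaffold string) has been computed beforehand; the record layout, the function name and the split fraction are hypothetical.

```python
import random
from collections import defaultdict


def group_split(records, key_fn, test_frac=0.2, seed=0):
    """Split records so that all records sharing a key land in one partition."""
    groups = defaultdict(list)
    for rec in records:
        groups[key_fn(rec)].append(rec)

    keys = sorted(groups)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(keys)
    n_test = max(1, round(test_frac * len(keys)))
    test_keys = set(keys[:n_test])

    train = [r for k in keys if k not in test_keys for r in groups[k]]
    test = [r for k in keys if k in test_keys for r in groups[k]]
    return train, test


# Toy spectra: several collision energies per compound (keys are placeholders).
records = [
    {"key": "AAA", "nce": 20}, {"key": "AAA", "nce": 40},
    {"key": "BBB", "nce": 20},
    {"key": "CCC", "nce": 20}, {"key": "CCC", "nce": 60},
    {"key": "DDD", "nce": 35},
    {"key": "EEE", "nce": 35},
]
train, test = group_split(records, key_fn=lambda r: r["key"], test_frac=0.4)
```

Passing a scaffold string as the key instead of an InChIKey yields a scaffold-style split, in which structurally related compounds stay together.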
Baseline models
We compare our method with two related deep learning models, based on existing approaches from the literature. The fingerprint model combines extended connectivity (Morgan) fingerprints65, MACCS64 and RDKit topological fingerprints (all implemented in the RDKit library42) and uses those directly as the chemical embedding. The WLN model uses a molecular graph representation in combination with a Weisfeiler–Lehman Network40, which is a particular kind of graph neural network, to produce the chemical embedding. These models are based on two previously published MS prediction models: ref. 27 and ref. 28 respectively. However, both models require re-implementation for direct comparison: the former27 is designed for a different type of MS data (electron ionization spectra), while the latter28 does not have publicly available code. We communicated with the authors of ref. 28 to ensure that our WLN model is similar to their best configuration. The parameter counts of each of the deep learning models are summarized in Supplementary Table 3.
In addition, we compare our model to CFM24,26, a well-known tool for small molecule spectrum prediction. CFM is not a deep learning approach: it uses combinatorial fragmentation to determine the set of possible molecule fragments, then fits a probabilistic model to predict the likelihood of the fragments. CFM is designed to predict quadrupole time-of-flight spectra at specific collision energies (10, 20 and 40 eV) and with a limited set of precursor adducts (for positive mode, only [M + H]+). It is not straightforward to retrain CFM on the NIST dataset, which contains Orbitrap spectra covering a wide range of precursor adducts and collision energies. Extending CFM to support these spectra would require non-trivial algorithmic modifications. Additionally, scaling CFM’s training procedure to a dataset as large as NIST (three to four times more training molecules, depending on the split) would be computationally challenging. For these reasons, we use the most recent pretrained version of CFM26 for all experiments. This version was trained on a 3,885-molecule subset of the Metlin dataset22; see Supplementary Table 5 for information about the overlap of CFM’s training set with the other datasets. In our experiments with CFM, we map each NCE to the nearest CFM-supported absolute collision energy. To convert from normalized to absolute collision energy (ACE), we use equation (14), where p is the precursor, m(p) is the precursor mass and c(p) is the charge factor (1.0 for singly charged precursors, 0.9 for doubly charged precursors).
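The energy conversion and mapping can be sketched as follows. The sketch assumes the widely used Orbitrap convention ACE = NCE × m(p) × c(p) / 500; the reference value of 500 and the function names are assumptions made here for illustration, and ties between two supported energies are resolved towards the lower one.

```python
CFM_ENERGIES = (10.0, 20.0, 40.0)  # absolute collision energies supported by CFM (eV)


def charge_factor(charge):
    """Charge factor c(p): 1.0 for singly and 0.9 for doubly charged precursors."""
    factors = {1: 1.0, 2: 0.9}
    if charge not in factors:
        raise ValueError("only charges 1 and 2 are covered in this sketch")
    return factors[charge]


def nce_to_ace(nce, precursor_mass, charge=1, reference_mass=500.0):
    """Convert normalized to absolute collision energy (assumed convention)."""
    return nce * precursor_mass * charge_factor(charge) / reference_mass


def nearest_cfm_energy(ace):
    """Map an absolute collision energy to the closest CFM-supported value."""
    return min(CFM_ENERGIES, key=lambda e: abs(e - ace))


ace_small = nce_to_ace(nce=30.0, precursor_mass=400.0, charge=1)   # 24 eV
ace_large = nce_to_ace(nce=35.0, precursor_mass=600.0, charge=1)   # 42 eV
ace_double = nce_to_ace(nce=20.0, precursor_mass=500.0, charge=2)  # 18 eV
```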
CFM also contains a rule-based module for lipid spectrum prediction25. We employ this module to make predictions instead of the probabilistic approach wherever applicable.
Spectrum identification task setup
The CASMI 2016 (ref. 51) query set contains Orbitrap spectra for 188 unique compounds, with 124 remaining after filtering for supported precursor adducts ([M + H]+) and removing charged compounds. Each spectrum is merged over three NCEs: 20, 35 and 50. Additionally, each query has an associated candidate list assembled by searching the ChemSpider database87 for compounds with similar mass to the precursor (see ref. 51 for full details). After preprocessing, there are on average 1,251 candidate compounds per spectrum.
The CASMI 2022 (ref. 52) query set contains Orbitrap spectra for 500 unique compounds, with 228 remaining after filtering for supported precursor adducts ([M + H]+ or [M + Na]+) and removing compounds with unsupported elements (Supplementary Table 4). The query spectra are extracted from a raw MSP file dump, based on the approach used in ref. 59. Subsequently, each spectrum is merged over three NCEs: 35, 45 and 65. Unlike CASMI 2016, candidate lists are not provided by contest organizers.
To construct the NIST20 Outlier query set, we select compounds from the NIST dataset with unique Murcko Scaffolds. We sample 300 such compounds in total, stratified by molecular weight: 75 with weight <200 Da, 75 with weight in [200, 300) Da, 75 with weight in [300, 400) Da and 75 with weight ≥400 Da. The motivation is to select a diverse group of outlier compounds to use as queries in the identification task. For each compound, all [M + H]+ spectra (of any collision energy) from NIST are merged.
The candidate sets for NIST20 Outlier and CASMI 2022 are selected by sampling compounds from PubChem with molecular mass within a certain tolerance of the true query mass (0.5 ppm for NIST20 Outlier, 10 ppm for CASMI 2022), up to a maximum of 10,000 per query. We filter the candidate lists to remove unsupported elements and multimolecular compounds. Stereochemical information is also removed, with stereoisomeric candidates being deduplicated. After preprocessing, there are on average 2,201 candidates per spectrum for NIST20 Outlier and 4,849 per spectrum for CASMI 2022. While it would be possible to improve scores of all methods with a more refined candidate structure selection26,49, our approach reflects the scenario where there is little a priori information about the query compound other than its approximate mass and precursor adduct.
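The mass-tolerance step of candidate selection can be sketched as below. The in-memory list of (identifier, mass) pairs stands in for a PubChem query, and the function names and toy masses are hypothetical; element filtering and stereoisomer deduplication are omitted.

```python
import random


def ppm_error(candidate_mass, query_mass):
    """Absolute mass difference in parts per million of the query mass."""
    return abs(candidate_mass - query_mass) / query_mass * 1e6


def select_candidates(query_mass, database, tol_ppm, max_candidates=10000, seed=0):
    """Keep candidates within the ppm tolerance, subsampling if there are too many."""
    hits = [(cid, m) for cid, m in database if ppm_error(m, query_mass) <= tol_ppm]
    if len(hits) > max_candidates:
        hits = random.Random(seed).sample(hits, max_candidates)
    return hits


database = [
    ("cand_a", 300.0000),
    ("cand_b", 300.0020),  # about 6.7 ppm from the query
    ("cand_c", 300.0040),  # about 13.3 ppm from the query
    ("cand_d", 299.9975),  # about 8.3 ppm from the query
]
selected = select_candidates(300.0, database, tol_ppm=10.0)
capped = select_candidates(300.0, database, tol_ppm=10.0, max_candidates=2)
```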
The models are scored using an array of different unnormalized and normalized ranking metrics (Tables 1 and 2 and Extended Data Fig. 4) inspired by those typically employed in CASMI contests51,52. Rank corresponds to the average rank of the true candidate compound, with 1 being the best score. Its normalized counterpart, normalized rank, corresponds to the average rank of the true candidate expressed as a fraction of the total number of candidates, with 0 being the best score and 1 being the worst. Top-k accuracy represents the frequency with which the true candidate is ranked in the top k candidates and ranges from 0 to 1. Top-k% accuracy is the normalized equivalent, measuring how often the correct candidate appeared in the top k% of candidates. In contrast to the average rank metrics, top-k and top-k% metrics do not strongly penalize ranking the correct candidate extremely poorly (any rank outside of the top k or k% is equally bad). Orthogonally, the normalized metrics are less sensitive to differences caused by variation in the number of candidates per query, which can be informative.
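The four metrics can be computed from per-query similarity scores as in the sketch below. Several conventions are assumed for illustration: higher scores are better, ties are resolved in favour of the true candidate, normalized rank is (rank − 1)/(n − 1), and the top-k% cut-off is rounded up. The function names are hypothetical.

```python
import math


def true_rank(scores, true_index):
    """1-based rank of the true candidate (ties resolved in its favour)."""
    true_score = scores[true_index]
    better = sum(1 for j, v in enumerate(scores) if j != true_index and v > true_score)
    return 1 + better


def ranking_metrics(queries, k=1, k_percent=50.0):
    """Average rank-based metrics over (scores, true_index) pairs."""
    ranks, norm_ranks, top_k, top_k_pct = [], [], [], []
    for scores, true_index in queries:
        n = len(scores)
        rank = true_rank(scores, true_index)
        ranks.append(rank)
        norm_ranks.append((rank - 1) / (n - 1) if n > 1 else 0.0)
        top_k.append(rank <= k)
        top_k_pct.append(rank <= math.ceil(n * k_percent / 100.0))
    m = len(queries)
    return {
        "rank": sum(ranks) / m,
        "normalized_rank": sum(norm_ranks) / m,
        "top_k": sum(top_k) / m,
        "top_k_percent": sum(top_k_pct) / m,
    }


queries = [
    ([0.9, 0.5, 0.7, 0.1], 2),  # true candidate ranked 2nd of 4
    ([0.2, 0.8], 1),            # true candidate ranked 1st of 2
]
metrics = ranking_metrics(queries, k=1, k_percent=50.0)
```

The second query has half as many candidates as the first, which is why the normalized metrics and the unnormalized metrics weigh the two queries differently.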
Following ref. 29, we investigate the effects of candidate-query similarity on retrieval performance. To estimate chemical similarity between molecules, we compute Tanimoto similarity of their MACCS fingerprints. Then, we select the 20% of candidates most and least similar to the query and record ranking metrics on both candidate subsets (Table 2).
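The similarity-based candidate subsets can be sketched as follows, assuming each fingerprint is represented as the set of its on-bit indices (computing MACCS keys themselves requires a cheminformatics toolkit and is omitted). The function names and toy fingerprints are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def similarity_subsets(query_fp, candidate_fps, fraction=0.2):
    """Return the ids of the most and least similar candidates (top and bottom fraction)."""
    ranked = sorted(
        candidate_fps,
        key=lambda cid: tanimoto(query_fp, candidate_fps[cid]),
        reverse=True,
    )
    n_keep = max(1, round(fraction * len(ranked)))
    return ranked[:n_keep], ranked[-n_keep:]


query_fp = {1, 2, 3, 4}
candidate_fps = {
    "c1": {1, 2, 3, 4},  # similarity 1.0
    "c2": {1, 2, 3},     # similarity 0.75
    "c3": {1, 2},        # similarity 0.5
    "c4": {1, 9},        # similarity 0.2
    "c5": {8, 9},        # similarity 0.0
}
most_similar, least_similar = similarity_subsets(query_fp, candidate_fps, fraction=0.2)
```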
Our ranking tasks bear some resemblance to the evaluation in ref. 29, but there are a number of key differences. The authors of ref. 29 use a large set of query spectra, but for each query, they subsample a small number of candidates from the original PubChem set (either randomly, or intentionally choosing the top M candidates most and least similar to the query, where M ≤ 1,000). Our evaluations use a smaller set of query compounds, but sample a large number of candidates for each query (up to 10,000). Their approach also assumes knowledge of the query’s molecular formula, further constraining candidate set composition. In contrast, we only assume knowledge of the precursor m/z (common in real-world MS identification problems) and precursor adduct (consistent with CASMI 2016, but uncommon in real-world problems). Although the task in ref. 29 uses unmerged query spectra, we apply collision energy merging (‘Loss and similarity calculations’) to remain consistent with CASMI 2016 and 2022. Our approach increases the information content of the query spectrum, avoiding situations where the query is not specific enough to identify a compound (that is, when it contains only one or two high-intensity peaks). However, the unmerged approach is also valuable, since such uninformative spectra sometimes appear in real-world untargeted MS experiments (which may only cover a single collision energy), and evaluating model behaviour in these settings could be useful.
Implementation details
All models are implemented in PyTorch88. Our model, MassFormer, uses a modified version of the Graphormer v2 implementation34, which relies on PyTorch Geometric89. The WLN model is implemented using Deep Graph Library90,91. We also adapt some code from ref. 28 for spectrum preprocessing. Before benchmarking the FP and WLN baseline models, we ran a hyperparameter sweep using Weights & Biases92, with a budget of 100 initializations, to find the best-performing configuration on the NIST validation set. The hyperparameters that we optimize are learning rate, weight decay, dropout, minibatch size and architecture-specific parameters (such as hidden dimension and number of layers). For full model details and hyperparameter configurations, please refer to the code repository (‘Code availability’).
Data availability
All public data from the study have been uploaded to Zenodo at https://doi.org/10.5281/zenodo.8399738 (ref. 93). Some data that support the findings of this study are available from the National Institute of Standards and Technology (NIST). However, its access is subject to restrictions, requiring the purchase of an appropriate license or special permission from NIST.
Code availability
The code used in this study is open-source (BSD-2-Clause license) and can be found in a GitHub repository (https://github.com/Roestlab/massformer/)94 with a DOI of https://doi.org/10.5281/zenodo.10558852 (ref. 95).
References
Gross, J. H. Mass Spectrometry—A Textbook (Springer, 2011); https://doi.org/10.1007/978-3-319-54398-7
Niessen, W. M. A. & Falck, D. in Analyzing Biomolecular Interactions by Mass Spectrometry Ch. 1 (eds Kool, J. & Niessen, W. M. A.) (Wiley, 2015); https://doi.org/10.1002/9783527673391
Aebersold, R. & Mann, M. Mass-spectrometric exploration of proteome structure and function. Nature 537, 347–355 (2016).
Gowda, G. A. N. & Djukovic, D. Overview of mass spectrometry-based metabolomics: opportunities and challenges. Methods Mol. Biol. 1198, 3–12 (2014).
De Vijlder, T. & Cuyckens, F. A tutorial in small molecule identification via electrospray ionization-mass spectrometry: the practical art of structural elucidation. Mass Spectrom. Rev. 37, 607–629 (2018).
Peters, F. T. Recent advances of liquid chromatography-(tandem) mass spectrometry in clinical and forensic toxicology. Clin. Biochem. 44, 54–65 (2011).
Van Bocxlaer, J. F. et al. Liquid chromatography-mass spectrometry in forensic toxicology. Mass Spectrom. Rev. 19, 165–214 (2000).
Lebedev, A. T. Environmental mass spectrometry. Annu. Rev. Anal. Chem. 6, 163–189 (2013).
Stein, S. E. & Scott, D. R. Optimization and testing of mass spectral library search algorithms for compound identification. J. Am. Soc. Mass Spectrom. 5, 859–866 (1994).
Li, Y. et al. Spectral entropy outperforms MS/MS dot product similarity for small-molecule compound identification. Nat. Methods 18, 1524–1531 (2021).
Majewski, S. et al. The Wasserstein distance as a dissimilarity measure for mass spectra with application to spectral deconvolution. In 18th International Workshop on Algorithms in Bioinformatics (eds Parida, L. & Ukkonen, E.) 25:1–25:21 (WABI, 2018); https://doi.org/10.4230/LIPICS.WABI.2018.25
Benton, H. P., Wong, D. M., Trauger, S. A. & Siuzdak, G. XCMS2: processing tandem mass spectrometry data for metabolite identification and structural characterization. Anal. Chem. 80, 6382–6389 (2008).
Wishart, D. S. et al. HMDB 4.0: the human metabolome database for 2018. Nucleic Acids Res. 46, 608–617 (2018).
Kim, S. et al. PubChem 2019 update: improved access to chemical data. Nucleic Acids Res. 47, 1102–1109 (2019).
Kanehisa, M., Furumichi, M., Sato, Y., Ishiguro-Watanabe, M. & Tanabe, M. KEGG: integrating viruses and cellular organisms. Nucleic Acids Res. 49, 545–551 (2021).
Horai, H. et al. MassBank: a public repository for sharing mass spectral data for life sciences. J. Mass Spectrom. 45, 703–714 (2010).
Wang, M. et al. Sharing and community curation of mass spectrometry data with Global Natural Products Social Molecular Networking. Nat. Biotechnol. 34, 828–837 (2016).
Sawada, Y. et al. RIKEN tandem mass spectral database (ReSpect) for phytochemicals: a plant-specific MS/MS-based data resource and database. Phytochemistry 82, 38–45 (2012).
MassBank of North America (MoNA, 2022); https://mona.fiehnlab.ucdavis.edu/
Stein, S. Mass spectral reference libraries: an ever-expanding resource for chemical identification. Anal. Chem. 84, 7274–7282 (2012).
Yang, X., Neta, P. & Stein, S. E. Quality control for building libraries from electrospray ionization tandem mass spectra. Anal. Chem. 86, 6393–6400 (2014).
Guijas, C. et al. METLIN: a technology platform for identifying knowns and unknowns. Anal. Chem. 90, 3156–3164 (2018).
Wiley Registry of Mass Spectral Data 2023 (Wiley, 2023); https://sciencesolutions.wiley.com/solutions/technique/gc-ms/wiley-registry-of-mass-spectral-data/
Allen, F., Greiner, R. & Wishart, D. Competitive fragmentation modeling of ESI-MS/MS spectra for putative metabolite identification. Metabolomics 11, 98–110 (2015).
Djoumbou-Feunang, Y. et al. CFM-ID 3.0: significantly improved ESI-MS/MS prediction and identification. Metabolites 9, 72 (2019).
Wang, F. et al. CFM-ID 4.0: more accurate ESI-MS/MS spectral prediction and compound identification. Anal. Chem. 93, 11692–11700 (2021); https://doi.org/10.1021/acs.analchem.1c01465
Wei, J. N., Belanger, D., Adams, R. P. & Sculley, D. Rapid prediction of electron-ionization mass spectrometry using neural networks. ACS Cent. Sci. 5, 700–708 (2019).
Zhu, H., Liu, L. & Hassoun, S. Using graph neural networks for mass spectrometry prediction. Preprint at https://arxiv.org/abs/2010.04661 (2020).
Li, X., Zhu, H., Liu, L.-p. & Hassoun, S. Ensemble spectral prediction (ESP) model for metabolite annotation. Preprint at https://arxiv.org/abs/2203.13783 (2022).
Zhang, B., Zhang, J., Xia, Y., Chen, P. & Wang, B. Prediction of electron ionization mass spectra based on graph convolutional networks. Int. J. Mass Spectrom. 475, 116817 (2022).
Xu, K., Hu, W., Leskovec, J. & Jegelka, S. How powerful are graph neural networks? In 7th International Conference on Learning Representations, ICLR 2019 (OpenReview.net, 2019); https://openreview.net/forum?id=B1gabhRcYX
Chen, D. et al. Measuring and relieving the over-smoothing problem for graph neural networks from the topological view. In Proc. AAAI Conference on Artificial Intelligence Vol. 34, 3438–3445 (AAAI Press, 2020); https://doi.org/10.1609/aaai.v34i04.5747
Liu, M., Gao, H. & Ji, S. Towards deeper graph neural networks. In Proc. 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining 338–348 (Association for Computing Machinery, 2020); https://doi.org/10.1145/3394486.3403076
Ying, C. et al. Do transformers really perform bad for graph representation? In Advances in Neural Information Processing Systems 34 (NeurIPS 2021) (eds Ranzato, M. et al.) 28877–28888 (Curran Associates, 2021).
Hong, Y. et al. 3DMolMS: prediction of tandem mass spectra from 3D molecular conformations. Bioinformatics 39, btad354 (2023); https://doi.org/10.1093/bioinformatics/btad354
Murphy, M. et al. Efficiently predicting high resolution mass spectra with graph neural networks. In Proc. 40th International Conference on Machine Learning (ICML 2023) Vol. 202 (eds Krause, A. et al.) 25549–25562 (PMLR, 2023).
Goldman, S., Bradshaw, J., Xin, J. & Coley, C. W. Prefix-tree decoding for predicting mass spectra from molecules. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023) (eds Oh, A. et al.) 48548–48572 (Curran Associates, 2023).
Zhu, R. L. & Jonas, E. Rapid approximate subset-based spectra prediction for electron ionization-mass spectrometry. Anal. Chem. 95, 2653–2663 (2023).
Goldman, S., Li, J. & Coley, C. W. Generating molecular fragmentation graphs with autoregressive neural networks. Anal. Chem. 96, 3419–3428 (2024).
Jin, W., Coley, C., Barzilay, R. & Jaakkola, T. Predicting organic reaction outcomes with Weisfeiler-Lehman network. In Advances in Neural Information Processing Systems 30 (eds Guyon, I. et al.) (Curran Associates, 2017).
Bemis, G. W. & Murcko, M. A. The properties of known drugs. 1. Molecular frameworks. J. Med. Chem. 39, 2887–2893 (1996).
Landrum, G. RDKit: open-source cheminformatics software. Zenodo https://doi.org/10.5281/zenodo.4973812 (2021).
Djoumbou Feunang, Y. et al. ClassyFire: automated chemical classification with a comprehensive, computable taxonomy. J. Cheminform. 8, 61 (2016).
Kind, T. et al. LipidBlast in silico tandem mass spectrometry database for lipid identification. Nat. Methods 10, 755–758 (2013).
Shrikumar, A., Greenside, P. & Kundaje, A. Learning important features through propagating activation differences. In Proc. 34th International Conference on Machine Learning (ICML 2017) Vol. 70 (eds Precup, D. & Teh, Y. W.) 3145–3153 (PMLR, 2017).
Ancona, M., Ceolini, E., Öztireli, C. & Gross, M. Towards better understanding of gradient-based attribution methods for deep neural networks. In 6th International Conference on Learning Representations, ICLR 2018 (OpenReview.net, 2018); https://openreview.net/forum?id=Sy21R9JAW
Ali, A. et al. XAI for transformers: better explanations through conservative propagation. In Proc. 39th International Conference on Machine Learning Vol. 162 (eds Chaudhuri, K. et al.) 435–451 (PMLR, 2022).
Dührkop, K., Shen, H., Meusel, M., Rousu, J. & Böcker, S. Searching molecular structure databases with tandem mass spectra using CSI:FingerID. Proc. Natl Acad. Sci. USA 112, 12580–12585 (2015).
Dührkop, K. et al. SIRIUS 4: a rapid tool for turning tandem mass spectra into metabolite structure information. Nat. Methods 16, 299–302 (2019).
Schymanski, E. L. & Neumann, S. CASMI: and the winner is. Metabolites 3, 412–439 (2013).
Schymanski, E. L. et al. Critical assessment of small molecule identification 2016: automated methods. J. Cheminform. 9, 22 (2017).
Revisiting CASMI. Fiehn Laboratory https://fiehnlab.ucdavis.edu/casmi (2022).
McCoy, R. T., Min, J. & Linzen, T. BERTs of a feather do not generalize together: large variability in generalization across models with similar test set performance. In Proc. 3rd BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (eds Alishahi, A. et al.) 217–227 (Association for Computational Linguistics, 2020).
Zhou, X., Nie, Y., Tan, H. & Bansal, M. The curse of performance instability in analysis datasets: consequences, source, and suggestions. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (eds Webber, B. et al.) 8215–8228 (Association for Computational Linguistics, 2020).
D’Amour, A. et al. Underspecification presents challenges for credibility in modern machine learning. J. Mach. Learn. Res. 23, 1–61 (2022).
Goldman, S. et al. Annotating metabolite mass spectra with domain-inspired chemical formula transformers. Nat. Mach. Intell. 5, 965–979 (2023).
Shrivastava, A. D. et al. MassGenie: a transformer-based deep learning method for identifying small molecules from their mass spectra. Biomolecules 11, 1793 (2021).
Stravs, M. A., Dührkop, K., Böcker, S. & Zamboni, N. MSNovelist: de novo structure generation from mass spectra. Nat. Methods 19, 865–870 (2022).
Butler, T. et al. MS2Mol: A transformer model for illuminating dark chemical space from mass spectra. Preprint at https://doi.org/10.26434/chemrxiv-2023-vsmpx-v2 (2023).
Jonas, E. Deep imitation learning for molecular inverse problems. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019) (eds Wallach, H. et al.) 4991–5001 (Curran Associates, 2019); https://proceedings.neurips.cc/paper_files/paper/2019/file/b0bef4c9a6e50d43880191492d4fc827-Paper.pdf
Shanthamoorthy, P., Young, A. & Röst, H. Analyzing assay specificity in metabolomics using unique ion signature simulations. Anal. Chem. 93, 11415–11423 (2021).
Hoffmann, M. A. et al. High-confidence structural annotation of metabolites absent from spectral libraries. Nat. Biotechnol. 40, 411–421 (2021); https://doi.org/10.1038/s41587-021-01045-9
Scheubert, K. et al. Significance estimation for large scale metabolomics annotations by spectral matching. Nat. Commun. 8, 1494 (2017).
Durant, J. L., Leland, B. A., Henry, D. R. & Nourse, J. G. Reoptimization of MDL keys for use in drug discovery. J. Chem. Inf. Comput. Sci. 42, 1273–1280 (2002).
Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742–754 (2010).
Zhou, G. et al. Uni-Mol: a universal 3D molecular representation learning framework. In The 11th International Conference on Learning Representations (OpenReview.net, 2022); https://openreview.net/forum?id=6K2RM6wVqKu
Vaswani, A. et al. Attention is all you need. In Advances in Neural Information Processing Systems 30 (NeurIPS 2017) (eds Guyon, I. et al.) (Curran Associates, 2017).
Tan, Z. et al. Neural machine translation: a review of methods, resources, and tools. AI Open 1, 5–21 (2020).
Janner, M., Li, Q. & Levine, S. Offline reinforcement learning as one big sequence modeling problem. In Advances in Neural Information Processing Systems 34 (NeurIPS 2021) (eds Ranzato, M. et al.) 1273–1286 (Curran Associates, 2021).
Dosovitskiy, A. et al. An image is worth 16x16 words: transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR 2021 (OpenReview.net, 2021); https://openreview.net/forum?id=YicbFdNTTy
Ahmadi, A. H. K., Hassani, K., Moradi, P., Lee, L. & Morris, Q. Memory-based graph networks. In 8th International Conference on Learning Representations, ICLR 2020 (OpenReview.net, 2020); https://openreview.net/forum?id=r1laNeBYPB
Mialon, G., Chen, D., Selosse, M. & Mairal, J. GraphiT: encoding graph structure in transformers. Preprint at https://arxiv.org/abs/2106.05667 (2021).
Maziarka, L. et al. Molecule attention transformer. Preprint at https://arxiv.org/abs/2002.08264 (2020).
Rong, Y. et al. Self-supervised graph transformer on large-scale molecular data. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020) (eds Larochelle, H. et al.) 12559–12571 (Curran Associates, 2020).
Hu, W. et al. Open graph benchmark: datasets for machine learning on graphs. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020) (eds Larochelle, H. et al.) 22118–22133 (Curran Associates, 2020).
Velickovic, P. et al. Graph attention networks. In 6th International Conference on Learning Representations, ICLR 2018 (OpenReview.net, 2018); https://openreview.net/forum?id=rJXMpikCZ
Hu, W. et al. Open graph benchmark: datasets for machine learning on graphs. In Proc. 34th International Conference on Neural Information Processing Systems (eds Larochelle, H. et al.) 1855 (Curran Associates, 2020).
Floyd, R. W. Algorithm 97: shortest path. Commun. ACM 5, 345 (1962).
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014).
Ba, J. L., Kiros, J. R. & Hinton, G. E. Layer normalization. Preprint at https://arxiv.org/abs/1607.06450 (2016).
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019) Vol. 1 (eds Burstein, J. et al.) 4171–4186 (Association for Computational Linguistics, 2019).
Nair, V. & Hinton, G. E. Rectified linear units improve restricted Boltzmann machines. In Proc. 27th International Conference on International Conference on Machine Learning (eds Fürnkranz, J. & Joachims, T.) 807–814 (Omnipress, 2010).
Hu, W. et al. OGB-LSC: a large-scale challenge for machine learning on graphs. In Proc. Neural Information Processing Systems Track on Datasets and Benchmarks (eds Vanschoren, J. & Yeung, S.) (Curran Associates, 2021).
Nakata, M. & Shimazaki, T. PubChemQC project: a large-scale first-principles electronic structure database for data-driven chemistry. J. Chem. Inf. Model. 57, 1300–1308 (2017).
Heller, S. R., McNaught, A., Pletnev, I., Stein, S. & Tchekhovskoi, D. InChI, the IUPAC international chemical identifier. J. Cheminformatics 7, 23 (2015).
Wu, Z. et al. MoleculeNet: a benchmark for molecular machine learning. Chem. Sci. 9, 513–530 (2018).
Pence, H. E. & Williams, A. ChemSpider: an online chemical information resource. J. Chem. Educ. 87, 1123–1124 (2010).
Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019) (eds Wallach, H. et al.) (Curran Associates, 2019).
Fey, M. & Lenssen, J. E. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds (OpenReview.net, 2019); https://rlgm.github.io/papers/2.pdf
Wang, M. et al. Deep Graph Library: a graph-centric, highly-performant package for graph neural networks. Preprint at https://arxiv.org/abs/1909.01315 (2020).
Li, M. et al. DGL-LifeSci: an open-source toolkit for deep learning on graphs in life science. ACS Omega 6, 27233–27238 (2021).
Biewald, L. Experiment tracking with Weights & Biases. Weights & Biases http://wandb.com (2020).
Young, A., Wang, B. & Röst, H. Public Data files for MassFormer. Zenodo https://doi.org/10.5281/zenodo.8399738 (2023).
Young, A. Roestlab/massformer. GitHub https://github.com/Roestlab/massformer/ (2024).
Young, A. Roestlab/massformer v0.4.0 Zenodo https://doi.org/10.5281/zenodo.10558852 (2024).
Welch, B. L. The generalization of ‘Student’s’ problem when several different population variances are involved. Biometrika 34, 28–35 (1947).
Šidák, Z. Rectangular confidence regions for the means of multivariate normal distributions. J. Am. Stat. Assoc. 62, 626–633 (1967).
Acknowledgements
Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through the Canadian Institute for Advanced Research (CIFAR), and companies sponsoring the Vector Institute. This research was also enabled in part by support provided by Compute Ontario (https://www.computeontario.ca/) and the Digital Research Alliance of Canada (alliancecan.ca). A.Y. is supported by a Natural Sciences and Engineering Research Council of Canada (NSERC) Postgraduate Scholarship (Doctoral Program) and a Vector Institute research grant. H.R. is supported by NSERC, the Canadian Institutes of Health Research (CIHR), the Canada Foundation for Innovation, the Canada Research Coordinating Committee (CRCC), the John R. Evans Leaders Fund and the Canada Research Chair Program. B.W. is supported by NSERC (grants: RGPIN-2020-06189 and DGECR-2020-00294), the Peter Munk Cardiac Centre AI Fund at the University Health Network and the CIFAR AI Chair Program. We thank B. Lieng, P. Shanthamoorthy, R. Montenegro-Burke and Q. Morris for helpful discussions. We thank C. Harrigan for feedback on the figures. We thank F. Wang for help with the CFM baseline experiments. We thank S. Ma, P. Fradkin, A. Toma and C. Wang for feedback on the manuscript.
Author information
Contributions
A.Y., H.R. and B.W. conceived the project. A.Y. wrote the computer code and ran the experiments. H.R. and B.W. supervised the work. A.Y., H.R. and B.W. wrote the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks Sebastian Böcker and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Extended data
Extended Data Fig. 1 Additional Spectrum Similarity Experiments.
A more detailed comparison of the deep learning models that does not involve filtering compounds based on overlap with CFM’s training set. Training set sizes (N) are indicated for each split. (a) Test set cosine similarity (1 Da bin resolution) when training and evaluating on [M+H]+ spectra. (b) Test set cosine similarity (1 Da bin resolution) when training and evaluating on all six supported precursor adducts. MassFormer demonstrates strong performance in both cases. Averages and standard deviations from 10 independently trained models are reported. Statistical significance is determined by one-sided Welch’s t-test with Šidák correction.
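The cosine similarity at 1 Da bin resolution referred to in this caption can be illustrated with a minimal sketch. The function names, the floor-based binning rule, the 1,000 Da upper limit and the example peak lists are all assumptions made for illustration; they are not taken from the MassFormer code.

```python
import math

def bin_spectrum(peaks, max_mz=1000.0, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins (1 Da by default).

    The floor-based bin index and the max_mz cut-off are assumptions.
    """
    n_bins = int(max_mz / bin_width)
    binned = [0.0] * n_bins
    for mz, intensity in peaks:
        idx = int(mz / bin_width)
        if 0 <= idx < n_bins:
            binned[idx] += intensity
    return binned

def cosine_similarity(a, b):
    """Cosine of the angle between two equally sized intensity vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty spectrum is treated as dissimilar to everything
    return dot / (norm_a * norm_b)

# Hypothetical (m/z, intensity) peak lists.
predicted = [(91.05, 100.0), (119.08, 40.0), (150.10, 10.0)]
measured = [(91.06, 90.0), (119.09, 55.0), (163.20, 5.0)]

score = cosine_similarity(bin_spectrum(predicted), bin_spectrum(measured))
self_score = cosine_similarity(bin_spectrum(measured), bin_spectrum(measured))
```

Because peaks at 91.05 and 91.06 fall into the same 1 Da bin, small m/z deviations do not reduce the score; only intensity differences and peaks in non-matching bins do.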
Extended Data Fig. 2 ClassyFire Similarity Experiments.
Cosine similarity (1 Da bin resolution) on spectra corresponding to the top 10 most frequent chemical classes from the NIST-Scaffold test set ([M+H]+ adducts only). The chemical classes, identified by ClassyFire, are sorted from most to least frequent on the x-axis and are not necessarily disjoint. The average performance for each model (across all compounds) is indicated by a black dashed line. (a) MassFormer scores, (b) CFM scores. MassFormer performs better than CFM in each category. Strikingly, MassFormer performs best on “lipids and lipid-like molecules”, which is one class that CFM seems to struggle with. Averages and standard deviations from 10 independently trained models are reported (except for CFM, which is pretrained).
Extended Data Fig. 3 Additional Hetero-atom Peak Separability Experiments.
Linear peak classification accuracy distributions for four hetero-atoms: (a) chlorine, (b) sulfur, (c) fluorine, (d) phosphorus. For each hetero-atom, the distribution of optimal linear classification accuracy induced by the hetero-atom labelling strategy is markedly different from the random labelling distribution (higher accuracy indicates improved separability of the peaks). Sample size and statistical significance (Welch’s t-test with Šidák correction) for separability differences are provided for each plot.
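For readers unfamiliar with the statistics named in these captions, the following sketch computes Welch's t statistic with its Welch–Satterthwaite degrees of freedom, and applies the Šidák adjustment to a set of p-values. The function names and the sample values are hypothetical, and converting the statistic to a p-value (which requires the t-distribution CDF) is deliberately left out.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each sample mean (unbiased sample variance / n).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df

def sidak_adjust(p_values):
    """Sidak-adjusted p-values for m comparisons: 1 - (1 - p)^m."""
    m = len(p_values)
    return [1.0 - (1.0 - p) ** m for p in p_values]

# Hypothetical accuracy samples for two labelling strategies.
hetero_atom = [0.80, 0.82, 0.81, 0.83]
random_labels = [0.70, 0.75, 0.72, 0.71]
t_stat, dof = welch_t(hetero_atom, random_labels)

# Hypothetical raw p-values for four comparisons.
adjusted = sidak_adjust([0.01, 0.02, 0.03, 0.04])
```

Unlike Student's t-test, the Welch variant does not assume equal variances in the two groups, which is why the degrees of freedom are estimated from the data rather than fixed at n_a + n_b − 2.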
Extended Data Fig. 4 Spectrum Identification Rank Distributions.
MassFormer ranks candidate structures more accurately than competing approaches. (a) Distributions of the matching candidate’s predicted rank for CASMI 2016, CASMI 2022, and NIST20 Outlier queries. (b) Corresponding distributions of the matching candidate’s normalized rank. Note that for both metrics, a lower score is better. MassFormer’s rank and normalized rank distributions are more strongly skewed towards lower values. Boxplot lines represent median and interquartile range, whiskers represent 1.5 times the interquartile range, and the “X” symbol represents the mean.
Extended Data Fig. 5 Spectrum Identification Candidate Set Statistics.
Different spectrum identification tasks vary in terms of the diversity and size of their candidate sets. (a,c,e) Distribution of Tanimoto similarities between candidate molecules and their corresponding queries for (a) CASMI 2016, (c) CASMI 2022, (e) NIST20 Outlier. (b,d,f) Distribution of the number of candidates per query for (b) CASMI 2016, (d) CASMI 2022, (f) NIST20 Outlier. The CASMI 2022 and NIST20 Outlier candidate sets are sampled from PubChem using the query’s molecular weight. The NIST20 Outlier dataset uses a smaller weight tolerance (0.5 ppm) than CASMI 2022 (10 ppm), resulting in fewer candidates with higher chemical similarity to the query.
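The effect of the weight tolerance on candidate set size, and the Tanimoto similarity used to compare candidates with queries, can be sketched as follows. This is an illustrative example only: the function names, the masses and the fingerprint bit sets are hypothetical, and the fingerprint type used in the study is not assumed here.

```python
def within_ppm(candidate_mass, query_mass, tol_ppm):
    """True if the candidate mass lies within tol_ppm parts per million of the query mass."""
    return abs(candidate_mass - query_mass) / query_mass * 1e6 <= tol_ppm

def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as collections of 'on' bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)

# Hypothetical query and candidate masses (Da).
query_mass = 300.0
candidate_masses = [300.0001, 300.002, 300.01]

narrow = [m for m in candidate_masses if within_ppm(m, query_mass, 0.5)]
wide = [m for m in candidate_masses if within_ppm(m, query_mass, 10.0)]

# Hypothetical fingerprints sharing two of six distinct bits.
similarity = tanimoto({1, 2, 3, 4}, {3, 4, 5, 6})
```

With these example values the 0.5 ppm window admits one candidate and the 10 ppm window admits two, mirroring the caption's observation that a tighter tolerance yields smaller candidate sets.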
Extended Data Fig. 6 Additional Prediction Examples.
Twelve spectrum predictions of varying accuracy, roughly covering a range of 0.4 to 1.0 cosine similarity. All spectra are merged over multiple collision energies. The predictions are described in terms of InChIKey-14, precursor adduct, and cosine similarity with ground truth. (a) AWXJBCZMGZDXCG, [M+H]+, 0.46 (b) GLFJFDAJNJYPGW, [M+H]+, 0.47 (c) HIEYVTSQMLHJEZ, [M+H]+, 0.51 (d) JNKVBUQSDAHKDQ, [M+H-H2O]+, 0.59 (e) WDVCZSSWRMVHAU, [M+H-H2O]+, 0.65 (f) YCTAOQGPWNTYJE, [M+H]+, 0.66 (g) CILGSELJQXSDBE, [M+H]+, 0.72 (h) XCDOHVHQWSFAEN, [M+H-2H2O]+, 0.79 (i) ISNRVVKKHPECQN, [M+H-H2O]+, 0.80 (j) XBGGUPMXALFZOT, [M+H]+, 0.86 (k) DTLKTHCXEMHTIQ, [M+H-2H2O]+, 0.91 (l) BLJBQVQHDXUDTE, [M+H]+, 0.98.
Supplementary information
Supplementary Information
Supplementary text and Tables 1–9.
Supplementary Table 10
The complete set of average similarity scores, across all data splits, for each of the four models (CFM, FP, WLN, MF). There are 56 different methods of similarity calculation. Each variant is defined by a particular intensity transformation (no transform, log transform, square root transform, precursor peak removal), similarity function (cosine, Jensen–Shannon, Jaccard), collision energy merging strategy (merging, no merging) and score aggregation method (spectrum averaging, molecule averaging). Note that intensity transformations are not used in combination with the Jaccard similarity function, which assumes binary intensities.
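One reading of this description that reproduces the stated count of 56 is sketched below. It assumes that precursor peak removal can be switched on or off independently of the intensity scaling (none, log, square root), and that Jaccard similarity is combined only with the unscaled intensities. This interpretation and all option names are assumptions; the table itself remains the authoritative list of variants.

```python
from itertools import product

# Assumed option sets (names are illustrative, not taken from the table).
scalings = ["none", "log", "sqrt"]
precursor_removal = [False, True]
similarities = ["cosine", "jensen_shannon", "jaccard"]
merging = ["merged", "unmerged"]
aggregation = ["spectrum_average", "molecule_average"]

variants = []
for similarity in similarities:
    # Jaccard assumes binary intensities, so intensity scaling is skipped for it.
    allowed_scalings = ["none"] if similarity == "jaccard" else scalings
    for scaling, remove, merge, agg in product(
        allowed_scalings, precursor_removal, merging, aggregation
    ):
        variants.append((similarity, scaling, remove, merge, agg))

n_variants = len(variants)
```

Under these assumptions, cosine and Jensen–Shannon each contribute 3 × 2 × 2 × 2 = 24 variants and Jaccard contributes 1 × 2 × 2 × 2 = 8, giving 56 in total.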
About this article
Cite this article
Young, A., Röst, H. & Wang, B. Tandem mass spectrum prediction for small molecules using graph transformers. Nat Mach Intell 6, 404–416 (2024). https://doi.org/10.1038/s42256-024-00816-8
DOI: https://doi.org/10.1038/s42256-024-00816-8