Introduction

The discovery of new organic reactions has always been in the focus of synthetic organic chemistry. Each new reaction enriches the arsenal of synthetic tools and opens new horizons in the development and optimization of new drugs and materials. Such reactions are often given the names of their discoverers, which is the highest recognition of their contribution to organic chemistry. Most of the new reactions have been discovered by plain luck, and it has been up to the chemists to notice the discovery and apply their “chemical intuition” to study it in detail1. The beginning of a systematic approach to the search for new reactions was laid in 1967 by Balaban, who applied the graph theory for systematical enumeration of pericyclic reactions proceeding through a 6-membered transition state2. In the 1970s, these studies were significantly expanded by Hendrickson3, Arens4,5,6, Zefirov, and Tratch7,8 who considered various formal schemes describing bonds redistribution for different types of pericyclic reactions. Another approach implemented in the IGOR1,9 and IGOR210 programs concerned the algebraic model of constitutional chemistry developed by Dugundji and Ugi11. This approach supports the hierarchical representation of organic reactions and deals explicitly with heteroatoms and charges, keeps track of rings in molecules10. Its application led to the discovery of previously unknown reactions: the thermal decomposition of α-formyl-oxy ketones1,9, and the formation of a cage molecule from N-methoxycarbonyl homopyrrole and tropone10. Then, an alternative method based on the generation of the complete sets of non-isomorphic spanning subgraphs of a given graph was suggested. With the help of this approach, new carbene reaction12 and two new elimination reactions leading to the formation of synthetically important dienes13 were discovered. The formal-logical approach to organic reactions7 implemented in the SYMBEQ14 and ARGENT15,16 software was used to discover substituted furans14.

Despite great expectations, no significant progress in computer-aided reaction design was achieved; approaches, algorithms, and software tools reported so far have not found any widespread popularity among organic chemists. The work with those tools required both extensive knowledge in synthetic organic chemistry and a well-developed intuition to turn abstract schemes of bonds redistribution into specific chemical reactions with particular reagents, catalysts, and experimental conditions. This explains why all reactions computationally discovered so far were relatively simple (mainly thermal pericyclic reactions). We believe that real progress in the discovery of new chemical reactions can be achieved by deep learning from big data17. Recently, Segler et al. reported a chemical synthesis planning system based on deep neural networks and symbolic AI trained on a big collection of known synthetic reactions18. This tool, however, implements automatic extraction of transformation rules (“templates”) from known chemical reactions and therefore, in principle, cannot “suggest” not yet seen transformations. Several template-free techniques based on recurrent neural networks and transformers were successfully implemented. They operate in sequence-to-sequence translation mode19, in which SMILES of products were directly predicted from SMILES of reactants20,21 and vice versa22,23,24. Interesting chemistry knowledge driven approaches aiming to predict organic reactions outcomes from given reactants were proposed by Coley et al.25 and in Baldi’s group26, 27. Although, discovery of new chemical transformations cannot be excluded, this is not an objective of such type of calculations. To our knowledge, no new types of chemical reactions resulted from the “reactants-to-products” models were reported in the literature so far.

Generative models based on recurrent deep neural networks were successfully used to generate novel chemical structures28,29,30,31,32,33,34,35,36,37. Recently, we have demonstrated that the structures of molecules possessing desirable properties could be generated using a combination of autoencoder with Generative Topographic Map built on the latent vectors26. In order to apply this approach to chemical reactions, they must be encoded by SMILES strings. However, conventional reaction SMILES can hardly be used because: (i) they are much longer, and (ii) atom-to-atom mapping (AAM) needed for reaction center identification, adds a further layer of complexity. The autoencoder would have to learn not only semantics and syntax of SMILES but also the AAM rules.

Earlier, we showed that in silico chemical reaction handling can be significantly simplified by the Condensed Graph of Reaction (CGR) approach38, in which the structures of reactants and products are merged into a single r graph (Fig. 1). The CGR edges correspond either to standard chemical bonds or to “dynamic” bonds describing transformations. In such a way, one can consider a CGR as a pseudomolecule for which some types of molecular descriptors can easily be computed followed by their application in data analysis and statistical modeling tasks39. Thus, this approach was successfully applied to similarity searching in reaction databases38,40, building quantitative structure–reactivity models41,42,43,44, assessment of tautomer distributions45,46, prediction of activity cliffs47, classification of enzymatic transformations48, prediction of reaction conditions49,50, etc. Here, for the first time, we introduce dedicated SMILES strings encoding CGRs (SMILES/CGR), see their detailed description in Supporting Information. Moreover, the CGR (and, hence, SMILES/CGR) contains information about the reaction center and its close neighborhood51.

Figure 1
figure 1

An example of Suzuki coupling reaction (top) and its condensed graph (CGR, bottom). Reaction SMILES and SMILES/CGR are given underneath. The reaction SMILES features reactants (in orange), and products (in purple). Atom-to-atom mapping is not provided. In the SMILES/CGR broken single bonds are encoded as [- > .] (in red), while the created C–C bond is [. > -] (in green). The colon (:) represents aromatic bonds. See Supporting Information for the details.

Basically, CGRs are nothing but “molecules” with “exotic” bond orders for the changing bonds—thus, let us “teach” de novo molecular design tools on how to generate new reactions! Following a workflow recently used for the generation of novel molecular structures potentially possessing desirable biological activity30, we have chosen to focus here on the generation of “Suzuki-like” putative chemical transformations. The Suzuki coupling reaction was chosen because this reaction is widely used in organic synthesis, and, therefore, its new variants implying different leaving groups and reaction centers could be of interest for synthetic chemists. From the technical point of view, Suzuki reactions constitute a sizeable part of the USPTO database which assures satisfactory knowledge extraction upon the model training. Reaction center of Suzuki reaction can be represented by a SMILES string BC.QL >  > B.CQ.L (where Q = C, N, O, S, Si and L is a leaving group). In our simulation we expect that AI may suggest realistic and unseen combinations of Q and L.

A sequence-to-sequence neural network with Bidirectional Long Short-Term Memory52 layers trained on SMILES/CGR achieved the ability to convert SMILES/CGR to their latent vectors (“encode”) and back (“decode”). Generative Topographic Mapping (GTM) was used to visualize the latent space in 2D and to detect a cluster mostly populated with Suzuki reactions (Fig. 2). Then, virtual chemical reactions were generated by sampling the targeted zone followed by the decoding of associated latent vectors to SMILES/CGR. Notice that visualization is not strictly required for clusters identification, but may significantly help to choose a cluster from which the sampling is performed.

Figure 2
figure 2

Modeling workflow for generation of new reactions consists of five main steps: (1) training sequence-to-sequence autoencoder on the USPTO database of chemical reactions; (2) building of Generative Topographic Map (GTM) using the autoencoder latent variables and preparation of GTM class landscape; (3) selecting on GTM a zone populated to Suzuki coupling reactions and identification of related autoencoder latent vectors; (4) sampling from the autoencoder latent space and generation of new reactions; and, (5) post-processing step. On the Generative Topographic Map, larger transparency levels correspond to lower density. The color code renders the (binary: Suzuki vs Other) reaction class distribution. Thus, zones in dark blue are exclusively populated by Suzuki reactions, zones in dark red are exclusively populated by other types of reactions; while intermediate colors correspond to reaction space areas hosting both categories, in various ratios. The red circle indicates the zone from which virtual Suzuki reactions were sampled.

Results

Reaction sampling from generative topographic map

A set of 2 424 306 reactions, extracted and curated from the USPTO database53, was rendered as CGRs and then as SMILES/CGR strings used to feed the autoencoder. The latter was trained on some 2 million reactions and validated on 450 thousand reactions. The reconstruction rate (a ratio of correctly reconstructed SMILES/CGR) was 98.4% and 97.8% at the training and validation stage, respectively. This is slightly less than reconstruction rates of plain molecular SMILES by state-of-the-art encoders/decoders, but it can be explained by larger complexity and length of SMILES/CGR and an additional source of error: the errors of atom-to-atom mapping in some entries. SMILES/CGR is intrinsically more difficult to learn, with dynamical bonds, dynamical atoms and formal coordination numbers exceeding atomic valency representing novel degrees of freedom in the syntax. Unbalanced or erroneous entries may pass the standardization protocols and thus negatively impact generated SMILES/CGR quality.Nevertheless, reconstruction rates are robust and although LSTM has relatively short memory compared to some other neural networks architectures like transformers and, therefore, may fail to learn relatively complex structural motifs, the bidirectional LSTM used in our work seems to perform acceptably well.

The latent vectors for 100 000 randomly selected reactions were used to construct a Generative Topographic Map (GTM) using in-house software54. Then the entire USPTO database was projected onto the map, on which several zones predominantly populated by Suzuki reactions were identified, as shown in Fig. 2.

Random latent vectors were sampled from one of these zones with the highest relative population of Suzuki reactions. As expected, the sampling procedure led to virtual transformations of a similar type. Finally, 10,000 text strings have been generated, followed by their analysis using a complex post-processing protocol (Fig. 2). At the structures verification and standardization stage, the CGRtools.v3 tool was used to discard invalid SMILES/CGR and to perform valence and aromaticity check. This reduced the dataset to 1099 reactions (some 11% of generated text strings) in which structures of reactants and products were correct. This value is similar to that (15–20%) observed for the SMILES strings in our previous studies devoted to generation of individual molecules. Clearly, not every latent space vector corresponds to a valid structure. However, since invalid SMILES/CGR can be discarded algorithmically, they are not a liability but a manageable consequence of exploratory sampling.

Also, the USPTO reactions are unevenly distributed in terms of types. Deep learning typically focuses on dense clusters allowing it to extensively capture their associated syntactic rules. The realibility of different combinations of leaving group L and a coupling partner Q in the reaction center BC.QL >  > B.CQ.L suggested by AI depends not only on the training set size but also on its diversity, i.e., on the presence in the training set related examples. The USPTO dataset contains very few reactions with Q = O, S and Si which may explain the relatively high rate of invalid SMILES/CGR strings.

Reaction novelty analysis

The main interest of in silico reaction generation is the proposal of novel reactions that a human mind would not spontaneously think of. However, unlike individual compounds, where novelties can be identified as unique scaffolds or particular structural motifs30, the definition of reaction novelty was not discussed in the literature. The most descriptive part is the reaction center (RC)55, i.e. atoms and bonds directly involved in the transformation. Thus, we consider two levels of reaction novelty: (i) the reaction center is unknown (not present in the training set); (ii) reaction center is known, but its closest neighborhood (1st atoms and bonds near the RC, RC + 1) is new. The latter can be extended to a more distant neighborhood (n atoms and bonds away, RC + n), but in this work, we only focus on the reaction center and the closest neighbors. To decide whether a reaction is novel, these substructural reaction motifs are encoded by a hashing function as reaction signatures and are compared to all signatures extracted from the initial dataset.

Among 1099 reactions selected using the post-processing workflow (Fig. 2), 436 contain new reaction center RC and 30 reactions are novel at first neighborhood level RC + 1. Some generated reactions have two or more distinct reaction centers, i.e. represent multistep transformations. Note that “novelty” defined as the absence of reaction center from the training set data is per se meaningful, as an illustration of the “creativity” of this Artificial Intelligence, i.e. its ability to generate original reaction centers which can be submitted for empirical feasibility assessment to human experts. Unfortunately, “novelty” as the absence of reaction center from both the training set and public reaction databases is not easy to interpret, for it may both mean that (a) such reactions were tried, but failed and thus were not published or (b) reactions were never explored, thus represent a real asset of innovation. The choice not to publish failed reactions is a major drawback in training reactivity models44.

Reactions curation and generalisation

A close look at the generated reactions reveals several serious drawbacks: (i) unbalanced reaction equations, (ii) presence of likely unstable groups (e.g., R3S(=O)H and R–PH(=O)–OR’), and, (iii) transformations which require harsh reaction conditions (e.g., breaking of a C–C bond), or kinetically unfavorable reactions (e.g., cleavage of a leaving group with carbon at attachment point). Some reactions can be corrected or discarded using some heuristic rules (“Chemical Filters”).

Output of unbalanced reactions is a direct consequence of the training set composition: almost all USPTO reactions are also unbalanced, e.g., leaving groups are almost never reflected in the reaction equation present in the database. The application of the CGR technology may implicitly solve that problem. Indeed, within the CGR formalism,heavy atoms in reactants and products are implicitly conserved—as the same graph is simply interpreted differently in terms of dynamical bond status in order to convert it to reagents or to products, respectively. Even if the initial reaction was not stoichiometrically balanced (see example in SI), its CGR representation will be—in so far the conversion of an unbalanced transformation to CGR succeeds to produce the correct CGR of the balanced process. However, as the exact state of the leaving group cannot be deduced from the training set, in silico generated CGRs may occasionally decode into reactions by simply substituting a broken bond by a hydrogen atom, leading to a disbalance in terms of implicit hydrogens.This is seen in the example from Fig. 3A in which the products contain 2 hydrogens more than reactants. Furthermore, the postulated product BH(OH)2 is highly reactive, thus unrealistic. Formally, this is a rather “creative” in silico interpretation of the Suzuki reaction pattern, in which the organic halide R–X is replaced by an amide group: the acyl fragment is assimilated to “R” while the benzylamine is the leaving group X. Formally, a balanced Suzuki process could be formulated as either

$$R - X \, + \, R^{\prime} - B{\left( {OH} \right)_2} \to R - R^{\prime} + \, X - B{\left( {OH} \right)_2}$$
(1)

or, more realistically, with inclusion of the required alkoxy base, typically BuO-:

$$R - X \, + \, R^{\prime} - B{\left( {OH} \right)_2} + \, Bu{O^- } \to R - R^{\prime} + \, BuO - B{\left( {OH} \right)_2} + \, {X^- }$$
(2)
Figure 3
figure 3

Example of generated chemical reaction with a new reaction center as is (A), balanced by the addition of a water molecule as a reactant (B), and its simplified form (C). Notice that the aminobenzylic leaving group suggested by the autoencoder for generated reaction looks unrealistic.

Unfortunately, a sketchily written Suzuki transformation, carelessly ignoring the inorganic leaving groups:

$$R - X \, + \, R^{\prime} - B{\left( {OH} \right)_2}\; \to R - R^{\prime}$$
(3)

converts to a CGR corresponding to

$$R - X \, + \, R^{\prime} - B{\left( {OH} \right)_2} \to R - R^{\prime} + \, X \, + \, B{\left( {OH} \right)_2}$$
(4)

in which the unsatisfied valences of X and B are interpreted by chemoinformatics tools as implicit hydrogens. This explains why the AI tool is inclined to generate formal reactions of type (4), which are nothing but a biased interpretation of Suzuki processes, corrupted by intrisic representation errors in USPTO database entries. The addition of a water molecule as a “formal” basic species leads to a fully balanced reaction (Fig. 3B). Although water is not a perfect base for the Suzuki reaction, it helps to correctly represent the boron-containing leaving group in reaction equations. In silico generated reactions that cannot be “corrected” in this way have been discarded.

We also decided to discard the unfeasible under normal conditions transformations consisting in the cleavage of a C–C bond and assuming a carbon-centric leaving group. Application of these heuristics reduced a considered set of novel reactions to 44 including 31 reactions with new RC and 13 reactions with new RC + 1.

The question arises whether we need to consider explicit chemical structures of generated reactants and products. In our opinion, this is not firmly required if one focuses on the detection of new reaction transformations identified by RC or RC + 1 structural motifs. In this case, a “simplified” reaction in which substrates contain only atoms of reaction center and their closest environment (including second neighbors) could be sufficient, see Fig. 3C. Notice that such simplified reactions correspond to general reactivity patterns. Particular reactants can be selected by chemists as a function of availability, intended conditions, reactivity concerns, etc. For example, the reaction in Fig. 3C looks unfeasible, but it becomes more realistic if the amine leaving group were strengthened by binding to strong electron acceptors (for example, trifluoromethanesulfonyl) or by quaternization.

Notice that the majority of generated reactions have known RC and RC + 1 motifs. All belong to the Suzuki coupling type, as exemplified below.

figure a

Reactions with new reaction centers

31 reactions featuring a total of 13 distinct reaction centers not seen in USPTO were generated (see Table 1 and Table S3 in Supporting Information). Substructural searching with RC as a query in the much larger CAS REACT database (SciFinder) resulted in retrieval of several reactions similar to those “discovered” by Artificial Intelligence. In particular, this concerns reactions with C–Br56 bond formation and C–Si57 coupling, and C–C coupling with N-containing58 and F59 leaving groups, as well as C–O coupling with organosilicon leaving groups. In total, 5 out of 13 new reaction centers discovered computationally, were found in SciFinder reactions (Table S5 in Supporting Information). Since none of them were used for the autoencoder training, these generated reactions were pure “imagination” of AI. Thus, several “novel” reactions (4 in Table 1, 5, and 7 in Table S5) correspond to a quite interesting C-N bond cleavage with amine as leaving group. A similar reaction has recently been discovered experimentally by Weires et al.60 who shown that the formation of amides facilitated nickel-catalyzed cleavage of C–N bonds accompanied by C–C coupling (reaction 12 in Table S3). Experimental analogues of C-Si coupling reactions generated by the model (reactions 8 and 9 in Table 1, reactions 19–26 in Table S3) were found in SciFiner (reaction 9 in Table 1 and 19 in Table S5). In the experiment, bromotriarylsilane was used as template57 (reaction 19 in Table S5) whereas our tool proposed less stable di-substituted silane bromide possibly with heteroatoms surrounding silicon (reactions 8 and 9 in Table 1). The organosilicon leaving group proposed for the C-O coupling (reaction 11 in Table 1) is very similar to that reported by Kori et al.61 Fluoro-Suzuki reaction proposed by the autoencoder (reaction 5 in Table 1) was observed experimentally in the study by Chi et al.62 Reaction 7 in Table 1 is not a coupling but boron substitution by bromine; it has been experimentally discovered using N-bromosuccinimide as a donor of bromine in the study by Thiebes et al56 (reaction 17 in Table S5). However, from the structural point of view, its reaction center looks similar to “classical” Suzuki type reactions (boron substitution by carbon or heteroatom).

Table 1 Examples of simplified Suzuki-type reactions with a new reaction center together with corresponding conventional reaction SMILES notation.

Some of the reactions still look unfeasible, e.g., the O–I compound seems quite unstable (reaction 2 in Table 1). Nonetheless, such compounds are listed as commercially available (e.g. CAS Nos 3240-34-4, 1338247-47-4). Sulfur-containing compounds are generally unsuitable for Suzuki catalysts. Their generation can be explained by an excessive model’s “creativity”, which can be hardly controlled in the employed neural network architecture.

Reactions with a new environment of known reaction centers (RC + 1)

Following the novelty detection procedure, 13 reactions that correspond to 3 known reaction centers but an original first environment (RC + 1) were detected, see Table 2 and Table S4 in SI. Two similar reactions have been found in SciFinder. Although the simplified reaction 1 in Table 2 looks unfeasible, a more suitable leaving group might render it possible. For instance, in hydrogenation conditions, a catalyst can facilitate reductive cleavage of C–O bond (in esters, carbamates, benzyl ethers, etc.) followed by a coupling (as in reaction 10 in Table S6).

Table 2 Examples of simplified Suzuki-type reactions with the RC present in the training set but in a new chemical environment.

The use of alkyl and acyl bromides in C–C coupling in reaction 3 (Table 2), was observed experimentally (see reaction 13 in Table S6 in SI). Reaction 2 in Table 2 looks quite feasible because synthesis of acyl iodides was reported in the literature (e.g., CAS 191340-22-4 and CAS 1332596-80-1) whereas carboniodidates can be provided by some vendors (e.g., Enamine BBV-109267542 or BBV-109267541). Notice that similar reactions with chloroformates have been also found in Reaxys63.

Reactions feasibility assessment

Strictly speaking, reaction feasibility is defined by both kinetic and thermodynamic factors. However, according to the Bell-Evans-Polanyi principle66,67, in a series of similar reactions, the trend of activation energies follows the trend of reaction enthalpies. Thus, favorable thermodynamics, namely reaction enthalpy (∆H), can be considered as weak proof of reaction feasibility. A series of gas-phase DFT calculations were performed to assess ∆H for all simplified reactions with new RC and RC + 1. According to our estimations, almost all reactions are exothermic except for four reactions with Si-containing substrates in which ∆H is positive but close to zero (see Tables S3 and S4 in Supporting Information). This shows that all new computer-generated reactions are feasible, at least, as far as DFT-based thermodynamics estimates can tell. Since DFT is a rather time-consuming method and can hardly be applied for thousands of generated reactions, we also performed a rough estimation of ∆H using the tabulated bond energies in reactants and products68,69,70. Although calculated in such a way reaction enthalpies poorly correlate with the DFT values, they generally provide similar conclusions concerning reaction feasibility (see Tables S3 and S4 in Supporting Information).

Conclusion

Here we present the first attempt to generate new chemical reactions using a combination of Condensed Graph of Reaction, Generative Topographic Mapping, and sequence-to-sequence autoencoder. To feed the autoencoder, special reaction SMILES strings (SMILES/CGR) were conceived and implemented. In order to discard the seemingly unfeasible reactions, a special 4-steps post-processing procedure has been implemented. It includes: (i) stoichiometric balancing of reaction equations, (ii) reduction of substrates structure to their simplified form, (iii) discarding chemically infeasible transformations using suggested heuristics (“Chemical filters”), and (iv) assessment of synthetic feasibility using quantum mechanics calculations. The effectivness of the suggested approach was demonstrated on the example of Suzuki-like coupling reactions. Among generated reactions we discovered transformations with 13 new reaction centers which did not occur in the training set. Five out of 13 transformations were then found in the reaction databases (not used in the model training), thus showing the reliability of our approach to generate new synthetically feasible reactions.

This study reveals that creativity of Artificial Intelligence is rather limited. Deep learning neural networks, at least, in their current state, are not able to invent completely new type of chemical transformations but rather propose unseen and sometimes not trivial variations of existing ones. Thus, in this study novel (in the context of the training set) C–N, C–O, C–S and C–Si bond formation reactions, as well as nitrogen- and sulfur-containing leaving groups have been suggested by the model. We believe that this opens a way to propose putatively new synthetic pathways in a way that is not affected by the bias of human expertise—with all the benefits and the pitfalls this may bring. It should also be noted that compared to theory-driven quantum chemical models, data-driven DNN is much less time consuming and, practically, is not limited by the reactants size. The more data are used in the neural network training, the more realistic the predicted reactions are. Since the sizes of reaction databases are rapidly growing up, deep learning approach has an obvious perspective as a tool for discovery of novel reactions.

Methods

Datasets and data curation

The dataset used in this project comes from United States Patents and Trademark Office database (1976 to 2016) extracted by Lowe53. It contains about 3.5 million reactions. The initial dataset was preprocessed with in-house scripts based on the CGRtools library51. The curation includes the standardization (aromatization and functional group standardization), removal of empty reactions (those where the products and reactants are the same, or no reactants or products are recorded) and reactions with valence errors. For curated reactions, atom-to-atom mapping (AAM) was performed using the ChemAxon Automapper tool which is a part of the JChem toolkit71. The mapped reactions were converted into CGRs and their reaction centers were extracted with the CGRtools. In total, 165 879 different reaction centers were obtained. Since AAM errors lead to incorrect reaction centers, which are usually rare, only highly populated reaction centers were selected. Thus, the resulting dataset consisted of some 2.5 million reactions (approximately 70% of the initial dataset) which corresponds to 300 most frequent reaction centers.

According to our estimations72, the ChemAxon Automapper tool leads to the erroneous AAM for some 25% of USPTO reactions. Most of those concern cycloadditions with complex reaction centers. As far as Suzuki coupling is concerned, this error is around 3%.

Notice that practically all USPTO reactions are stoichiometrically unbalanced. This doesn’t prevent to build Condensed Graph of Reaction, but, in some cases, may lead to erroneous atom-to-atom mapping.

Reaction data treatment

CGRtools library (version 3)51 was used for the reactions cleaning, their transformation to CGRs, conversion of CGRs into SMILES/CGR, and processing of generated SMILES/CGR back into reactions.

SMILES/CGR notation

Generally, SMILES/CGR follows the OpenSMILES rules73. They differ from regular Daylight SMILES in terms of aromatic atoms and ring closure specification and introduce special “dynamic” bonds and atoms characterizing chemical transformations. Dynamic bonds in CGR characterizing chemical transformations have special labels representing changes in bond orders. Dynamic atom corresponds to change of formal charge or radical state of this atom in reaction. Detailed information about SMILES/CGR syntax is given in Supporting Information. SMILES/CGR generation and parsing, including preparation of canonic SMILES/CGR, are implemented into CGRtools Python library51.

Reaction generation algorithm

The network architecture previously applied for molecular SMILES generation30 has been used in this study. It is based on the autoencoder architecture introduced by Xu et al.74. SMILES/CGR transformed into sequences of one-hot encoded characters with padding to constant length (256) were used to feed the encoder. Symbols within square brackets (conventional or dynamic atoms or dynamic bonds) were considered as a single symbol within tokenization. The encoder consists of two bidirectional Long Short-Term Memory (LSTM)52 layers (128 nodes each), while the decoder is composed of two forward LSTM layers (256 nodes each). The bottleneck dense layer between the encoder and the decoder transforms the states of the encoder LSTMs into latent variables to subsequently feed them to the decoder; it consists of 128 nodes. Finally, the decoder outputs are transformed back to one-hot encoded characters via a single dense layer.

The autoencoder was trained in batch mode, where batches of “one-hot”-encoded sequences were generated on-the-fly from training set SMILES/ CGR strings. The Adam optimizer was used for training, initial learning rate was set to 0.005, and batch size was set to 256 samples per batch. The learning rate was reduced during training if there were no improvement in the validation loss for two epochs. The training was terminated after 34 epochs when no improvements in test set reconstruction accuracy was observed. To generate latent variable vectors for eventual decoding, we use the Generative Topographic Mapping method. It is a non-linear dimensionality reduction method that has been successfully used for chemical space analysis54,74,75,76,77,78,79,80,82, comparison of chemical libraries83, building classification43,74,75,77,80,84, and regression85,86 models via activity landscapes, as well as for solving the “inverse” QSAR problem87. The GTM algorithm operates by embedding a nonlinear two-dimensional manifold into a D-dimensional descriptor space and calculating the distribution of objects of initial space on these two dimensions. In this work, we utilize the autoencoder’s latent vectors as an initial descriptor space. Once a map for the entire USPTO database was constructed, the zones corresponding to the desired reaction type (Suzuki reaction) were located, from which the latent vectors for virtual reactions were sampled. These new vectors fed the trained decoder resulting in new SMILES/CGR strings.

Novelty detection

Novelty detection is based on the comparison of hashed reaction signatures corresponding to reaction centers (RC) and their environment between the database of known reaction (here, USPTO database) and the reactions generated by the autoencoder (Fig. 4). Encoding chemical reactions by CGR significantly simplifies RC detection. Thus, substructural motifs involving the reaction center (RC, RC + 1, RC + 2, …) can easily be extracted from CGR (see Fig. 5). Since any operations with molecular graphs are time-consuming, each substructural motif was encoded by a unique hash code51—a reaction signature uniquely identifying given transformation. In this case, the novelty detection is reduced to the comparison of signature (hash code) of a generated reaction with those of known reactions (Fig. 4). The suggested procedure assures fast and precise novelty detection.

Figure 4
figure 4

Reactions novelty detection workflow. Substructural motifs Sgen (RC, RC + 1, RC + 2, …) are extracted from the query CGR and compared with those for known reactions {Sknown}. In such a way, motifs belonging to novel reactions will easily be identified.

Figure 5
figure 5

Preparation of a collection of reaction signatures as hash codes. From a CGR generated from a given reaction, substructural motifs containing reaction center (RC), or reaction center with n neighboring bonds and atoms (RC + n, here n = 1) can be extracted. Each motif is encoded by a hashing function into a unique hash code—reaction signature. The ensemble of unique hash codes for all reactions in the database is stored in the hash table.

Reaction enthalpy calculations

The difference in energies between reactants and products is calculated in several steps88. First, a conformer with the lowest energy is generated for each compound in the reaction using the ChemAxon cxcalc module. Then, the geometry of each compound was optimized using the Priroda16 program with PBE exchange and correlation functional89, and the built-in triple-zeta split valence basis set 3z, which is equivalent to Schäfer’s TZVP basis set90. Relativistic and solvent effects were neglected. The Priroda16 program was chosen as it is one of the fastest DFT software due to the efficient evaluation of density functional exchange–correlation terms based on the electron density expansion91. Final energy values were extracted for optimized structures and used for calculation of reaction enthalpy. The additive scheme for estimating reaction enthalpies was implemented using the tabulated chemical bonds increments68,69,70.