Main

The efficacy of a catalytic process is dictated by the possible transition states, which feature core non-covalent interactions that determine their geometries and energies1,2. Such interactions are often difficult to identify and define because they are energetically weak and sensitive to the molecular properties of every reaction component (catalyst, substrates, reagents, solvent and so on)3,4. This overarching issue in reaction optimization is often exacerbated by subtle connections across several reaction variables, wherein modest structural changes to any or a few of these can have a profound effect on the experimental outcome5,6,7. These factors, combined with the number of dimensions under study in most reactions, are the underlying reasons that optimization is traditionally empirical8,9. This situation is particularly common in the area of asymmetric catalysis, wherein seemingly minor structural variations in any reaction component can have acute and non-intuitive influences on the observed enantioselectivity10. However, it is possible that such mechanistic outliers may be concealed within larger datasets because our pattern recognition skills do not perceive pivotal generalities when reaction situations change.

On this basis, we hypothesized that connecting common mechanistic features through the simultaneous interrogation of all reaction components would provide a holistic view of the key non-covalent interactions responsible for reaction performance. This would enable the transfer of experimental observations to genuinely different substrate combinations with unique catalysts. Here we develop and deploy a workflow that parameterizes all the reaction variables of more than 350 distinct reaction combinations, which allows the development of comprehensive statistical models, resulting in the ability to predict reaction performance for entirely different structural motifs. The workflow includes techniques to probe general mechanistic principles, which provides the basis for transfer learning or generalized identification of the key interactions imparting asymmetric induction.

Asymmetric catalysis is replete with examples of catalysts that can promote disparate reactions through a common mode of activation11,12,13,14. However, when ‘similar’ reactions are attempted, many changes to the precise reaction conditions are often required to obtain the desired reaction performance15,16. These changes can be subtle (for example, one aromatic solvent for another) or more profound (one catalyst class for another). This leads us to ask three questions. (1) Is mechanistic insight transferable to a new reaction in the same subclass, given that a standard mechanistic paradigm may exist with a general mode of activation? (2) If so, how could a data-driven workflow that combines data acquisition with a mathematical description of the molecules involved be used to build a statistical model for multiple, diverse reaction profiles? (3) If such a workflow is achievable, can the observed conditions of one or more reactions be deployed to predict the performance of another? Such analysis could provide a mechanistic understanding of why certain conditions are effective for a general reaction type, together with the ability to transfer this information quantitatively to out-of-sample predictions, thereby streamlining reaction optimization17,18.

To assess a specific workflow that is designed to probe the questions posed above, it would be pragmatic to compare transformations within a reaction class facilitated by a single catalyst chemotype. Although multifarious reports of the same catalyst class for different transformations exist in enantioselective catalysis, comparative studies (even qualitative rather than quantitative ones) have been sparse. Such an assessment would be challenging because most datasets, often generated under non-uniform conditions, are incomplete, and readily comprehensible descriptors for each varying reaction component need to be developed. To address this correlation challenge, we envisioned a strategy for the interrogation of enantioselective catalysis involving the application of modern data-analysis methods and advanced parameter sets. In this approach, integrated descriptor sets (derived from quantitative structure–activity relationships (QSAR), molecular mechanics (MM) and density functional theory (DFT))19 are related to a relatively large library of outputs collected from a general reaction and catalyst type, which are data-mined from multiple literature sources (see the Supplementary Information). By combining appropriate data-organization and trend-analysis techniques, general relationships between reactions can be established. The ability of the statistical models to predict the performance of a new reaction type is used as a validation of mechanistic transferability (Fig. 1).

Fig. 1: Workflow for interrogating and applying mechanistic transferability.

a, Mechanistic transferability. BINOL-based phosphoric acid catalysed nucleophilic additions to imines as a general reaction for workflow development. b, Prediction workflow. Reaction performance predictions are streamlined by employing a mechanistic transferability strategy implemented by correlating all reaction variables to enantioselectivity. General correlations can be built to reveal the interactions between any reaction component in the relevant transition state and enantioselectivity. The mechanistic principles leading to enantioselective catalysis captured by the statistical models can be transferred to genuinely different structural motifs not contained in the training dataset. Σ indicates the totality of the descriptor categories that were considered.

Reaction platform selection

As a proof-of-concept reaction class, we chose the addition of various nucleophiles to imines owing to the ubiquity of this type of transformation in asymmetric catalysis20,21. This reaction class uses imine starting materials that are easy to obtain and the resulting amine products have broad applicability in both synthetic and biosynthetic settings22,23. As a next step, we evaluated the different catalyst chemotypes used in this reaction class, focusing on those that provide a wide range of both substrate structural types and enantioselectivity data from published sources. With these constraints in mind, we selected the field of chiral phosphoric acid (CPA) catalysis, in particular the addition of protic nucleophiles to imines catalysed by chiral 1,1′-bi-2-naphthol (BINOL)-derived phosphoric acids bearing aromatic groups at the 3 and 3′ positions (Fig. 1)24.

To initiate this workflow, an expanded inventory of 367 reactions with varied components was curated from multiple reports (for a list of references, see Supplementary Information). From this survey, we categorized the dataset by imine transition-state geometry (E or Z) wherein E-imine transition states have a +e.e. value and Z-imines have a −e.e. value. Imine stereochemistry was determined by the enantiomer of the product formed if the imine was derived from an aldehyde. However, if ketimines (imines derived from ketones) were employed, we also needed to consider substituent size if the smaller C-substituent has higher Cahn–Ingold–Prelog (CIP) priority25,26. For the reactions we studied here, this affects only ketimines that have either a trifluoromethyl or ester C-substituent, which are considered to have lower priority for the purpose of assigning an E or Z transition state. This is important in understanding product enantioselectivities, because nucleophile addition to the same face will yield opposite enantiomers for the E and Z configurations. Therefore, the models developed will not be capable of predicting product stereochemistry but can be deployed to predict whether a reaction will proceed via an E- or Z-type mechanism and this information can be used to determine absolute configuration.

Simultaneously, we collected a diverse array of molecular descriptor values from DFT-optimized geometries to describe the structural features of each imine, nucleophile, catalyst and solvent. Unfortunately, the lack of structural commonality for particular molecular subsets creates a challenge in identifying readily comprehensible and extensive parameter sets for each component. For example, when comparing substrates and catalyst structures, it is apparent that they have overlapping and distinctive features that are probably required for determining selectivity patterns (Extended Data Fig. 1). By contrast, the solvents do not have common substructures, yet are critical for enantioselectivity.

To address this limitation, we explored two approaches: (1) we collected parameters derived from DFT calculations, which satisfactorily describe molecules containing common structural features including Sterimol parameters, bond lengths, angle measurements, molecular vibrations and intensities, natural bond orbital (NBO) charges, polarizabilities, highest occupied molecular orbital (HOMO) and lowest unoccupied molecular orbital (LUMO) energies27,28. We collected these parameters for both the reaction partners and the catalysts. (2) We used two-dimensional descriptors (such as topology and connectivity as exemplified by molecular shape, size and number of heteroatoms) because this is a traditional method of assessing structurally disparate molecules such as solvents29,30. Other reaction variables, such as concentration of reagents or catalysts and inclusion of molecular sieves, were also included as categorical descriptors (see Supplementary Information).
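As a rough illustration of how the two kinds of inputs described above might be placed in a single table, the sketch below z-scores continuous descriptors and encodes yes/no reaction variables as 0/1 flags. The scaling scheme, the function names and the descriptor keys (`L_cat`, `sieves`) are assumptions made for this example; the text states only that such variables were included as categorical descriptors, and the actual descriptor tables are in the Supplementary Information.

```python
from statistics import mean, pstdev


def zscore(values):
    """Centre a descriptor column on zero and scale it to unit variance."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def build_rows(reactions, continuous_keys, flag_keys):
    """Return one numeric row per reaction.

    reactions       : list of dicts, one per reaction (hypothetical layout)
    continuous_keys : names of continuous descriptors (e.g. DFT-derived)
    flag_keys       : names of yes/no variables (e.g. sieves present)
    """
    scaled = {k: zscore([r[k] for r in reactions]) for k in continuous_keys}
    rows = []
    for i, r in enumerate(reactions):
        row = [scaled[k][i] for k in continuous_keys]
        row += [1.0 if r[k] else 0.0 for k in flag_keys]
        rows.append(row)
    return rows


if __name__ == "__main__":
    demo = [
        {"L_cat": 4.0, "sieves": True},   # illustrative values only
        {"L_cat": 6.0, "sieves": False},
    ]
    print(build_rows(demo, ["L_cat"], ["sieves"]))
```

Putting every continuous column on a common scale is one way to make regression coefficients comparable across descriptors of very different units.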

Comprehensive model development

Linear regression algorithms (see Supplementary Information) were then applied to the entire dataset (367 reactions) to identify correlations between the molecular structure of every reaction variable defined by the parameters collected in the previous step of the workflow and the experimentally determined enantioselectivity. ΔΔG = −RT ln(e.r.) (where e.r. is the enantiomeric ratio, T is the temperature at which the reaction was performed and R is the gas constant) was regressed to an equation to reveal a surprisingly good correlation despite the large structural variance included in the training set. Both cross-validation analysis (leave-one-out (LOO) and k-fold) and external validation, in which the dataset is partitioned pseudorandomly into 50:50 training:validation sets, suggest a relatively robust model (see Supplementary Information). The model emphasizes solvent (black), imine (blue), nucleophile (green) and catalyst (red) terms distributed over six parameters, as contributors to the enantioselectivity across these seventeen reaction types (Fig. 2a). A slope approaching unity and intercept approaching zero over the training set indicates an accurate and predictive model with a goodness-of-fit R2 value of 0.88, demonstrating a high degree of precision. The largest coefficients in this normalized model belong to the imine NBO descriptors, indicating the crucial role of the imine substrate in the quantification of enantioselectivity as highlighted by the formation of both enantiomeric products, a consequence of active E and Z configurations (see below). A comparison of two Strecker reactions performed under uniform conditions results in values ranging from +99% enantiomeric excess for the enantiomer that proceeds through the E-imine transition state and −80% enantiomeric excess for the Z-imine transition state. Remarkably, this represents a 3.5 kcal mol−1 energy range, based solely on imine structure.
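The conversion between a signed enantiomeric excess and the regressed free-energy quantity can be sketched as follows. This is a minimal example rather than the authors' code: it assumes that e.r. is expressed as the ratio of the E-pathway product to the Z-pathway product (so that a positive % e.e. gives e.r. > 1), uses R in kcal mol−1 K−1, and the function names are hypothetical.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal mol^-1 K^-1


def ee_to_er(ee_percent):
    """Convert a signed % e.e. into an enantiomeric ratio.

    Assumed convention: positive e.e. (E-imine pathway) gives e.r. > 1,
    negative e.e. (Z-imine pathway) gives e.r. < 1.
    """
    if not -100.0 < ee_percent < 100.0:
        raise ValueError("e.e. must lie strictly between -100 and 100")
    return (100.0 + ee_percent) / (100.0 - ee_percent)


def ee_to_ddg(ee_percent, temperature_k):
    """Apply ddG = -R T ln(e.r.); the result is in kcal/mol."""
    return -R_KCAL * temperature_k * math.log(ee_to_er(ee_percent))


def ddg_to_ee(ddg, temperature_k):
    """Invert the relationship to recover a signed % e.e."""
    er = math.exp(-ddg / (R_KCAL * temperature_k))
    return 100.0 * (er - 1.0) / (er + 1.0)


if __name__ == "__main__":
    for ee in (99.0, 0.0, -80.0):
        print(ee, ee_to_ddg(ee, 298.15))
```

Because the temperature enters the conversion directly, the same % e.e. corresponds to different ΔΔG values for reactions run at different temperatures, which is why T has to be recorded for every entry.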

Fig. 2: Comprehensive model development.

a, Comprehensive regression model containing 367 data entries facilitated by parameterization of every reaction variable. ‘sol’ is the solvent term, ‘NBON’ and ‘NBOC’ are imine natural bond orbital parameters, Ls is a steric descriptor of the smallest imine substituent, ‘H–X–CNu’ is the nucleophile angle measurement and Lcat is the length of the catalyst 2-substituent. A positive percentage enantiomeric excess (% e.e.) value indicates the E-imine transition state, and a negative percentage enantiomeric excess value indicates the Z-imine transition state. The line is a fit, y = 0.88x + 0.05. The leave-one-out (LOO) cross-validation score is 0.87; the average k-fold (here, fourfold) cross-validation score is 0.87; the goodness of fit R2 is 0.88; the predicted R2 is 0.87. b, Test of mechanistic transferability in the dataset via leave-one-reaction-out (LORO) analysis. Distinct reactions (as determined by individual publications) are defined as the validation set. The line is a fit, y = 0.84x + 0.12. R2 is 0.84; the R2 predicted using LORO (here, seven reactions were left out) is 0.85. c, Visual analysis and interpretation of the model terms (coefficients are shown).

We postulated that the ability to correlate and predict using a singular model for an array of reactions suggests that the transition-state features are fundamentally similar within this reaction range. Perhaps the best test of this hypothesis could be achieved by a ‘leave one reaction out’ (LORO) analysis. In this statistical evaluation, the catalyst, imine and nucleophile structures are varied as a validation set and assessed through the ability of the model to predict with sufficient accuracy. This would report on the model’s capacity to match patterns across a general reaction type. Using this analysis, each distinct reaction (as determined by individual publications) in the data field was evaluated, with most predicted well (see Supplementary Information). As an illustration of model robustness, we could exclude up to seven reactions with little change in the correlation statistics (Fig. 2b). However, not surprisingly, some reactions were poorly predicted using the LORO protocol, which can be attributed to the model’s inability to capture specific structure changes if they are not adequately expressed in the training set. In sum, the descriptor definitions coupled to the model and validation strategies do demonstrate that patterns can be matched. This is consistent with the hypothesis that a defined set of key non-covalent interactions impart asymmetric induction across a general reaction type. Essentially, this workflow provides evidence that one reaction can be used to predict the results of another, quantitatively.
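The LORO idea can be illustrated with a short sketch in which every entry from one publication is held out, the model is refitted on the rest, and the held-out entries are predicted. The ordinary least-squares solver here is a stand-in for the linear regression algorithms described in the Supplementary Information, and all function names are hypothetical.

```python
def fit_ols(X, y):
    """Ordinary least squares with an intercept, via the normal equations."""
    rows = [[1.0] + [float(v) for v in r] for r in X]
    p = len(rows[0])
    # Augmented matrix [A^T A | A^T y].
    M = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        piv = max(range(c, p), key=lambda k: abs(M[k][c]))
        if abs(M[piv][c]) < 1e-12:
            raise ValueError("singular design matrix")
        M[c], M[piv] = M[piv], M[c]
        for k in range(p):
            if k != c:
                f = M[k][c] / M[c][c]
                M[k] = [a - f * b for a, b in zip(M[k], M[c])]
    return [M[i][p] / M[i][i] for i in range(p)]  # [intercept, coefficients...]


def predict(coef, x):
    """Evaluate the fitted linear model for one descriptor row."""
    return coef[0] + sum(c * v for c, v in zip(coef[1:], x))


def loro_predictions(X, y, groups):
    """Leave one reaction (group label, e.g. publication) out at a time."""
    preds = [None] * len(y)
    for g in sorted(set(groups)):
        train = [i for i, gi in enumerate(groups) if gi != g]
        held_out = [i for i, gi in enumerate(groups) if gi == g]
        coef = fit_ols([X[i] for i in train], [y[i] for i in train])
        for i in held_out:
            preds[i] = predict(coef, X[i])
    return preds


if __name__ == "__main__":
    X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1], [1, 2]]
    y = [1 + 2 * a - 3 * b for a, b in X]       # synthetic, noise-free
    groups = ["paper1"] * 3 + ["paper2"] * 3    # illustrative labels
    print(loro_predictions(X, y, groups))
```

Grouping the split by publication, rather than by individual entry, is what distinguishes this test from ordinary leave-one-out: the model never sees any example of the held-out reaction during fitting.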

Trend analysis

Although the comprehensive model in Fig. 2 establishes the capacity of the selected parameters to describe general aspects of this system, the ultimate goal of our workflow is to discern subtle underlying mechanistic phenomena. This objective could not be achieved by using the above correlation because it was produced by using the entire dataset, which provides only an overview of the mechanistic patterns. We hypothesized that a series of focused correlations, coupled with an evaluation of the overall trends, might serve to reveal fundamental features of the systems. To this end, we truncated the dataset into subsets, categorized by imine transition-state geometry (E or Z) determined by the relative sign of the enantiomeric excess defined previously, as these are hypothesized to lead to structurally distinct interactions with the other reaction components. This organizational scheme was viewed as a means of facilitating the identification of catalyst features that affect particular mechanistic pathways and therefore, reactant combinations (and vice versa). Linear regression algorithms were then applied to this data classification to identify correlations between molecular structure and the experimentally determined enantioselectivity. Subsequently, analysis and refinement of the resulting models were used to produce explicit mechanistic hypotheses (Fig. 3).

Fig. 3: Development of focused correlations.

a, Regression E-imine model containing 204 entries data-mined from nine literature sources (see the Supplementary Information for references). ‘CI’ and ‘PEOE5’ are solvent descriptors, ‘B5PG’ and Ll are the imine steric descriptors, LUMO is the lowest unoccupied molecular orbital energy describing the nucleophile, Lcat is the length of the catalyst 2-substituent, ‘iPOas’ is the P‒O asymmetric stretching intensity and ‘AREA’ is a remote environment angle. The line is a fit, y = 0.80x + 0.35. The LOO cross-validation score is 0.76; the average k-fold (here, fourfold) cross-validation score is 0.74; R2 is 0.80; the predicted R2 is 0.73. b, Interpretation of E-imine model terms. The model emphasizes the importance of both steric and electronic factors. Reasonably large catalyst and imine substituents lead to high levels of enantioselectivity; if these two components are matched any nucleophile should be compatible. c, Regression Z-imine model containing 147 entries data-mined from eight literature sources (see the Supplementary Information for references). ‘NBOH’ and ‘NBOPG’ are the imine natural bond orbital parameters; Ls is a steric descriptor of the smallest imine substituent; ‘B5Nu’ is the nucleophile steric descriptor and ‘B1cat’ is the Sterimol B1 term. The line is a fit, y = 0.83x − 0.24. The LOO cross-validation score is 0.80; the average k-fold (here, fourfold) cross-validation score is 0.79; R2 is 0.83; the predicted R2 is 0.80. d, Interpretation of Z-imine model terms. Overlapping steric terms describing the catalyst and the imine reinforce the notion that similar interactions remain within the two geometric imine stereoisomers. However, this model emphasizes the importance of steric contributions predominantly from the nucleophile for high enantioselectivities.

The correlation depicted in Fig. 3 was identified from a set of 204 reactions (evenly split into training and validation sets) that proceed via the E-imine transition state. The relationship includes two solvent, two imine, one nucleophile and three catalyst terms. Overall, the statistical model suggests a mechanistic scenario in which the imine adopts an arrangement that minimizes energetically penalizing repulsion interactions with reasonably large catalyst substituents31. Perhaps most telling is that the steric profile of the nucleophile does not have much effect on the stereoselectivity outcome, despite the large structural variance. The included parameters (LUMO and the P‒O asymmetric stretching intensity, iPOas) suggest that hydrogen-bonding contacts between catalyst and nucleophile play a minor part and the use of almost any nucleophile should be compatible with the reaction if the imine and catalyst are matched.

In evaluating the model for Z-imines determined by 147 reactions, a number of overlapping terms reinforce the notion that similar interactions between catalyst and substrates remain within the two geometric imine stereoisomers. Two of these terms—the size of the catalyst aryl substituent as measured by the Sterimol B1 term and the imine NBO parameter—essentially describe the repulsive interactions between proximal sterics and the imine N-substituent, a critical catalyst–substrate interaction common to both transition-state imine configurations. The most compelling difference between the two models is that the Z-imine model includes an important nucleophile steric descriptor, which is the most highly weighted term in the equation. This suggests that larger nucleophiles introduce enhanced repulsive interactions with the catalyst substituents in the transition state, leading to the competing product, which ultimately favours the observed enantiomer. This claim is further supported by the observation of high enantioselectivities when using catalysts with smaller substituents (for example, Ar = 3,5-(CF3)2C6H3). The proposed physical meanings of each term in the mathematical equations have been summarized in Fig. 3.

Evaluation of prediction capabilities

As a final step in the workflow, we evaluated the ability to transfer the mechanistic principles leading to enantioselective catalysis captured by the statistical models to genuinely different structural motifs not contained in the training dataset. If effective out-of-sample prediction were possible, the model could predict the impact of a new imine, nucleophile and/or catalyst. Initially, reaction performance was evaluated using the comprehensive model to determine the mechanistic pathway under operation, and these predictions could then be further refined with the specific models (E or Z). This two-tiered workflow is imperative because the process avoids mechanistic assumptions about whether the reaction proceeds via an E or Z transition state, thus ensuring that the results of the test reactions are unknown. The comprehensive model does not immediately allow prediction of stereochemistry; however, product configuration can be assigned from the simple models shown in Fig. 4. These are based on the amine product yielded from a reaction proceeding via an E or Z transition state and catalysed by the (R)-CPA. The opposite enantiomer will be formed if the (S)-CPA is employed as the catalyst. As a first case study, we evaluated fifteen additional reactions involving enecarbamates, a nucleophile not contained in the training set, and benzoyl imines, an imine subclass that is part of our initial training set32 (Fig. 4). Each result was predicted using the comprehensive model, with an average absolute ΔΔG error of 0.37 kcal mol−1 (13 examples within 5% enantiomeric excess) and the absolute stereochemistry correctly assigned as R, demonstrating the ability of the model to extrapolate effectively to a new nucleophile. A slightly improved outcome is observed using the E-imine mechanistic model with an average error of 0.24 kcal mol−1 (all examples within 5% enantiomeric excess).
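The two-tiered routing described in this section can be sketched as below. The interface is an assumption made for illustration: each model is treated as a callable that maps a descriptor dictionary to a signed % e.e. (positive for the E-imine transition state, negative for Z), and the function and key names are hypothetical.

```python
def two_tier_predict(descriptors, comprehensive, e_model, z_model,
                     catalyst_hand="R"):
    """Route one reaction through the comprehensive model, then E or Z.

    comprehensive, e_model, z_model : callables returning a signed % e.e.
    catalyst_hand                   : "R" or "S" configuration of the CPA
    """
    if catalyst_hand not in ("R", "S"):
        raise ValueError("catalyst_hand must be 'R' or 'S'")

    # Tier 1: the sign of the comprehensive prediction selects the pathway.
    first_pass = comprehensive(descriptors)
    pathway = "E" if first_pass >= 0 else "Z"

    # Tier 2: the pathway-specific model refines the prediction.
    refined = (e_model if pathway == "E" else z_model)(descriptors)

    # Reference drawings assume the (R)-CPA; the (S)-CPA gives the mirror image.
    product = "as drawn for (R)-CPA" if catalyst_hand == "R" else "mirror image"
    return {"pathway": pathway, "first_pass": first_pass,
            "refined": refined, "product": product}


if __name__ == "__main__":
    # Placeholder models with made-up outputs, for demonstration only.
    result = two_tier_predict(
        {"example_descriptor": 1.0},
        comprehensive=lambda d: 85.0,
        e_model=lambda d: 92.0,
        z_model=lambda d: -70.0,
    )
    print(result)
```

Keeping the pathway decision separate from the refinement step means that no assumption about E or Z geometry has to be supplied by the user for a new reaction.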

Fig. 4: Out-of-sample predictions using two-tiered prediction workflow.

The comprehensive model first determines the E or Z transition state; configuration-specific models are then used to refine the predictions. A generic amine product denotes the stereochemical outcome predicted if the reaction proceeds via the E or Z transition state and is catalysed by an (R)-CPA. Product stereochemistry is reversed if (S)-CPA is used. a, Out-of-sample prediction. Application to addition of enecarbamates to benzoyl imines and transfer hydrogenation of alkynyl ketimines. DCM, dichloromethane; RT, room temperature (25 °C). b, Out-of-sample prediction and extrapolation. Prediction of TCYP, which has cyclohexyl groups at the 2,4,6 positions of the aromatic ring, to be a highly selective catalyst for the addition of thiol to benzoyl imines.

As the second case study, the hydrogenation of alkynyl ketimines catalysed by an H8-BINOL-derived catalyst in which the 3,3′ groups are 3,5-(CF3)2C6H3 was predicted33. This is a more challenging scenario because neither the imine nor the catalyst is included in the training set. Again, accurate prediction of the outcomes was achieved using the Z-imine mechanistic model, with an average absolute error of 0.30 kcal mol−1 and 13 examples predicted within 2% enantiomeric excess (Fig. 4). The stereochemical outcome was correctly determined to be R with the (S)-catalyst. Although the comprehensive model assesses the mechanistic scenario and therefore assigns the stereochemical outcome, it was not as accurate because the nucleophile information was categorical (symmetrical or displaced). Therefore, the beneficial effect of a large nucleophile for a Z reaction was not adequately captured. These examples showcase that the predictive capabilities of the model are not limited to classifying the vast literature, but can be applied to analyse and predict new reactions even in situations where multiple components are varied.

As a final case study, we evaluated a recently reported reaction that was rendered highly predictable by application of machine learning algorithms. The study reported by Denmark and co-workers34 involved the addition of thiols to benzoyl imines, a distinct reaction included in our training set. To utilize machine learning approaches, they performed 2,150 separate experiments using 43 catalysts to yield 25 different products (5 × 5 nucleophile/electrophile matrix). We postulated that our approach could reliably predict their results, including the best catalyst, TCYP (2,4,6-tricyclohexyl phenyl phosphoric acid), a CPA that is not in our training set. To test this hypothesis, all experimental results of this reaction type were removed from our original training data, the model was retrained, and deployed to predict their new dataset (34 reactions) collected with the best catalyst, TCYP. We conclude that our model—which lacks experimental data on this reaction—can also predict the enantioselectivities (average absolute ΔΔG error = 0.65 kcal mol−1 comprehensive model (26 examples within 5% enantiomeric excess), 0.67 kcal mol−1 E-imine-only model (25 examples within 5% enantiomeric excess)), confidently determining the stereochemical outcome to be R and TCYP to be a highly selective catalyst. Overall, through the combination of results generated from the out-of-sample prediction platforms, we can conclude that the E- and Z-focused correlations generate more accurate predictions but that the comprehensive model is valuable because it determines which equation should be deployed.
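The two summary statistics quoted for the case studies (an average absolute ΔΔG error and the number of examples predicted within a given % e.e. window) can be computed along the following lines. This sketch assumes a plain mean of absolute differences and an inclusive tolerance; the text does not spell out either detail, and the function names are hypothetical.

```python
def mean_abs_error(measured, predicted):
    """Average absolute difference between measured and predicted values."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(m - p) for m, p in zip(measured, predicted)) / len(measured)


def count_within(measured_ee, predicted_ee, tolerance=5.0):
    """Count predictions whose % e.e. lies within +/- tolerance of experiment."""
    if len(measured_ee) != len(predicted_ee):
        raise ValueError("inputs must be of equal length")
    return sum(
        1 for m, p in zip(measured_ee, predicted_ee) if abs(m - p) <= tolerance
    )


if __name__ == "__main__":
    # Made-up numbers, for demonstration only.
    print(mean_abs_error([1.0, 2.0], [1.5, 1.0]))
    print(count_within([90.0, 95.0, 80.0], [92.0, 88.0, 80.0]))
```

Reporting both numbers is useful because a small ΔΔG error at high selectivity corresponds to a very small change in % e.e., whereas the same error near the racemic regime shifts the % e.e. much more.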

Here we have introduced a workflow with which to model enantioselectivity in assorted catalytic systems. The value of this approach is that complicated reaction conditions can be accounted for and successfully evaluated for multiple and diverse reactions. The ability to correlate and predict enantioselectivity using a single model that covers many reactions suggests that general transition-state features are fundamentally similar across the reaction range, allowing the transfer of observed reaction conditions from one reaction to another. This finding suggests a probable general phenomenon in asymmetric catalysis, whereby various transformations may be found to perform in the same manner when exposed to similar reaction conditions. Through the development of mechanism-specific correlations, such reaction similarities and reaction-specific mechanistic principles may be revealed.

Methods

After the database of reactions was constructed, the experimental outputs (enantiomeric ratios) were mathematically modelled through linear regression techniques to reveal which of the proposed parameters allow for the prediction of new outcomes. The detailed acquisition of parameters and the descriptor tables can be found in the Supplementary Information. The models produced were evaluated for their goodness of fit, R2, and their robustness was demonstrated by external validation of the goodness of fit, the predicted R2. The nearer the R2 and slope values are to 1 (indicating a tight, one-to-one correlation between predicted and measured outcomes) and the nearer the intercept is to zero (indicating minimal systematic error), the more robust the model. Potential models were refined according to the number of parameters (because this allows for a mechanistically informative interrogation) and cross-validation scores. LORO analysis was performed to probe general mechanistic principles, which provide the basis for mechanistic transfer of experimental observations; this transferability was tested further by out-of-sample prediction.
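The fit diagnostics named above (slope, intercept and R2 of the predicted-versus-measured line) can be obtained with a few lines of code. The sketch assumes that measured values are placed on the x axis and predicted values on the y axis, and that R2 is taken as the squared Pearson correlation; applying the same function to a held-out validation set would give the corresponding external-validation statistics. The function name is hypothetical.

```python
from statistics import mean


def parity_stats(measured, predicted):
    """Slope, intercept and R^2 of the line predicted = slope * measured + b."""
    if len(measured) != len(predicted) or len(measured) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mx, my = mean(measured), mean(predicted)
    sxx = sum((x - mx) ** 2 for x in measured)
    syy = sum((y - my) ** 2 for y in predicted)
    sxy = sum((x - mx) * (y - my) for x, y in zip(measured, predicted))
    if sxx == 0 or syy == 0:
        raise ValueError("both series must show some variation")
    slope = sxy / sxx
    intercept = my - slope * mx
    r2 = sxy ** 2 / (sxx * syy)  # squared Pearson correlation
    return slope, intercept, r2


if __name__ == "__main__":
    # Made-up values: predictions offset from experiment by a constant 0.1.
    print(parity_stats([0.0, 1.0, 2.0, 3.0], [0.1, 1.1, 2.1, 3.1]))
```

A constant offset such as the one in this demonstration leaves the slope and R2 at 1 but shows up in the intercept, which is why the intercept is read as a measure of systematic error.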