Introduction

The need for converting stable molecules such as carbon dioxide (CO2), methane, or water into useful chemicals and fuels is growing quickly along with the depletion of fossil-fuel reserves and the pollution of the environment1,2,3. Such a conversion does not have a satisfactory solution, so far. In particular, CO2 conversion remains one of the most important societal and technological challenges1,2,4,5,6,7,8.

The general understanding in heterogeneous catalysis is that a stable molecule such as CO2 needs to be “prepared” before its catalytic conversion occurs. This leads to the notion of molecular activation9. However, on one hand, this notion encompasses a very wide variety of processes (adsorption, photo-excitation, application of electric field, etc.) and materials (including compositional and structural variability), and it remains unclear which properties of the catalytic material and the adsorbed molecule determine the final chemistry, what is the relationship between the two sets of properties, and how general this relationship may be. On the other hand, finding the set of descriptive parameters of a catalytic material that characterize the catalytic performance in a particular process, or even in general for a given reactant, would be very valuable, because it would allow us to quickly search for promising candidate catalysts using rational design10,11,12,13,14,15,16,17. We call these properties materials genes. The genes do not necessarily correlate with catalytic activity by themselves. Similar to biological genes, their role depends on the combination in which they occur, and can be either beneficial or detrimental to the catalytic activity.

Several strategies exist to find such properties for a given reaction. One way is to explore the free-energy surface for each catalyst candidate, which is a slow and resource-consuming process, and currently computationally unfeasible for many materials on a high-throughput basis. An alternative approach consists in searching for a correlation between experimentally determined material’s properties and its catalytic performance. Such a strategy requires consistent experimental measurements at well-defined conditions for a set of materials. To the best of our knowledge, such consistent data have not been reported so far for CO2 conversion on semiconductor oxides. Moreover, available publications usually do not report unsuccessful experimental results. These issues and a strategy to address them have been recently discussed in our publication18.

Yet another strategy is to find an indicator of activation, namely, a property of the system that directly indicates the certain catalytic performance of the material10. Indicators are distinguished from materials genes based on a qualitatively different level of computational complexity. The indicator can still be unfeasible or hard for a high-throughput study of hundreds of thousands or millions of materials. However, when it can be calculated for a few tens or hundreds of materials in a reasonable time, these data can then be used to find materials genes that control the value of the indicator. Since a direct search for a relationship between the indicator and catalytic performance of material would also require a consistent set of data of turnover frequency (TOF), selectivity, and yield values, one could instead consider several most promising indicators, find out which materials are good catalysts, and then check which indicators correlate with this observation. This approach also addresses the problem of defining activation in terms of the adsorbed-molecule properties as potential indicators of catalytic activity.

Catalytic conversion of CO2 requires activation of other reactants as well, e.g., molecular hydrogen, water, or methane. In particular, hydrogen can serve as an environmentally friendly reagent that can be produced by water electrolysis or photo-splitting avoiding extra CO2 emissions19,20,21. Also, oxygen vacancies have been proposed as active sites for CO2 conversion on some materials22. Therefore, predictions of catalytic activity of materials for CO2 conversion can be refined based on analysis of activation of other reactants and defects. An additional challenge is to ensure that the useful products, as well as the surface catalytic activity, are preserved under the conditions of activation and subsequent conversion. While the strong C–O double bonds in CO2 can be weakened or even broken by adsorption at a solid surface at an elevated temperature, this may also lead to too strong adsorption or further dissociation of the molecule, so that the catalytic surface is poisoned by carbonate or carbon deposits. Weak adsorption, on the other hand, means no activation.

In this work, we combine first-principles calculations with an artificial-intelligence (AI) method, subgroup discovery (SGD), to identify pristine materials properties that optimize indicators of catalytic CO2 activation. Moreover, SGD allows identifying one or more distinct combinations of materials features (genes) that promote activation. We focus on oxide materials as candidate catalysts. Oxides are structurally and compositionally stable under realistic temperatures and can be less expensive than the traditional precious metal-containing catalysts23,24,25. Activation of other reactants and defects are not considered. As shown below, meaningful predictions can be made based solely on the analysis of the adsorption properties of CO2 on pristine surfaces. This confirms that these properties are good indicators of activation with a viable optimization pathway at least for the chosen class of materials. The Sabatier principle is taken into account by ensuring that the adsorption energy is not too large or too small. In order to ensure reproducibility of our AI data analysis, we provide all necessary metadata (input parameters) and workflow in the easily accessible form of a Jupyter notebook26. We argue that, with the ever-growing importance and complexity of AI, such detailed and tutorial documentation is a necessity of good scientific practice. Our approach is applicable to a wider class of materials and molecules, not limited to oxides or CO2. Our study by no means encompasses all possible mechanisms of CO2 conversion on oxide surfaces, but it offers a clear design path among many possible ones.

Results

CO2 activation

We find that on semiconductor oxide surfaces CO2 is chemisorbed exclusively when the carbon atom binds to surface O-atoms. All other minima of the potential-energy surface are found to be either metastable or correspond to physisorption. Therefore, there are as many different potential chemisorption sites as there are unique O-atoms at the surface. The dataset includes all non-equivalent surface O-atoms on the 141 considered surfaces of 71 materials, which sum up to 255 unique adsorption sites. Among these sites on about 4% (10 out of 255) CO2 prefers to physisorb, i.e., any chemisorbed state is metastable with respect to the physisorbed one. The physisorption can be easily identified by an almost linear geometry of the adsorbed molecule, and a C–O bond distance very close to the C–O bond length in a gas-phase CO2 molecule, 1.17 Å.

We considered six different candidate indicators of CO2 activation, including OCO-angle and C–O bond distance. The bending of the OCO-angle in the adsorbed CO2 molecule relative to the gas-phase value of 180° (linear configuration) has been previously proposed27 and is widely accepted as a good indicator of activation. For gas-phase CO2, it is understood that the C–O double bond is weakened when an electron is added to the lowest unoccupied orbital, because it is of antibonding (π*) character with a concomitant bending of the molecule. There is a one-to-one mapping between the C–O bond length l(C–O) and the OCO-angle in gas-phase CO2δ− for a range of δ > 0 (red curve in Fig. 1). However, this is not the case for the adsorbed CO2 (dots in Fig. 1). There is a subset of adsorbed CO2 that is close to the red line, but there are many cases where l(C–O) is substantially larger for a given OCO-angle. This is in contrast to metal alloy nanoparticle catalysts, where there is a better correlation between OCO-angle and l(C–O)28. Also, a longer C–O bond reflects a weakening and readiness for further chemical transformations. Thus, the bond elongation itself may be an alternative indicator of activation. A look at the adsorbed CO2 structures reveals that, on sites following the gas-phase correlation, the molecule adsorbs in nearly symmetric adsorption structures with nearly equal length of the two C–O bonds. In the other cases one O-atom of CO2 is close to surface cation(s), leading to a pronounced asymmetry of the adsorbed molecule.

Fig. 1: Correlation between the larger of the two C–O bond lengths and the OCO-angle for charged gas-phase and adsorbed CO2.
figure 1

The OCO-angle in charged gas-phase CO2 is shown with the red line, and adsorbed CO2 structures are shown with the dots. Colored dots: blue—adsorption sites from the unconstrained subgroup with OCO < 132°, green—subgroup of sites with l(C–O) > 1.30 Å, black—the remaining samples (see the text). The subgroups obtained with Sabatier principle constraint are marked with “c”.

Other considered potential indicators of activation include Hirshfeld charge29 of adsorbed CO2 (a direct indicator of the charge transferred to CO2), the dipole moment of the surface along the surface normal per adsorbed CO2 molecule (includes charge transfer to the molecule, as well as adsorption induced surface relaxation), the difference in Hirshfeld charges of C and O-atoms in an adsorbed CO2 molecule (indicates the ionicity of C–O bonds), and the difference in Hirshfeld charges of the O-atoms in the adsorbed molecule (indicates asymmetry of the adsorbed molecule)9,29.

Subgroup discovery

To find out which properties (features) of the clean surfaces determine when a given activation indicator is maximized or minimized, we employ the subgroup-discovery (SGD) approach30,31,32,33,34. Given a dataset and a target property known for all data points, the SGD algorithm identifies subgroups with “outstanding characteristics” (see further for the criteria for being outstanding) and describes them by means of conjunction of basic propositions (selectors) of the kind “(f1 < a) AND (f2 ≥ b) AND ...”, where fi is a feature and a, b are threshold values also found by SGD. In the framework of SGD, we call the selected primary features {f1, f2, ...} materials genes. Thus, SGD identifies both the outstanding subgroups and the relevant materials genes for a given target property.

Obviously, the selectors should only contain features that are much easier to evaluate than the target property. In the presented work, the considered features include properties of gas-phase atoms that build the material, and properties of the pristine material (properties of the bulk phase and of the pristine relaxed surface). Overall 46 primary features have been considered. The full list is presented Supplementary Table 3. Our strategy is to provide an almost exhaustive list of features, and use data analytics to select materials genes from this list. Some of these features have been explored previously as descriptors of catalytic activity for semiconducting and metallic oxides35,36,37,38. O 2p-band center features have been shown to correlate with catalytic properties of both semiconducting and metallic oxides35,37. In particular, most of the features (or closely related ones) mentioned in ref. 36, inspired by the work of Grasselli39, are included in our set, except oxygen vacancy formation energy, which is relevant for the oxidation catalysis, while here we are interested in partial or complete reduction. Additional important features in our work (see below) include features related to the polarizability of surface cations, which describe the long-range surface response to charged adsorbates. A subset of features from our list has been recently used successfully for predicting catalytic properties of metallic oxides38, along with additional features relevant specifically for metallic oxides (such as partial electronic state fillings).

The features selected by the SGD are summarized in Table 1.

Table 1 Features that appear in the top SGD selectors (see text).

The outstanding subgroup should satisfy several criteria. It should be statistically relevant; therefore the subgroups of too small size should be penalized. Target-property values (OCO-angle, C–O bond length, etc.) for subgroup samples should be as different as possible from corresponding gas-phase values since their change upon adsorption indicates CO2 activation33. To achieve this, two requirements are imposed simultaneously: (i) The target-property values for subgroup members should be smaller or larger (depending on the target) than a certain value (a cutoff), and (ii) the target-property values are minimized or maximized within the cutoff. The latter condition gives preference to subgroups with smaller or larger target-property values among similarly sized subgroups within the cutoff. The value of the cutoff is a parameter. As it approaches the optimal value of an activation indicator among all data points, additional or alternative materials genes and their combinations leading to stronger activation are identified. We explore the whole range of the parameter for each target property (for OCO-angle—123°, 124°, 126°, 128°, 130°, and 132°; for l(C–O)—1.26 Å, 1.28 Å, and 1.30 Å).

In addition to these criteria, we consider the requirement that adsorption energies are not too strong and not too weak for most of the samples in a subgroup. Strong activation (i.e., strong weakening of the C–O bonds) can be achieved by strong binding to the surface. It is well known that good catalytic performance requires a balanced adsorption strength. This is known as Sabatier principle. In addition to the practical value of identifying subgroups that satisfy this principle, comparison of subgroup selectors obtained with and without this requirement helps to identify combinations of materials features that promote desired changes in target properties and at the same time yield intermediate adsorption energies.

Sabatier principle is reflected by a characteristic volcano-type behavior of catalytic activity as a function of adsorption energy of reactants and intermediates. The position of the top of the volcano depends on particular reactions and conditions. It can be estimated from condition |ΔG| ~ 0, where ΔG is the Gibbs free energy of adsorption. For CO2 adsorption at room temperature and partial CO2 pressure of 1 atm this condition corresponds to about −0.5 eV adsorption energy40. At temperatures around 450 °C (typical conditions for CO2 methanation41) ΔG = 0 corresponds to adsorption energy −1.7 eV41. Therefore, for catalytic conversion at low or moderate temperatures this implies that CO2 adsorption energies should be in the range from between −2.0 and −0.5 eV.

These requirements are implemented in the following quality functions that are maximized during the search for subgroups. In particular, for OCO-angle minimization we use:

$$F(Z)={\theta }_{{{{{{\mathrm{cut}}}}}}}\left[\frac{s(Z)}{s(Y)}\cdot \left(\frac{{{\max }}(Z)-{\alpha }_{g}}{{{\min }}(Y)-{\alpha }_{g}}\right)\cdot u(p)\right]$$
(1)

and for C–O bond maximization the following quality function was applied:

$$F(Z)={\theta }_{{{{{{\mathrm{cut}}}}}}}\left[\frac{s(Z)}{s(Y)}\cdot \left(\frac{{{\min }}(Z)-{l}_{g}}{{{\max }}(Y)-{l}_{g}}\right)\cdot u(p)\right]$$
(2)

where Y is the whole dataset, Z—a subgroup, s—size (number of data points), min and max – minimal or maximal value of the target property, αg and lg are the gas-phase values of OCO-angle and C–O bond distance, 180° and 1.17 Å, respectively, and θcut is the Heaviside step function which is equal 1 if all data points in the subgroup satisfy the cutoff condition and 0 otherwise. Thus, larger values of the quality function F(Z) are obtained for those subgroups in which minimal (maximal) value of a target property is close to the maximal (minimal) value of the whole sampling with respect to the gas-phase value of CO2 molecule. The use of maximum/minimum instead of a median is done to ensure that a target property is optimal for as many members of a subgroup as possible. The gas-phase reference values are usually significantly different from the “chemisorption” subset. Therefore, the term in squared brackets in Eqs. (1) and (2) can noticeably contribute only when the sizes of candidate subgroups are similar.

The term u(p) in Eqs. (1) and (2) is added in order to account for Sabatier principle in SGD framework. We have implemented a multitask quality function, where a factor u(p) increases the quality of subgroups with adsorption energies falling within this range. This is formulated in terms of the information gain34, i.e., reduction of the normalized Shannon entropy. We perform the SGD for each target property both explicitly accounting for the Sabatier principle and without it. The latter case is equal to u(p) = 1 in Eqs. (1) and (2)34.

We note that SGD is qualitatively different from machine-learning classification/regression techniques such as neural networks, kernel regression methods, or decision-tree regression (DTR42) (e.g., random forest). SGD is typically referred to as a supervised descriptive rule-induction technique43, i.e., it uses the labels assigned to the data points (the values of the target property) in order to identify patterns in the data distribution (the statistically exceptional data groups) and the rules defining them (the selectors), by optimizing a quality function which is a function of the distribution of values of the target property43. While there are apparent similarities between SGD and DTR as both methods yield models in terms of physically interpretable selectors (usually, inequalities) on a selected subset of the input features, the analogy stops at this level, as SGD focuses at (and only at) subgroups from the very beginning and says nothing about the data that are not in the subgroup. In contrast, DTR determines a global partitioning of the input space by minimizing a global quality function, i.e., the quality of a single subset is secondary with respect to the resulting quality of all subsets partitioning the whole dataset. In other words, for finding distinct combinations of materials genes driving desirable changes in a particular target property (possibly different combinations leading to the same result), the SGD approach has significantly higher flexibility and reliability. This is demonstrated below for a DTR analysis for our target properties.

The metadata and workflow for the AI analysis are documented in the Jupyter notebook26.

Results of the subgroup discovery

The SGD for OCO-angles was done with Eq. (1) for the quality function, and OCO as a target property, since smaller angles indicate larger charge transferred to the molecular π* orbital. The subgroup selectors obtained with different OCO-angle cutoffs (126°, 128°, 130°, and 132°) with or without the adsorption energy constraint are listed in Table 2 (for more details see the Supplementary Table 4). Analysis of these subgroups reveals that the angle reduction is determined by an interplay of several factors: an electron transfer from the cations to surface O-atoms, delocalization of electron density between cations and O-atoms, and coordination of the surface O-atoms. Without the Sabatier principle constraint, the OCO-angle reduction below 132° is mainly due to the electron accumulation at the O-atom of the clean surface. This is expressed by the conditions of more negative Hirshfeld charge on O-atoms (qO < …), not very low IP of at least one cation (IPmax > …), and increased polarizability of the surface O-atom on which CO2 is adsorbed (C6O > ...). Upon adsorption of CO2, this charge on the surface O-atom is readily available for transfer to CO2. When the Sabatier principle constraint is introduced, the OCO < 132° subgroup also includes sites with a pronounced electron transfer to CO2, but with a lower-energy O 2p-band maximum (M < ...) with respect to vacuum level, and a larger kurtosis (kurt > ...). These conditions imply reduced inter-electronic repulsion around the surface O-atom achieved by partial delocalization of the charge density.

Table 2 Top subgroups and their selectors obtained by minimization of OCO-angle and maximization of l(C–O) with/out Sabatier principle (energies are in eV, distances are in Å, charges are in units of absolute electron charge, polarizabilities are in Bohr3).

At lower OCO cutoffs, the subgroup selectors include coordination descriptors Qi, i = 5, 6. Without Sabatier principle, sites with larger Qi are selected, and vice versa. Larger Qi indicates lower coordination of the O-atom. This reduces electron repulsion and therefore facilitates electron transfer to the O-atom of the clean surface. However, this also increases the bonding strength of CO2 to the surface. This explains why selectors of subgroups obtained with Sabatier principle include the opposite conditions (Q5 < ...).

Other surface features describing electron distribution are related to Madelung potential: electrostatic potential and field (φ1.4, φ2.6, and Δφ = φ1.4 − φ2.6) and distances between the O-atom and surface cations. More open surface structure with larger distances between cations at the O site facilitates charge transfer to adsorbed CO2 molecule, since the Madelung potential from the nearby cations is reduced. This is reflected in the appearance of propositions involving features d1, d2, and d3. For example, for the OCO ≤ 130° subgroups, imposing energy constraint changes proposition (d1 > ...) to (d1 < ...), which implies an increased energy cost for transferring electrons to CO2. Larger electric fields Δφ around the adsorption site imply stronger localization of electron density on O-atoms, and thus also improve the efficiency of charge transfer to the adsorbed molecule.

The smaller OCO subgroups with Sabatier principle also include propositions implying increased polarizability of both cations (C6min > ...). Another support-defining condition is that the radius of the lowest unoccupied orbital for the metal atoms should not be small (r+1 ≥ ...). This requirement is true for most cations with negative electron affinities (Supplementary Fig. 4). Analysis of adsorbed CO2 structures and Hirshfeld charges reveals that this condition together with the higher polarizability of cations at the pristine surface encompasses two scenarios: (i) additional electron transfer to CO2 upon adsorption and (ii) stronger binding between O-atoms in CO2 and surface cations. When scenario (ii) dominates, CO3δ− anion lies nearly horizontally at the surface, and is bound with nearby cations by chemical bonds via its oxygen atoms. Such a structure leads to small OCO-angles in CO3δ− (around 120°), even if charge transfer is limited. Thus, increased bending of adsorbed CO2 occurs due to charge transfer over larger distances and/or distortion of the adsorbed molecule and the surface, both leading to weaker adsorption. The cases where both scenarios are active include the same sites as in the subgroups with elongated l(C–O), as described below.

In order to obtain the subgroups of adsorption sites with larger l(C–O), we performed the SGD with the quality function Eq. (2) and l(C–O) as target property. The results for l(C–O) cutoffs 1.26, 1.28, and 1.30 Å are summarized in Table 2 and Supplementary Table 5. In contrast to OCO, the analysis of the obtained top subgroups shows a much less pronounced or no effect of imposing Sabatier principle on the distribution of adsorption energies within the subgroups. This is because sites with too strong adsorption are excluded based on l(C–O) threshold alone, without the need to introduce the energy constraint. For example, the range of l(C–O) for the top l(C–O) > 1.26 Å subgroup without constraining adsorption energies is the same as for the top OCO < 130° subgroup, but it contains significantly more sites with intermediate adsorption energies.

Electron transfer to an adsorbed CO2 molecule increases both the OCO bending and C–O bond elongation. The main difference between OCO and l(C–O) subgroups is that in the latter an additional mechanism of increasing l(C–O) is in effect, namely a covalent bonding between one O-atom of the CO2 molecule and the nearest surface cation. This can be concluded from the analysis of adsorption geometries, and correlates with the presence of proposition (EAmax ≤ 0.005 eV), selecting cation species that can accept electron density, e.g., from an O-atom in adsorbed CO2 molecule. Other proposition that appears in most selectors of top subgroups is (d2 > 2.14 Å) or (d2 > 2.22 Å)—larger distances to the second nearest cation from an O-atom. Larger elongation of the C–O bond is achieved by the asymmetry of the cation types at the surface, where one can bind an O-atom of the adsorbed CO2, while the other (located further away) cannot. An example asymmetric CO2 adsorption structure is shown in Supplementary Fig. 5.

Other propositions indicate a moderate charge transfer to adsorbed CO2 molecule as in the case of OCO subgroups with adsorption energy constraint. Propositions (M ≥ −8.05 eV), (PC ≥ −9.32 eV) in l(C–O) < 1.26 Å subgroups imply enhanced charge density on the surface O-atoms, since electron–electron repulsion raises energies of O 2p-band states. However, at larger l(C–O) cutoffs the electron transfer is balanced by such propositions as (M ≤ −5.19 eV), (U ≤ −4.92 eV), and (W ≥ 5.10 eV) indicating limited electron transfer. These propositions point to more covalent bonding between cations and surface O-atom. Rather persistent proposition observed in many selectors of l(C–O) subgroups is the limit of minimal charge on surface cations (qmin < 0.48e). It also shows the limitation of the charge transfer from one type of cations to surface oxygen atoms.

In general, we find that subgroups obtained with smaller cutoffs do not have a strong overlap with subgroups with larger cutoffs for OCO. In particular, for subgroups with close cutoffs the overlap can be smaller than 50% of the smaller subgroup (but is never below 30%). Interestingly, for l(C–O) the situation is opposite: subgroups with tighter cutoffs are mostly contained in the subgroups for more relaxed constraints. This means that, while larger values of l(C–O) are mainly controlled by the same or additional genes, smaller values of OCO are due to alternative genes. The overlap of OCO subgroups becomes even smaller when Sabatier principle is included, confirming the absence of a universal mechanism for OCO-angle reduction that is compatible with moderate adsorption energy.

In summary, we find that, while an increased electron density at the O adsorption site is necessary for chemisorption and leads to both OCO bending and C–O bond elongation in an adsorbed CO2 molecule, there are additional actuators for these effects that are different for different target properties. The OCO-angle is in general minimized by increasing electron transfer to the O site. However, this also leads to strong adsorption for many materials (Fig. 2). To satisfy Sabatier principle, the electron transfer to CO2 must be moderate. This is achieved by delocalization of charge density around O sites and/or by distortion of the adsorbed molecule due to the formation of covalent bonds between O-atoms in CO2 and surface cations. The largest C–O bond elongations are achieved when both charge transfer to adsorbed CO2 and the covalent interaction are present, and local geometry around surface O-atom provides the asymmetry in adsorption structure. This mechanism automatically fulfills the Sabatier principle.

Fig. 2: Distribution of adsorption energies (left) and OCO-angles (right).
figure 2

The distribution is shown for the whole dataset (black), for the top subgroups of sites with OCO < 132° angles (blue) and l(C–O) > 1.30 Å (green). The subgroups obtained with adsorption energy constraint are marked with “c.” and shown with dashed lines. The adsorption energy Eads is defined as the difference between the total energy of the slab with adsorbed CO2 and the sum of total energies of the clean slab and an isolated CO2 molecule.

The subgroups found by SGD for the dipole moment induced by CO2 adsorption, its total Hirshfeld charge, and the difference of charges on C and O-atoms significantly overlap with the subgroup of smaller OCO-angles. The subgroup found by maximizing the difference of Hirshfeld charges on O-atoms of an adsorbed CO2 largely overlaps with the subgroup of sites delivering larger l(C–O). In general, these indicators are not better than OCO or l(C–O). Therefore, below we focus on OCO-angle and l(C–O) as indicators of CO2 activation. More details about the other indicators can be found in Supplementary Discussion.

Comparison with experimental results

To address the question which of the discussed properties can serve as an indicator of the catalytic activity, we compare our predictions to reported experimental results (Table 3). It should be stressed that the available experimental data are scarce, and results are difficult to compare quantitatively. We consider thermally and, for completeness, some photo-driven catalysis and thus also include supported metal catalysts with the considered oxides as support. Despite possibly different mechanisms for CO2 conversion in the different types of catalysis, we believe that the properties of adsorbed CO2 molecule can still serve as indicators of catalytic activity. Thus, it is possible that under such a daunting situation a reliable indicator of CO2 activation can still be identified. As described below, our analysis confirms this hope.

Table 3 The catalytic performance of materials which contain the sites from larger l(C–O)) or/and smaller OCO subgroups.

First, we consider materials with the sites from subgroups obtained by minimization of OCO-angle without Sabatier principle constraint27. For quite many materials from these subgroups, independent of the cutoff value, there are no reports of successful CO2 conversion, even when they are used as supports for metal nanoparticles (Table 3). This is explained by the fact that absolute adsorption energies for these materials are above 2 eV (Fig. 2 left, Supplementary Table 4), indicating that their surfaces will be permanently poisoned by carbonate species at low or intermediate temperatures. This means that on materials with these sites hardly any reaction of CO2 conversion can proceed at low, especially room temperature. Moreover, as shown in Table 3, even at increased temperatures, 700–750 °C, the activity of these materials is low. Some of them have been considered as candidates for carbon capture and storage (CaO, SrO, BaO, and Na2O)44, which implies the formation of stable carbonates rather than CO2 transformation. Thus, we conclude that OCO-angle alone is not a good indicator of enhanced catalytic activity in CO2 conversion.

On the other hand, several of the materials with sites from l(C–O) > 1.30 Å subgroups (independent on either with or without Sabatier principle constraint) are known as good materials for CO2 conversion (Table 3) in different reactions proceeding at room or higher temperatures. For these sites, the absolute adsorption energies already satisfy the Sabatier principle (Fig. 2, left), as discussed above. We note that, contrary to what one may expect, there is no correlation between the adsorption energy and the value of l(C–O) (see Supplementary Fig. 5). Although there is a general trend, there are also significant variations in l(C–O) for given adsorption energy.

Interestingly, some of the materials with sites in the l(C–O) > 1.30 Å subgroups were studied as supports for metallic nanoparticles. For instance, Ni/LaAlO3 is a catalyst for dry reforming of methane45 at 700 °C. It was shown that its catalytic performance is higher in terms of CO2 and CH4 conversion rates compared to Ni/La2O3 and Ni/Al2O345. All sites on considered lanthanum (III) oxide surfaces belong to the subgroup of OCO < 132° without Sabatier constraint, whereas the sites on Al2O3 do not enter any of the two subgroups. KNbO3 has been studied only with Pt nanoparticles and as a composite with g-C3N4 in photocatalytic reduction of CO2 into CH446,47. Pt-KNbO3 is ~2.5 times more photoactive than Pt-NaNbO346, whereas the NaNbO3 is known to be photoactive even without nanoparticles48. This seems to suggest that l(C–O) is a good indicator of CO2 activation for both unsupported and supported catalysts even at increased temperatures. Hence, the other materials with the sites from this subgroup are promising new candidates for this task. The most promising materials identified in this work are CsNbO3, CsVO3, RbVO3, LaScO3, RbNbO3, and NaSbO3 as they have the sites from the larger l(C–O) subgroups satisfying the above-mentioned criteria.

There is also a set of materials [ternaries A2+B4+O3 (A = Ca, Sr, Ba, B = Zr, Ti, Ge, Sn, Si) with a perovskite structure] containing both the surfaces with sites from the smaller OCO subgroups without Sabatier constraint and the surfaces with sites from the larger l(C–O) subgroups (Table 3). These two types of sites are located on different surfaces. Thus, based on the above results, a material for which a surface with sites from the l(C–O) > 1.30 Å subgroups has lower formation energy and is more abundant than the surface with sites from smaller OCO subgroups without Sabatier constraint is expected to be a good catalyst. To explore this possibility, we analyze the surfaces of these materials in more detail. Their most stable surfaces are AO-terminated (001) facets containing sites from the smaller OCO subgroup. The formation energies of ABO3-terminated (110) surfaces with larger l(C–O) sites are higher: for BaZrO3, SrZrO3, CaZrO3, and SrTiO3 the differences in formation energies are 0.049, 0.027, 0.013, and 0.037 eV/Å2, respectively. The zirconates and SrTiO3 were found to catalyze the water gas-shift reaction under increased temperatures, 700–1100 °C49. At room temperature the photocatalytic activity of SrTiO3 was found to be significantly decreased50. We attribute the latter finding to the strong carbonation of its most stable surface, which is consistent with the calculated high absolute value of CO2 adsorption energy (−2.4 eV) for this surface. Thus, the activity of SrTiO3 at 700 °C and higher temperatures is consistent with the estimates of the CO2 chemical potential given above. The difference in formation energies of the most stable CaO-terminated (001) surface and the stoichiometric (110) surface for CaTiO3 is less pronounced compared to zirconates and other titanates (CaO-terminated (001) is more stable than the (110) surface by only 0.009 eV/Å2). Thus, the (110) facets, which contain sites from the long l(C–O) subgroup, may be present on catalyst particles at the reaction conditions. This can explain the observed activity of CaTiO3 in CO2 conversion not only at high but also at room temperature. We note that the activity of this material was also attributed to the presence of TiO2 nanoparticles on the surface51 at reaction conditions.

The OCO subgroup that includes most of the known good catalysts and a minimal number of inactive materials is OCO < 132° with Sabatier principle. It contains the sites on discussed above LaAlO3, KNbO3, and NaNbO3 catalysts, but also on non-active YInO3 according to ref. 52 (Table 3). This subgroup contains in addition the sites on a well-known CO2 conversion catalyst Ga2O3. We should mention that the catalytic activity of Ga2O3 has been attributed to its reducibility. According to Pan and coworkers53 CO2 molecules are activated via dissociation on surface O-vacancies. However, in ref. 54 only one Ga2O3 (100) surface was considered for which no energetically stable CO2 chemisorption structures were obtained with the PBE functional. We show in Supplementary Table 1 and Supplementary Fig. 1 that this functional underestimates CO2 adsorption energies. Moreover, in our study we considered also other surfaces and found stable CO2 chemisorption structures on these surfaces. Thus, activation of CO2 on Ga2O3 can indeed proceed on O-atoms as discussed in our study, even without surface O-vacancies. The subgroups with small OCO cutoffs, 123° and 124°, do not contain any sites on known active or non-active catalysts.

OCO < 132° subgroup with Sabatier principle contains a large number of sites with elongated C–O bonds. The overlap of this subgroup with l(C–O) > 1.30 Å subgroups is 19 samples (70% of the latter).

To demonstrate the advantages of SGD over DTR in finding materials genes and their optimal combinations, we have done a comparison of found SGD subgroups with DTR performance for l(C–O). DTR terminal nodes (leaves) with the largest average l(C–O) (Supplementary Figs. 2 and 3) include surface sites on materials prone to extremely strong carbonation (Table 2), and also sites at which CO2 prefers to physisorb, with l(C–O) = 1.17 Å. Also, one cannot check the effect of imposing the constraint as there is no standard way to mix regression and classification in DTR. Thus, DTR in contrast to SGD is not able to separate different activation modes and even fails sometimes in distinguishing activation from non-activation.

Best materials for CO2 reduction among calculated ones

Now those good indicators of activation are identified (OCO with Sabatier principle and l(C–O)), all calculated materials can be ranked according to the value of these indicators (smaller OCO or larger l(C–O) indicate C–O bond weakening and therefore higher catalytic activity, provided adsorption energy is moderate). The resulting list of the most promising catalysts for CO2 conversion is presented in Table 4. Each surface is characterized by maximum l(C–O) and minimum OCO among all inequivalent sites on that surface. The materials with l(C–O) > 1.30 Å are listed in the order of decreasing l(C–O). Materials with OCO < 132° but l(C–O) < 1.30 Å are appended at the bottom of the list in the order of increasing OCO.

Table 4 Best materials and surface cuts for CO2 activation according to the l(C–O) and OCO indicators.

Materials and surface cuts higher up in the list in Table 4 that belong to both l(C–O) > 1.30 Å and OCO < 132° subgroups are the most promising catalysts, followed by materials that belong to one of the subgroups, with the performance decreasing further down the list. Taking into account the number of active surface cuts and Sabatier principle, we conclude that NaSbO3 is the most promising unexplored catalyst for temperatures up to 340 °C (for CO2 pressures around 1 atm). Other A+1B+5O3 type promising materials are KSbO3 (for temperatures up to 110 °C) and RbNbO3 (up to 360 °C) that belong to both subgroups, and LiSbO3 (230 °C), CsNbO3 (260 °C), CsVO3 (110 °C), NaVO3 (130 °C), belonging to one of the subgroups (listed in the order of decreasing performance). There are also several promising A+3B+3O3 oxides with surfaces belonging to one or both subgroups, listed in the order they appear first time in the table: ScAlO3 (up to 550 °C), GaAlO3 (230 °C), GaInO3 (340 °C), rhombohedral InAlO3 (120 °C)—these and other In-containing materials are of course very expensive, but we list them here for completeness, LaGaO3 (210 °C), ScGaO3 (240 °C), YAlO3 (330 °C).

From Table 4 it can be seen that not all promising materials belong to one of the found subgroups. This means that there are other optimal materials gene combinations that are not identified by SGD as statistically significant based on the current dataset. Such combinations may be unique for a given material, or they may be found when more data for different materials are considered. Among these materials the most promising are: InScO3 (up to 430 °C), MgSnO3 (430 °C), CaGeO3 (570 °C), orthorhombic InAlO3 (230 °C), CaSiO3 (420 °C), SrSiO3 (460 °C), SrGeO3 (480 °C), and BaSnO3 (up to 550 °C).

Discussion

We have developed the subgroup-discovery strategy for finding improved oxide-based catalysts for the conversion of chemically inert molecules such as CO2 into useful chemicals or fuels. For this purpose we identified a new indicator of CO2 activation, namely the large C–O bond distance of the adsorbed molecule. This artificial-intelligence approach identifies the materials genes that correlate most strongly with the activation of the adsorbed molecule. Specifically, these are the following clean surface properties: Hirshfeld charges of O-atom at which CO2 adsorbs (qO) and of surface cations (qmin, qmax), surface geometric features [coordination descriptors Qi, i = 5, 6, distances between the surface O-atom and the nearest surface cations (di, i = 1–3)], electrostatic potential and electric field above the adsorption site (Δφ, φ2.6), polarizability and C6 coefficients for surface atoms (C6min, C6O, αmax), radii of HOMO and LUMO of the cation species (r+1max, r+1min, rHOMOmin), ionization potential, electron affinity, and electronegativity of surface cation species (IPmax, EAmax, ENmin), features of O 2p DOS (kurt, M, PC, U), conduction band minimum (CBM), energies of the lowest unoccupied projected eigenstates of surface cation species (Lmax, Lmin), and surface work function (W). The found subgroup selectors predict whether a given candidate material belongs to the class of promising catalysts. The peculiarity of the large C–O bond indicator is that it automatically satisfies Sabatier principle for low and middle-temperature CO2 conversion.

The present study shows also that the previously proposed indicator for CO2 activation, the decrease of the OCO-angle27, is not appropriate and even correlates with strong adsorption so that poisoning by carbonation is likely which may be useful for carbon capture and storage (CCS) but not for carbon capture and utilization (CCU). When Sabatier principle is purposely included in the SGD search for small OCO, found subgroups substantially overlap with large l(C–O) subgroups (70%), although still contain a few sites on inactive materials for CO2 conversion.

The subgroup analysis revealed an alternative mechanism of CO2 activation by adsorption, namely bonding of an O-atom in CO2 with a surface cation(s), combined with only moderate electron transfer from the surface to the molecule, which results not only in reduction of OCO-angles, but also in pronounced elongation and weakening of the C–O bond. Although the latter can be achieved also by a larger charge transfer, it results in stronger binding of CO2 molecule to the surface and poisoning of the catalyst, contrary to the new mechanism. The same new mechanism is revealed when Sabatier principle is included when searching for small OCO subgroups.

We also demonstrated that a standard regression technique (DTR), which gives prediction models in a physically interpretable form similar to subgroup discovery (selectors based on identified descriptor), fails to identify the optimal combinations of materials genes and the activation in general. This failure is traced back to the fact that DTR is a global approach, which minimizes error in the prediction of the value of a target property for the whole dataset. As a result, different combinations of genes leading to the optimal value of the same target property are intermixed, and the combination that leads to the most optimal value is not identified. On the contrary, subgroup discovery finds unique local subsets in the data independent of the rest of the data. This makes it more suitable for identifying different combinations of materials genes that result in activation.

The other four considered potential indicators (charge at the adsorbed CO2, adsorption induced dipole moment, the difference of charges on O-atoms and on C and O-atoms of adsorbed CO2) were found to reproduce the results of SGD obtained for OCO-angles or C–O bond distances with significant overlap with corresponding subgroups.

Based on our results, we propose several new promising oxide-based catalysts for CO2 conversion (Table 4). Although the present work has focused on oxides only, the overall strategy is general and can be applied to any other family of materials. This work also emphasizes the importance of documenting metadata and workflows for AI data analysis in materials science in order to ensure the reproducibility of AI models and data analysis results.

Methods

Ab initio calculations

The calculations are performed using density-functional theory (DFT) with the PBEsol exchange-correlation functional55 as implemented in FHI-aims code56 using ‘tight’ basis sets. The functional is chosen based on a comparison of calculated bulk lattice constants55 and CO2 adsorption energy to the available experimental results and high-level calculations (CCSD(T) and validated hybrid); see Supporting Information (SI) for more details on the computational setup. Nevertheless, it is expected that, because of the large set of systems inspected and the small variations introduced by the functional choice, the main trends will hold even when using another functional.

Studied materials

The dataset includes 71 semiconductor oxide materials, with 141 surfaces. The materials are ternary (ABO3) and binary oxides with metal cations A and B from groups 1–5 (including La) and groups 12–15 of the periodic table. The full list of materials and surface cuts is given in Supplementary Notes, and the dataset is available in ref. 26. In this study we considered only stoichiometric surface reconstructions obtained by atomic relaxation of stoichiometric bulk-like initial surface geometries. While this seems to be a limitation, our results show that indicators of activation calculated with this assumption correlate with experimental activity for known good oxide catalysts. This does not imply that surfaces of these materials do not reconstruct, but that the properties of unreconstructed surfaces can be used as descriptors for catalysis at reconstructed and defected surfaces under realistic conditions. The inclusion of surface reconstructions in the training data will further improve the predictions and will be a subject of future work.

The details of SGD

The SGD was done with the RealKD code (https://bitbucket.org/realKD/), modified to include quality functions described by Eqs. (1) and (2) in which the information gain was defined as:

$$u(p)=1-\left(\frac{-1}{{{{{{\rm{ln2}}}}}}}\right)(p\cdot \,{{{{\mathrm{ln}}}}}(p)+(1-p)\cdot \,{{{{\mathrm{ln}}}}}(1-p))$$
(3)

here p is the number of samples in a subgroup within the required adsorption energy range divided by the total number of samples in the subgroup. Since Shannon entropy is a symmetric parabola-like function around 0.5, we set here F(Z) = 0 for p ≤ 0.5. Also, ln(x) = 0 for x = 0. The search of subgroups is performed using a Monte-Carlo scheme adapted for these tasks34.

The cutoff values x, y, ... used for setting propositions (feature-1 < x, feature-2 ≥ y, etc.) are obtained by k-means clustering, as implemented within RealKD. That is, for a desired number n = k − 1 of cutoff values a set of k representative values of a given feature and k groups (clusters) of the data points are determined that minimize the deviation of all the feature values from the representative values. Thus, each value of the feature in the dataset is assigned to a particular cluster, and the cutoffs are determined as the arithmetic mean between the closest feature values in neighboring clusters. The number k is a parameter, and different k-values can in principle result in different cutoff values. It is worth noting that, due to the stochastic Monte-Carlo sampling, the exact definitions of the subgroups may vary for consecutive runs of the SGD algorithm. We have tested k = 12, 14, and 16 and rerun the algorithm several times for each k. While the results indeed depend on the run and on the k value, the subgroups maximizing the quality function have largely or entirely overlapping populations, and selectors with the same or similar propositions. Here we report selectors that appear most often and have high population and quality function values.

Decision-tree regression

The DTR analysis was performed using Python scikit-learn libraries. DTR is a supervised learning method in which the training set is repeatedly split into patterns (so-called leaves) by means of propositions built from primary features. The fitting of a model is done with respect to the cost function, which encloses the deviation of fitted values of a target property from the actual values. In this study we considered two cost functions—mean squared error (MSE) and mean absolute error (MAE). The search for the most optimal partitioning (the so-called tree) is done with the greedy algorithm. To obtain the most optimal TR model, we used a standard approach for supervised machine learning—leave-one-out cross-validation with respect to the hyperparameters—minimal size of a leaf, maximal depth. The minimal size of a leaf is a bottom threshold of the population of a pattern, since too small size might result in overfitting. Maximal depth is a limit for the maximal number of splits in a tree.