The problem

Transition metal complexes (TMCs) play a key role in many applications, including homogeneous catalysis, medicinal chemistry, and the conversion and storage of renewable energy. The discovery of TMCs with optimal properties for specific applications is challenging owing to the vast size of their chemical space, which makes systematic screening with computational methods unfeasible1. Machine learning and evolutionary learning methods tackle this problem by leveraging data to explore such spaces more efficiently2. However, the application of data-driven methods to the discovery of TMCs is still hindered by a lack of large and diverse ligand libraries, which are needed to span chemical spaces of interest.

The solution

Our approach is based on two key building blocks: the construction of an extensive ligand dataset and the development of a multiobjective genetic algorithm (MOGA)3.

First, we built the ligand dataset tmQMg-L by extracting ligands from roughly 60,000 TMCs that had previously been retrieved from the Cambridge Structural Database4. tmQMg-L contains about 30,000 diverse and synthesizable ligands, together with their molecular structures, metal-coordinating atom indices and formal charges. For palladium(II) in a square planar coordination geometry with monodentate ligands and charge constraints, chemical spaces of 1 billion unique TMCs can be obtained with as few as 252 different ligands.

Second, we developed a MOGA to optimize TMCs within these spaces with respect to their polarizability and the gap between the highest occupied molecular orbital and the lowest unoccupied molecular orbital (HOMO–LUMO gap; Fig. 1). The MOGA uses uniform crossovers as well as substitution and swapping mutations at the full-ligand level to produce new offspring (Fig. 1b). Parent and survivor selection are performed by ranking individual TMCs according to the non-dominated fronts of the current population. Furthermore, we developed a masking function that excludes individuals with fitness values below the median of the current population. By independently applying this mask to each target property, the search can be guided towards specific regions of the Pareto front (Fig. 1a), and hence we named this approach Pareto Lighthouse MOGA (PL-MOGA).
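The ranking and masking steps described above can be illustrated with a minimal sketch. It assumes that both objectives are maximized, that a masked individual is simply removed from the candidate list, and that the example fitness values are invented; the function names (dominates, non_dominated_fronts, median_mask) are hypothetical and are not taken from the PL-MOGA code.

```python
from statistics import median


def dominates(a, b):
    """True if a is no worse than b in every objective and strictly
    better in at least one (assumption: all objectives are maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def non_dominated_fronts(points):
    """Group individuals into successive non-dominated fronts.

    Returns a list of index lists: front 0 is the Pareto front of the
    population, front 1 is the Pareto front once front 0 is removed, etc.
    """
    remaining = set(range(len(points)))
    fronts = []
    while remaining:
        front = sorted(
            i for i in remaining
            if not any(dominates(points[j], points[i]) for j in remaining if j != i)
        )
        fronts.append(front)
        remaining -= set(front)
    return fronts


def median_mask(points, masked_objectives):
    """Indices of individuals that pass the mask.

    For every objective index in masked_objectives, individuals whose value
    lies below the population median of that objective are excluded.
    (Assumption: exclusion means removal from the candidate list.)
    """
    keep = set(range(len(points)))
    for k in masked_objectives:
        cutoff = median(p[k] for p in points)
        keep = {i for i in keep if points[i][k] >= cutoff}
    return sorted(keep)


# Invented (objective 1, objective 2) values for six individuals.
population = [(1.0, 5.0), (2.0, 4.0), (3.0, 3.0), (4.0, 1.0), (2.0, 2.0), (1.0, 1.0)]
print(non_dominated_fronts(population))
print(median_mask(population, masked_objectives=[0]))
```

In this sketch, masking only the first objective keeps the individuals that score at or above the median in that objective, which steers selection towards one end of the Pareto front; masking both objectives narrows the search to its central region.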

Fig. 1: The PL-MOGA approach.

a, Through use of a fitness masking function to control the aim and scope of the optimization, PL-MOGA enables the directional evolution of TMCs towards specific regions of the Pareto front (where y1 and y2 denote two different optimization objectives). b, PL-MOGA uses full-ligand genetic operations to generate new offspring TMCs in each generation. The crossover is applied before mutation. The chromosome reflects the coordination geometry being considered. M, metal center. © 2024, Kneiding, H. et al.
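The full-ligand genetic operations in panel b can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: a chromosome is taken to be a tuple of four ligand identifiers (one per square planar coordination site), the ligand names are placeholders, the mutation probability p_mutate and the equal choice between the two mutation types are invented parameters, and charge constraints are ignored.

```python
import random


def uniform_crossover(parent_a, parent_b, rng):
    """Pick each coordination site's ligand from either parent with equal probability."""
    return tuple(a if rng.random() < 0.5 else b for a, b in zip(parent_a, parent_b))


def substitution_mutation(chromosome, library, rng):
    """Replace the ligand at one random site with a random ligand from the library."""
    site = rng.randrange(len(chromosome))
    child = list(chromosome)
    child[site] = rng.choice(library)
    return tuple(child)


def swap_mutation(chromosome, rng):
    """Exchange the ligands at two distinct random sites."""
    i, j = rng.sample(range(len(chromosome)), 2)
    child = list(chromosome)
    child[i], child[j] = child[j], child[i]
    return tuple(child)


def make_offspring(parent_a, parent_b, library, rng, p_mutate=0.5):
    """Apply crossover first, then (with probability p_mutate) one mutation."""
    child = uniform_crossover(parent_a, parent_b, rng)
    if rng.random() < p_mutate:
        if rng.random() < 0.5:
            child = substitution_mutation(child, library, rng)
        else:
            child = swap_mutation(child, rng)
    return child


# Placeholder ligand identifiers; a real library would hold tmQMg-L entries.
library = ["lig_A", "lig_B", "lig_C", "lig_D", "lig_E", "lig_F"]
rng = random.Random(0)
parent_a = ("lig_A", "lig_B", "lig_C", "lig_D")
parent_b = ("lig_E", "lig_F", "lig_A", "lig_B")
print(make_offspring(parent_a, parent_b, library, rng))
```

Because every operation acts on whole ligands, each offspring is assembled only from ligands that already exist in the library.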

For a chemical space of 1.37 million square planar palladium(II) TMCs, we calculated the ground truth to benchmark the PL-MOGA approach, which showed that our method finds solutions at the Pareto front using only a small number of fitness evaluations. The TMCs evolved at the semiempirical level (GFN2-xTB), and their properties, were in reasonable agreement with those obtained using density functional theory. Analysis of the explored ligands showed that the PL-MOGA favored large ligands with strongly coordinating moieties, in line with chemical intuition. Furthermore, the final population of 130 TMCs exhibited high diversity, as measured by the Tanimoto similarity coefficient. The use of the masking function was effective in guiding the evolution towards specific regions of the Pareto front, without requiring previous knowledge of the optimization targets. We subsequently applied PL-MOGA to the optimization of TMCs in chemical spaces each containing about 1 billion compounds, leading to the identification of even more diverse TMCs.
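For readers unfamiliar with the Tanimoto similarity coefficient mentioned above, the following sketch shows how it is computed for two fingerprints, T(A, B) = |A ∩ B| / |A ∪ B|. It assumes that each fingerprint is represented as a set of integer feature indices and that diversity is summarized as the mean pairwise similarity; the fingerprint type and the aggregation used in the study are not stated here, and the example fingerprints are invented.

```python
from itertools import combinations


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' features."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        # Convention assumed here: two empty fingerprints count as identical.
        return 1.0
    return len(a & b) / len(union)


def mean_pairwise_similarity(fingerprints):
    """Average Tanimoto coefficient over all distinct pairs.

    Lower values indicate a more diverse set of structures.
    """
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        raise ValueError("at least two fingerprints are required")
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


# Invented fingerprints: sets of integer feature indices.
fingerprints = [{1, 2, 3}, {2, 3, 4}, {5}]
print(tanimoto(fingerprints[0], fingerprints[1]))
print(mean_pairwise_similarity(fingerprints))
```

A coefficient of 1 means that two fingerprints share all features and 0 means that they share none, so a population with a low mean pairwise similarity is structurally diverse.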

Future directions

The PL-MOGA approach can be used to efficiently optimize TMCs for various properties, and Pareto optimal solutions can be obtained after exploring only a small fraction of the chemical space. The directional nature of evolution can be exploited to control the aim and scope of the optimization to explore specific regions of the Pareto front. Furthermore, in combination with the tmQMg-L dataset, wide classes of TMCs with diverse ligands and different metal coordination geometries can be investigated.

Nevertheless, this study focuses on a single coordination geometry and includes only monodentate ligands. Furthermore, optimization with respect to only two simple and easy-to-obtain targets has been examined. Another limitation is that PL-MOGA currently considers only full-ligand crossovers and mutations and does not consider genetic operations at the atom level, such as functionalization5.

We now plan to augment PL-MOGA with methods from machine learning to increase its explorative capabilities and computational efficiency. In particular, we believe that generative models could be used to build ligand libraries to enable the exploration of novel and diverse chemical spaces. Another research direction is the application to problems with more than two optimization objectives and the optimization of more challenging targets, including reaction energies and barriers in catalytic cycles.

Hannes Kneiding & David Balcells

University of Oslo, Oslo, Norway.

Expert opinion

“While machine learning-based search algorithms are all the rage, more and more studies show that genetic algorithms often are just as good or better. However, as most discovery efforts require the optimization of multiple objectives it is crucial that the search be guided effectively. The Pareto Lighthouse method is an interesting approach that could prove useful for many discovery efforts in chemistry and other fields.” Jan H. Jensen, University of Copenhagen, Copenhagen, Denmark.

Behind the paper

This study is a follow-up to our earlier work in which we investigated the prediction of TMC quantum properties using graph neural networks to accelerate high-throughput screening4. The logical next step was to perform direct optimization of TMCs with respect to properties of interest by leveraging ligand data extracted from our previously built TMC quantum graphs. Although it was straightforward to determine the molecular structure and connecting atom indices of the ligands, obtaining their formal charges required the development of an approach based on natural bond orbital theory.

For the optimization method, we first experimented with reinforcement learning and diffusion models, but finally settled on genetic algorithms to ensure the reliable exploration of out-of-sample chemical spaces. This choice was also motivated by my previous experience with genetic algorithms during my Master’s degree, when I learned how they can be used for efficient optimization in multimodal landscapes. H.K.

From the editor

“This work by Kneiding et al. stood out to me because it proposes a large dataset of synthesizable ligands, which is used to generate more than a million transition metal complexes. Additionally, a multiobjective genetic algorithm is used to generate transition metal complexes with optimized properties, resulting in the proposal of novel molecules from vast chemical spaces.” Kaitlin McCardle, Senior Editor, Nature Computational Science.