Introduction

Superconductors exhibit zero resistivity and perfect diamagnetism. These traits make them useful for a range of important technologies, including maglev trains, MRI magnets, power transmission lines, and quantum computers. However, a major limitation is that the superconducting transition temperatures (\(T_c\)) of all known superconductors at ambient pressure are well below room temperature, restricting their broader practical application. Consequently, the search for superconductors with higher \(T_c\) is a very active field, as such materials could considerably improve the efficiency of current technologies while also enabling new ones.

Superconductivity in high \(T_c\) superconductors, however, is still not well understood. As a result, there exists no systematic method for finding new high \(T_c\) superconductors1, and the most common approach remains essentially trial-and-error. For instance, Hosono et al.2 surveyed approximately 1000 compounds over four years and found only about \(3\%\) of them to be superconducting. That study is a testament to the extreme inefficiency of finding new high \(T_c\) superconductors through purely manual search.

More recently, computational techniques have been applied to assist researchers in the search for new high \(T_c\) superconductors. In particular, a number of works have applied machine learning to this search. Although valuable tools in many respects, most of these attempts3,4,5 have been limited to classification and regression models, which can only screen existing databases and cannot generate new compounds. Only recently, with deep generative models applied to superconductor discovery, have new hypothetical superconductors not found in most popular compound datasets been generated6,7,8. In Kim and Dordevic6, a Generative Adversarial Network (GAN)9 was applied to unconditional high \(T_c\) superconductor generation. In Wines et al.7, a Crystal Diffusion Variational Autoencoder (CDVAE)10 was likewise applied to unconditional superconductor generation so that crystal structure could be accounted for; however, that work used a different dataset and focused on the distinct task of generating stoichiometric Bardeen–Cooper–Schrieffer (BCS) conventional superconductors11, and so did not generate any superconductors with \(T_c \gtrsim\) 20 K.

These generative approaches to high \(T_c\) superconductor discovery are not without limitations, however. Most notably, although past models have successfully generated new superconductors within existing superconductor families, they have not been able to generate completely new families of superconductors, which would be particularly desirable. This is because they are unconditional models: they learn only the training dataset distribution, and their generation process cannot be controlled. In other words, past models lack conditioning functionality—a method for controlling the generation process that, in this context, means supplying an example superconductor (the reference compound) and having the model generate similar superconductors, ideally by interpolating between the example and what the model has learned from the training dataset. Conditioning opens the possibility of generating new families of superconductors and gives researchers control over the generation process, which is especially useful for those looking to find specific types of superconductors or to expand on their own new discoveries. Parallel to our work, Zhong et al.8 also applied a diffusion model to high \(T_c\) superconductor discovery; however, like the previous GANs, their model lacks support for conditional generation with reference compounds—which is our main focus. Thus, their diffusion model shares with previous models the major limitation of being unable to generate any new families of superconductors; essentially, their work recreated the performance of the GAN in Kim and Dordevic6 with a diffusion model and added only \(T_c\) label control. Once again, we note that, in this work, we consider “conditioning” to mean conditioning the model on reference compounds only, as only this allows for the controlled generation of known and new families of superconductors. Moreover, the GAN in Kim and Dordevic6 also struggled to generate unique (distinct from others in a given generated set) pnictides because of the small number of pnictides in SuperCon, the training dataset.

To resolve these limitations, in this work, we implement a Denoising Diffusion Probabilistic Model (DDPM)12,13 for superconductor generation as our unconditional model and further implement conditioning with the Iterative Latent Variable Refinement (ILVR)14 extension to DDPM, which allows for one-shot generation without additional training. With conditioning, we hope to be able to generate new families of superconductors for the first time, as identified by the clustering analysis proposed in Roter et al.15, by experimenting with feeding the model different reference superconductors—this would mark a leap in the capabilities of computational searches for superconductors.

Diffusion models are a class of deep generative models inspired by nonequilibrium thermodynamics13 that have recently shown superior performance, outperforming GANs in image synthesis16 and materials discovery17. Diffusion models are also at the heart of popular new image generation software, such as DALL\(\cdot\)E 218 and Stable Diffusion19. More recently, these models have shown considerable promise in a variety of scientific applications, such as drug discovery20.

We name this first approach to conditionally generating new superconductors with reference compounds “SuperDiff”. With SuperDiff, we aim to resolve the issues found in past works as a result of the small pnictide training dataset with the unconditional DDPM and, as our main focus, explore how the conditional DDPM can adapt to new information to generate completely new families of superconductors for the first time.

Methods

As stated in the introduction, we leverage the capabilities of Denoising Diffusion Probabilistic Models and Iterative Latent Variable Refinement to propose a method for conditionally generating new hypothetical superconductors. Here, we detail the creation of SuperDiff: the sourcing and processing of superconductor data, a brief overview of the underlying DDPM and ILVR methods, and the techniques we use to evaluate the quality of SuperDiff outputs.

Data processing

All data for the model was sourced from SuperCon21, the largest database of superconducting materials. The dataset was processed following the steps in Kim and Dordevic6 and, as in previous studies4,6,15, only the chemical composition data was used. Every compound from SuperCon was represented as a column vector for input into the model. As shown in Fig. 1, each compound was encoded as a \(96 \times 1\) column vector, as 96 is the maximum atomic number present in the dataset.

Figure 1

The column vector encoding method used. The figure shows the chemical composition of \(\mathrm {HgBa_2Ca_2Cu_3O_{8.27}}\) being encoded as a vector in \(\mathbb {R}^{96}\) which is fed to the diffusion model as \(\textbf{x}_0\).
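To make the encoding concrete, below is a minimal sketch of how a chemical formula can be mapped to a vector in \(\mathbb {R}^{96}\). The simplified parser and the partial ATOMIC_NUMBER lookup are illustrative assumptions, not the exact preprocessing code used.

```python
# Minimal sketch of the composition encoding (illustrative parser only).
import re
import numpy as np

MAX_Z = 96  # highest atomic number present in SuperCon

# Partial lookup table for illustration: element symbol -> atomic number.
ATOMIC_NUMBER = {"H": 1, "O": 8, "Ca": 20, "Cu": 29, "Ba": 56, "Hg": 80}

def encode_composition(formula: str) -> np.ndarray:
    """Encode e.g. 'Hg1Ba2Ca2Cu3O8.27' as a vector in R^96, where entry
    Z-1 holds the amount of the element with atomic number Z."""
    x = np.zeros(MAX_Z)
    for symbol, amount in re.findall(r"([A-Z][a-z]?)([\d.]*)", formula):
        x[ATOMIC_NUMBER[symbol] - 1] = float(amount) if amount else 1.0
    return x

x0 = encode_composition("Hg1Ba2Ca2Cu3O8.27")  # fed to the model as x_0
```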

Denoising diffusion probabilistic model

Figure 2

Overview of the unconditional DDPM used. Compounds are encoded as vectors in \(\mathbb {R}^{96}\); however, for illustration purposes, the vectors are represented as \(16 \times 6\) pixel images in this figure, where each pixel in the image represents an element of the vector, starting from the top-left corner and proceeding horizontally row by row. Whiter pixels represent more positive values (all values are divided by the maximum element of \(\textbf{x}_0\)), and redder pixels represent more negative values (black is zero). Starting from noise \(\textbf{x}_T\), the model generates a compound \(\textbf{x}_0\) by denoising \(\textbf{x}_t\) iteratively. Note that \(\mathrm {YBa_{2}Cu_{3}O_{6.91}}\) was picked from SuperCon for illustration purposes only, and is not a compound generated by SuperDiff.

Denoising Diffusion Probabilistic Models (DDPMs)12,13 function by learning a Markov chain that progressively transforms an isotropic Gaussian into the data distribution. The general structure of the DDPM used is shown in Fig. 2. The DDPM consists of two parts: a forward “diffusion” process that adds noise to data, and a generative reverse process that learns to reverse it—“denoising” the forward process. The forward process is a fixed Markov chain that gradually adds Gaussian noise to the data. Each step in the forward process is defined as

$$\begin{aligned} q(\textbf{x}_t | \textbf{x}_{t-1}) := \mathscr {N}(\textbf{x}_t; \sqrt{1-\beta _t}\textbf{x}_{t-1}, \beta _t\textbf{I})\, , \end{aligned}$$
(1)

where \(\beta _1,\ldots , \beta _T\) is the variance schedule, \(\textbf{I}\) is the identity matrix, and \(\textbf{x}_0\) is dimensionally equivalent to latent variables \(\textbf{x}_1,\ldots , \textbf{x}_T\) (all vectors in \(\mathbb {R}^{96}\)). In this work, we adopt the cosine variance schedule proposed in Nichol and Dhariwal22.
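For reference, a minimal sketch of the cosine schedule follows; the hyperparameters (offset \(s = 0.008\) and the 0.999 clip) are those of Nichol and Dhariwal22, while the variable names are ours.

```python
import torch

def cosine_beta_schedule(T: int = 1000, s: float = 0.008) -> torch.Tensor:
    """Cosine schedule of Nichol & Dhariwal: beta_t = 1 - abar_t / abar_{t-1}."""
    t = torch.arange(T + 1, dtype=torch.float64)
    f = torch.cos(((t / T) + s) / (1 + s) * torch.pi / 2) ** 2
    alpha_bar = f / f[0]
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]
    return betas.clamp(max=0.999).float()  # clip to avoid singularities near t = T

betas = cosine_beta_schedule()             # beta_1, ..., beta_T
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # \bar{alpha}_t
```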

A notable property of the forward process is that, given clean data \(\textbf{x}_0\), noised data \(\textbf{x}_t\) at any time-step can be sampled in closed form:

$$\begin{aligned} q(\textbf{x}_t | \textbf{x}_0) := \mathscr {N}(\textbf{x}_t; \sqrt{\overline{\alpha }_t}\textbf{x}_{0}, (1-\overline{\alpha }_t)\textbf{I})\, , \end{aligned}$$
(2)

where \(\alpha _t:= 1-\beta _t\) and \(\overline{\alpha }_t = \prod _{s=1}^{t} \alpha _s\). This can be reparametrized23 as:

$$\begin{aligned} \textbf{x}_t = \sqrt{\overline{\alpha }_t}\textbf{x}_0 + \sqrt{1 - \overline{\alpha }_t}\varvec{\epsilon }\, , \end{aligned}$$
(3)

where \(\varvec{\epsilon } \sim \mathscr {N}(0, \textbf{I})\) and is dimensionally equivalent to \(\textbf{x}_0\).
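In code, Eq. (3) amounts to a single vectorized step. A sketch, assuming the `alpha_bars` tensor from the schedule above and a batch of compound vectors `x0` of shape `(batch, 96)`:

```python
def q_sample(x0: torch.Tensor, t: torch.Tensor, alpha_bars: torch.Tensor):
    """Sample x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps, eps ~ N(0, I)."""
    eps = torch.randn_like(x0)
    abar = alpha_bars[t].unsqueeze(-1)  # shape (batch, 1), broadcasts over R^96
    xt = abar.sqrt() * x0 + (1 - abar).sqrt() * eps
    return xt, eps                      # eps is returned as the training target
```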

The reverse process is then defined to be

$$\begin{aligned} p_{\theta }(\textbf{x}_{t-1} | \textbf{x}_t) := \mathscr {N}(\textbf{x}_{t-1}; \varvec{\mu }_{\theta }(\textbf{x}_t, t), \sigma _{t}^2\textbf{I})\, . \end{aligned}$$
(4)

In this work, we fix \(\sigma _{t}^2 = \beta _t\). Then, as shown in Ho et al.12, by rewriting \(\varvec{\mu }_{\theta }\) as a linear combination of \(\textbf{x}_t\) and \(\varvec{\epsilon }_\theta\), a neural network that predicts \(\varvec{\epsilon }\) from \(\textbf{x}_t\) with input and output dimensions equal to those of the noise it predicts, the reverse process may be rewritten as:

$$\begin{aligned} \textbf{x}_{t-1} = \frac{1}{\sqrt{\alpha _t}} \left( \textbf{x}_t - \frac{1 - \alpha _t}{\sqrt{1 - \overline{\alpha }_t}}\varvec{\epsilon }_\theta (\textbf{x}_t,t)\right) + \sigma _t\textbf{z}\, , \end{aligned}$$
(5)

where \(\textbf{z} \sim \mathscr {N}(0, \textbf{I})\).
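Putting Eq. (5) into a loop gives the generation procedure below; `eps_model` stands for the trained \(\varvec{\epsilon }_\theta\) network, and its name and interface are our assumptions for illustration.

```python
@torch.no_grad()
def sample(eps_model, betas, alphas, alpha_bars, n: int = 16, dim: int = 96):
    """Ancestral sampling: start from x_T ~ N(0, I) and iteratively denoise."""
    x = torch.randn(n, dim)
    for t in reversed(range(len(betas))):
        eps = eps_model(x, torch.full((n,), t))
        coef = (1 - alphas[t]) / (1 - alpha_bars[t]).sqrt()
        mean = (x - coef * eps) / alphas[t].sqrt()
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * z  # sigma_t^2 = beta_t, as fixed above
    return x                            # x_0: generated compound vectors
```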

To train the DDPM, noise is added to \(\textbf{x}_0\) using the forward process \(q(\textbf{x}_t | \textbf{x}_{0})\) for a randomly sampled \(t \sim \text {Uniform}(\{1,\ldots , T\})\), which the neural network then learns to remove through the reverse process.
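A sketch of one training step with the simplified objective of Ho et al.12 (predict the injected noise and minimize the mean squared error), reusing `q_sample` and `alpha_bars` from the sketches above:

```python
def train_step(eps_model, optimizer, x0, alpha_bars, T: int = 1000):
    """One optimization step of the simplified DDPM loss from Ho et al."""
    t = torch.randint(0, T, (x0.shape[0],))  # t ~ Uniform({1, ..., T})
    xt, eps = q_sample(x0, t, alpha_bars)    # forward process, Eq. (3)
    loss = torch.nn.functional.mse_loss(eps_model(xt, t), eps)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```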

Four versions of the DDPM were trained on SuperCon: one for cuprates, one for pnictides, one for others, and one for all classes (“everything”). The dataset for each version was randomly split into training and validation sets in an approximately \(95\%\)–\(5\%\) proportion. Training curves for all versions converged and stabilized after around 50 epochs, and each version was trained for between 50 and 100 epochs, depending on the approximate lowest validation loss. For all versions, NAdam24 was chosen as the optimizer and provided satisfactory results. Moreover, as in Ho et al.12, \(T\) was set to 1000 and the U-Net25 neural network architecture was used for \(\varvec{\epsilon }_{\theta }\) (here a 1D U-Net, as opposed to the 2D U-Net used for images).

Conditioning

Iterative Latent Variable Refinement (ILVR)14 was used to condition the DDPM. Because ILVR is training-free, the same four trained unconditional DDPMs could be modified for conditioning with relative ease.

Figure 3

Overview of the Iterative Latent Variable Refinement14 method used. The vector image representation is the same as explained in Fig. 2. \(\mathrm {YBa_{1.4}Sr_{0.6}Cu_{3}O_{6}Se_{0.51}}\)31 is an example of a reference superconductor and \(\mathrm {YBa_{1.4}Sr_{0.6}Cu_{3}O_{6}Se_{0.18}As_{0.32}}\) is an example of a generated output.

ILVR is a slight modification to the reverse diffusion process, and the general structure of ILVR used is shown in Fig. 3. At each step of the reverse “denoising” process, instead of sampling \(\textbf{x}_{t-1}\) directly from \(p_{\theta }(\textbf{x}_{t-1}|\textbf{x}_t)\) like in unconditional DDPM, \(\textbf{x}_{t-1}\) instead becomes

$$\begin{aligned} \textbf{x}_{t-1} = \phi _{N}(\textbf{y}_{t-1}) + \textbf{x}_{t-1}' - \phi _{N}(\textbf{x}_{t-1}')\, , \end{aligned}$$
(6)

where \(\textbf{x}_{t-1}' \sim p_{\theta }(\textbf{x}_{t-1}' | \textbf{x}_t)\) is the original unconditional proposal, \(\textbf{y}_{t-1} \sim q(\textbf{y}_{t-1} | \textbf{y})\) is the reference compound \(\textbf{y}\) noised by the forward process in Eq. (2), and \(\phi _N\) is a linear low-pass filtering operation that maintains the dimensionality of the input.

The goal of ILVR conditioning is to have \(\phi _{N}(\textbf{x}_{0}) = \phi _{N}(\textbf{y})\), thereby allowing the generated output \(\textbf{x}_0\) to share high-level features with reference \(\textbf{y}\). In this case, the generated superconductor should have similar chemical composition as the reference superconductor.

Choi et al.14 state that the scale factor \(N\) controls the amount of information carried from the reference to the generated output: lower \(N\) results in greater similarity between generated output and reference, while higher \(N\) carries over only coarse information from the reference. In our work, we found that \(N > 4\) produced large numbers of invalid compounds with negative amounts of elements. As a result, we used \(N = 2\) up to \(N = 4\), and we found the conclusions about varying \(N\) in Choi et al.14 to remain applicable.
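A sketch of one ILVR refinement step for our 1D setting is shown below. Here we implement \(\phi _N\) as factor-\(N\) downsampling followed by linear upsampling, the 1D analogue of the operation in Choi et al.14 (an assumption for illustration); `q_sample` is the forward-process sampler sketched earlier.

```python
import torch
import torch.nn.functional as F

def phi(x: torch.Tensor, N: int) -> torch.Tensor:
    """Low-pass filter phi_N: down- then upsample a (batch, 96) vector by N."""
    x = x.unsqueeze(1)  # (batch, 1, 96), as required by 1D interpolation
    down = F.interpolate(x, scale_factor=1.0 / N, mode="linear")
    up = F.interpolate(down, size=x.shape[-1], mode="linear")
    return up.squeeze(1)

def ilvr_step(x_prime, y, t, alpha_bars, N: int = 2):
    """Eq. (6): mix the unconditional proposal x' with the noised reference y."""
    y_t, _ = q_sample(y, t, alpha_bars)  # y_{t-1} ~ q(y_{t-1} | y), via Eq. (2)
    return phi(y_t, N) + x_prime - phi(x_prime, N)
```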

Sampling

As mentioned previously, we trained four versions of the unconditional model, each of which was then copied and modified with ILVR conditioning to create four versions of the conditional model. We thus have four versions of the unconditional DDPM (without ILVR), which we call “unconditional SuperDiff”, and four versions of the conditional DDPM (with ILVR), which we call “conditional SuperDiff”. On a single consumer Nvidia RTX 3060 Ti GPU, each version of SuperDiff was trained in under 2 h, and we sampled 500,000 compounds from each of the four unconditional SuperDiff versions, which took less than 10 h per version. These relatively fast training and inference times make SuperDiff trainable and usable with the resources available at most universities and even to consumers. For conditional SuperDiff, we sampled varying numbers of compounds for different reference superconductors, and we discuss those results later.

All sampled compounds were initially screened through various quality checks to ensure that the generated compounds were reasonably realistic. First, after rounding all element amounts to two decimal places, we eliminated all generated compounds with negative amounts of elements. Next, we eliminated compounds with either too few (only 1) or too many elements—for cuprates, we limited outputs to compounds with a maximum of 7 elements, and for pnictides and others, to compounds with a maximum of 5 elements. After these basic checks, we removed duplicates and further evaluated compound validity with the charge neutrality and electronegativity checks from the SMACT package26. Finally, we ran formation energy prediction with ElemNet27,28. We discuss the performance of model generations against these checks later.
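A minimal sketch of the basic screening logic follows; the thresholds are those stated above, while the subsequent SMACT and ElemNet checks are applied afterwards and are not reproduced here.

```python
import numpy as np

MAX_ELEMENTS = {"cuprates": 7, "pnictides": 5, "others": 5}

def basic_screen(x: np.ndarray, family: str) -> bool:
    """Keep a generated vector only if it passes the basic validity checks."""
    x = np.round(x, 2)               # round element amounts to two decimals first
    if (x < 0).any():                # reject negative amounts of elements
        return False
    n_elements = int((x > 0).sum())  # reject too few (only 1) or too many elements
    return 2 <= n_elements <= MAX_ELEMENTS[family]

# Duplicates can then be removed by hashing the rounded vectors:
# unique = {tuple(np.round(x, 2)) for x in generated if basic_screen(x, "cuprates")}
```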

Clustering

To determine whether SuperDiff could generate new superconductor families, clustering analysis was performed. Clustering, an unsupervised machine-learning method for finding hidden patterns within data, was applied to the SuperCon database in Roter et al.15, which established that such methods, when applied to superconductors, can exceed human performance in identifying different “families” of superconductors, represented as clusters. In this work, we use the clustering method for superconductors from Roter et al.15 to evaluate generated outputs for new families. Roter et al.15 also found that, for superconductors, the t-SNE method worked best for visualizing clustering results. t-SNE is a non-linear dimensionality reduction technique that represents higher-dimensional data (96-dimensional superconductor data points in this case) in 2D or 3D29, where the embedding axes carry no physical meaning.
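A sketch of the visualization step using scikit-learn's t-SNE; `supercon_vectors` and `generated_vectors` are placeholder names for the \(\mathbb {R}^{96}\) data arrays, and the clustering model itself (from Roter et al.15) is not reproduced here.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Embed training and generated compounds together so they share one 2D map.
X = np.vstack([supercon_vectors, generated_vectors])  # placeholder arrays in R^96
emb = TSNE(n_components=2, perplexity=30).fit_transform(X)

n = len(supercon_vectors)
plt.scatter(emb[:n, 0], emb[:n, 1], s=5, label="SuperCon")
plt.scatter(emb[n:, 0], emb[n:, 1], s=5, facecolors="none",
            edgecolors="k", label="generated")
plt.legend()
plt.show()
```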

As discussed in the introduction, a major objective of this work was to generate new families of superconductors, as identified by the clustering model—that is, to generate new clusters of superconductors. This was not accomplished by previous works, including the GAN in Kim and Dordevic6 and the diffusion model in Zhong et al.8. To achieve this goal, we experimented with the conditional model’s ability to interpolate between the reference compound and the training dataset. This idea of probing a conditional DDPM’s ability to interpolate between the reference set and the training set was proposed in Giannone et al.30 to achieve few-shot generation on image classes never seen during training; we attempt the analogous task with superconductors. For instance, we condition the cuprate version of conditional SuperDiff on new reference cuprates outside the families of cuprate superconductors in the training dataset. With this technique, we examine the model’s ability to generate new clusters, or families, of superconductors using information from the reference compound, and we report our clustering results below.

Results

In this section, we report the performance of SuperDiff on various checks and discuss some notable new findings. We first evaluate the performance of unconditional SuperDiff on the 500,000 compounds generated for each of the four classes through various computational tests, including general compound checks as well as checks for superconductivity. We use the same computational tests for unconditional SuperDiff as were used for the GAN in Kim and Dordevic6 and are thus able to directly compare unconditional performance. Afterward, as our most notable results, we evaluate both the unconditional and conditional versions of SuperDiff on clustering and manually identify and present some promising new families of superconductors generated by conditional SuperDiff.

Duplicates and validity

For the 500,000 compounds generated by each version of unconditional SuperDiff, we first screened for duplicates between the generated set and the training set (the portion of the SuperCon database of the same class) and for duplicates within the generated set itself. After this, we ran the charge neutrality and electronegativity checks on the generated compounds with the SMACT package26. We present the results of these general tests in Table 1, and then we remove all duplicates from the generated sets.

Table 1 Summary of unconditional SuperDiff performance for the four trained versions, based on the 500,000 compounds sampled from each version.

We notice that the novelty and uniqueness percentages of the generated results are all very high, which means that unconditional SuperDiff is able to generate compounds that are both diverse and novel. Unconditional SuperDiff outperforms the GAN in Kim and Dordevic6 in all metrics of generation novelty and uniqueness, and, as speculated in that work, we attribute the high novelty percentage to the non-stoichiometric nature of the compounds we generate, which opens up a large composition space for the model. Notably, unconditional SuperDiff maintains a very high uniqueness percentage for pnictides despite the small training set, something not accomplished by the Wasserstein GAN in Kim and Dordevic6. This corroborates observations in other disciplines that DDPMs generate more diverse results than GANs16. Lastly, although the SMACT check26 results varied greatly between classes and the proportion of valid compounds for some classes was fairly low, the fast inference time means that SuperDiff can still produce valid compounds for all classes at a reasonable rate.

Overall, these results indicate that all versions of unconditional SuperDiff are able to generate novel, unique, and valid compounds—overcoming the past issues faced by Kim and Dordevic6. As conditional SuperDiff shares most of its components with the unconditional model, it was unsurprising that, in most cases, conditional SuperDiff was also able to generate novel, unique, and valid compounds; however, these qualities depended heavily on the reference compound, so we still run these checks on all compounds generated by conditional SuperDiff and filter out invalid compounds.

Formation energy

We further validated the ability of SuperDiff to generate synthesizable compounds by predicting the formation energies of the generated compounds with ElemNet27,28, a deep neural network model for predicting material properties from elemental composition alone. We chose ElemNet for our formation energy prediction because it uses only chemical composition, as we do not consider crystal structure in our generation process. Because ElemNet does not take in compounds as column vectors in \(\mathbb {R}^{96}\), as SuperDiff does, but instead as column vectors in \(\mathbb {R}^{86}\) with certain elements removed, we ran the ElemNet formation energy prediction only on the generated compounds that ElemNet directly supports—these constituted the great majority of generated compounds. We display the distributions of the predicted formation energies of the generated compounds in Fig. 4.

Figure 4

Distribution of ElemNet27,28 predicted formation energies of the generated compounds from the four versions of unconditional SuperDiff—(a) Everything, (b) Cuprates, (c) Pnictides, and (d) Others—as well as (e) the Cuprates version of conditional SuperDiff conditioned on \(\mathrm {YBa_{1.4}Sr_{0.6}Cu_{3}O_{6}Se_{0.51}}\)31 and (f) the Pnictides version of conditional SuperDiff conditioned on \(\mathrm {BaFe_{1.7}Ni_{0.3}As_{2}}\)32. Also shown is the average formation energy for each distribution.

As shown in the figure, unconditional SuperDiff generated a majority of compounds with negative formation energy for all classes of superconductors, with the mean formation energy for all classes predicted to be negative as well. In Jha et al.27, it was stated that negative formation energy values are a good indicator of a compound’s stability and synthesizability; therefore, although these predictions are not definitive proof—experimental validation would be necessary—they provide an indication that most of the compounds generated by unconditional SuperDiff are plausibly stable and synthesizable.

For conditional SuperDiff, the distribution of formation energies for generated compounds depends heavily on the reference compound. However, given a reasonable reference compound—that is, a valid reference compound belonging to the class of superconductor that the version of SuperDiff was trained on—we demonstrate that conditional SuperDiff is able to generate compounds predicted to be stable by ElemNet. Specifically, as shown in Fig. 4, for the cuprates version of conditional SuperDiff conditioned on \(\mathrm {YBa_{1.4}Sr_{0.6}Cu_{3}O_{6}Se_{0.51}}\)31 and the pnictides version conditioned on \(\mathrm {BaFe_{1.7}Ni_{0.3}As_{2}}\)32—some of the compounds we later condition on to find new families of superconductors—the predicted formation energy distributions show all generated compounds to have negative formation energy. These results indicate that, given reasonable reference compounds, conditional SuperDiff can generate plausibly stable and synthesizable compounds, which is not surprising given the fundamental architectural similarities between conditional and unconditional SuperDiff.

Superconductivity

After those general checks, we performed some computational checks for superconductivity in order to verify that unconditional SuperDiff is indeed able to generate probable superconductors. We ran the compounds generated by unconditional SuperDiff through the K-Nearest Neighbors (KNN) classification model and regression model from Roter and Dordevic4 for predicting superconductivity and critical temperature, respectively, based on elemental composition.

For the predicted proportion of generated compounds that were superconducting, we accounted for the inherent probabilistic error of the classification model by using Bayesian statistics to estimate the true proportion of superconducting generated compounds given the classification model’s predicted proportion \(p_{sc}\) and its true positive rate \(\mathit{tp}\) and false positive rate \(\mathit{fp}\). The true proportion of generated compounds that are superconductors, \(\rho _{sc}\), may be estimated as6

$$\begin{aligned} \rho _{sc} \approx \frac{p_{sc} - \mathit{fp}}{\mathit{tp} - \mathit{fp}}\, , \end{aligned}$$
(7)

where \(\mathit{tp} = 98.69\%\) and \(\mathit{fp} = 16.94\%\) are reported by Roter and Dordevic4.
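As a worked example of Eq. (7), using the reported rates and a hypothetical predicted proportion of \(70\%\):

```python
tp, fp = 0.9869, 0.1694  # rates reported by Roter and Dordevic
p_sc = 0.70              # hypothetical predicted proportion, for illustration only
rho_sc = (p_sc - fp) / (tp - fp)
print(f"{rho_sc:.2f}")   # 0.65: estimated true superconducting proportion
```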

For the generated compounds that were predicted to be superconducting, we used the regression model in Roter and Dordevic4 to predict their critical temperatures. Like all other tests done so far, this computational prediction is only an approximation. We tabulated the results of the classification and regression predictions on the compounds generated by unconditional SuperDiff in Table 1. We will discuss the predicted superconductivity of compounds generated by conditional SuperDiff later.

As seen in the table, all versions of unconditional SuperDiff were able to generate predicted superconductors at a rate comparable to the GAN in Kim and Dordevic6 and much higher than the 3% achieved by manual search in Hosono et al.2—notably, unconditional SuperDiff seems to perform much better on pnictides despite the small training set. This is further indication of the effectiveness of computational search for superconductors when compared to manual searches. Moreover, unconditional SuperDiff seems to capture the critical temperature distribution of the SuperCon training dataset much better than the GAN in Kim and Dordevic6.

Although actual synthesis and testing in a lab are required to confirm superconductivity, these checks, combined with the clustering analysis results that we will discuss later, provide a general indication that unconditional SuperDiff is able to generate highly plausible superconductors.

Clustering results

We ran the clustering analysis described previously on both unconditional and conditional SuperDiff. We display the clustering results for the cuprates version of unconditional SuperDiff in Fig. 5. Superconductors from the SuperCon database are shown with full circles of different colors, whereas our predictions are shown with open black circles. Although unconditional SuperDiff generated compounds in all known clusters, or families, of superconductors, it generated no new families—this was true for the other versions of unconditional SuperDiff as well. This was the expected result, as the underlying DDPM’s goal is simply to map Gaussian noise to the training data distribution, not to some other new distribution. However, superconductor discovery has a particular interest in the generation of new families of superconductors, so a method of controlling the generation process to change the generated data distribution is desirable. With conditional SuperDiff, we are able to control the generation process to computationally generate new families of superconductors for the first time.

Figure 5

Clustering of the (a) valid generated compounds from the Cuprates version of unconditional SuperDiff and (b) valid generated compounds from the Others version of conditional SuperDiff conditioned on various compounds. Colored full circles represent data points from SuperCon (cuprates only for (a) and others only for (b)), with each color representing a different cluster, or family, of superconductors as identified by the model from Roter et al.15; black open circles are compounds generated by SuperDiff. We notice that unconditional SuperDiff did not generate any new families of superconductors, as all generated compounds fall within the existing clusters of superconductors from SuperCon. However, for conditional SuperDiff, although some generated superconductors fall within the existing SuperCon clusters, we were able to identify two new clusters consisting of only generated superconductors (marked with arrows). These two new clusters correspond to two new families of superconductors generated by SuperDiff: \(\mathrm {Li_{1-x}Be_{x}Ga_{2}Rh}\) and \(\mathrm {Na_{1-x}Al_{1-y}Mg_{x+y}Ge_{1-z}Ga_{z}}\).

In Fig. 5, we also display a sample clustering result from the “others” version of conditional SuperDiff conditioned on various compounds. As seen in the plot, we identified two new clusters: \(\mathrm {Li_{1-x}Be_{x}Ga_{2}Rh}\), which was generated by conditioning SuperDiff on \(\mathrm {LiGa_2Rh}\)33, and \(\mathrm {Na_{1-x}Al_{1-y}Mg_{x+y}Ge_{1-z}Ga_{z}}\), which was generated by conditioning SuperDiff on \(\textrm{NaAlGe}\)34. Those and other predicted families will be discussed in more detail below.

These clustering results show that, with this ability to control generation by conditioning SuperDiff on compounds not in the SuperCon training set, SuperDiff is able to use information from various reference compounds to generate completely new families of superconductors. As expected from the nature of the conditioning method, the reference compound does belong to the new cluster generated from it; however, one of the main contributions of this work is that we are able to extrapolate a new family of superconductors from an otherwise single reference compound. We performed this clustering analysis on all versions of conditional SuperDiff conditioned on a variety of reference compounds, and below we discuss the promising new families of superconductors in more detail and verify their superconductivity.

Promising generated new families

After running clustering analysis for the different versions of conditional SuperDiff conditioned on a variety of reference compounds, we manually identified the most promising new families of superconductors generated by conditional SuperDiff. Beyond the novelty, uniqueness, and SMACT checks, we further verified the novelty of these generated families by searching the internet and other databases—these newly generated families could not be found anywhere else. We tabulate these most promising new families in Table 2, identifying for each the reference compound used, a few example outputs with their predicted \(T_c\) from the regression model in Roter and Dordevic4, and the general formula of the new family. We notice that most compounds generated with conditional SuperDiff are predicted to be superconducting, with predicted \(T_c\) values reasonable for each class. A particularly interesting result is that our model generated some new families of superconductors with double or, in one case, even triple doping. This is an interesting new avenue for superconductor discovery that has not been extensively studied, and our model suggests material scientists should explore it in more detail.

Table 2 Promising new families of superconductors generated by conditional SuperDiff.

We further demonstrate that conditional SuperDiff is able to generate realistic new families of superconductors by plotting the predicted \(T_c\) using the regression model in Roter and Dordevic4 against the Cesium doping content for the newly generated \(\mathrm {Ba_{2-x}Cs_{x}CuO_{3.3}}\) family in Fig. 6. We notice that the generated \(\mathrm {Ba_{2-x}Cs_{x}CuO_{3.3}}\) family is predicted to exhibit the expected parabolic \(T_c\) doping dependence relationship for this type of cuprate superconductor, which was observed previously in other cuprate families35.

Figure 6

Plot of \(T_c\) predicted by the regression model in Roter and Dordevic4 versus Cesium content (x) for \(\mathrm {Ba_{2-x}Cs_{x}CuO_{3.3}}\) family generated by conditional SuperDiff (see Table 2). We notice a characteristic parabolic dependence of \(T_c\) versus doping, observed previously in other cuprate families35.

These findings again show that SuperDiff is not only able to generate new superconductors within known families but is also able to overcome the limitations of previous generative models to generate completely new families of superconductors that are also realistic—although we note that, for some reference compounds, SuperDiff was unable to generate new families of superconductors.

Discussion

With the lack of a systematic approach, the discovery of new high \(T_c\) superconductors has long depended on material scientists’ serendipity. Recently, machine learning has been applied to this field to help assist scientists, but past works still lacked many key capabilities, for instance, the ability to computationally find new families of superconductors. Moreover, recent generative model approaches applied to this field also lacked methods of controlling the generation process by incorporating information from reference compounds6,7,8.

In this paper, we have introduced a novel method for superconductor discovery using diffusion models with conditioning functionality that addresses these major issues. Like previous works applying generative models to superconductor discovery, we were able to generate novel, realistic, and highly plausible superconductors that lie outside existing databases—leveraging this “inverse design” approach to significantly outperform manual search and previous classification model approaches. With our unconditional model, we also addressed the low generated-compound uniqueness issues that plagued previous works due to the small training dataset for pnictides. Most importantly, beyond the unconditional performance improvements the diffusion model brought, our contribution of implementing ILVR conditioning for superconductor discovery, which allows the generation process to be controlled, enabled the creation of a tool for computationally generating completely new families of superconductors. We verified the generation of new families of superconductors with our clustering analysis, and we presented several of these promising new families for several different classes of superconductors in Table 2. Once again, we point out that no previous computational model for superconductor discovery would have been capable of generating these new families of superconductors, as such models attempt to produce only samples that match the training data.

The application of deep generative models for superconductor discovery continues to be a very promising and exciting approach. Future studies can benefit from possible improvements that can be made to SuperDiff, including implementing a physics-informed diffusion model and creating and utilizing a better, more comprehensive training dataset of superconductors. Nevertheless, SuperDiff in its current form is still very powerful as a tool for superconductor discovery, and researchers can currently benefit from it in a myriad of ways, such as by using its novel generations as inspiration—starting with the new families introduced here, using it to expand on their own new discoveries, or by simply experimenting with many more reference compounds (such as high-pressure superconductors) to continue using it to generate completely new families of hypothetical superconductors or hypothetical superconductors with even higher \(T_c\).