Large language models (LLMs) are a class of machine learning model trained to take input strings of text, or prompts, and to produce likely continuations of them as output. Successful models developed by commercial companies contain billions or even trillions of parameters and are trained on massive amounts of publicly available text. These models are typically intended to be general purpose, but they can be customized for specific applications by including key information at the beginning of a prompt string, or by minimally retraining the model on new sets of strings, a process termed fine-tuning. Writing in Nature Machine Intelligence, Jablonka et al.1 demonstrate that GPT-3 (an LLM from OpenAI)2 can easily be fine-tuned on small amounts of data to solve a wide range of chemical inference tasks, sometimes matching or even outperforming highly domain-specific machine learning approaches trained on much larger amounts of data.

LLMs are poised to have a major role in chemistry education and research3, and are already being leveraged to mine data from the chemical literature4 and to automate experiments5,6. LLMs have been shown to intrinsically contain chemical knowledge across a wide range of domains7. However, Jablonka et al.1 posed a different question: does the same training that gives LLMs the ability to predict future text from past text also give them the ability to take a new chemical dataset and extrapolate beyond it?

The GPT-3 model, fine-tuned on new data (or with the data presented as part of the prompt), is shown to perform many kinds of chemistry task at a high level (Fig. 1). To perform classification or prediction of specific chemical properties, data are presented to the model as a string of text containing pairs of molecular identifiers and measured features or categories. A new prediction then involves merely providing the beginning of a new pair as the prompt. Inverse design, although a harder task, is implemented in the same manner by simply swapping the order of each input pair. Results were surprisingly insensitive to the way in which the molecule was identified, with good performance using both standard IUPAC (International Union of Pure and Applied Chemistry) names and more structured representations such as SMILES strings, which specify all atoms and their bonded topology. The authors have produced open-source packages that make these tasks straightforward to implement.
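To make the pairing concrete, here is a minimal sketch of how such training data might be laid out, assuming the prompt/completion JSONL format used for legacy GPT-3 fine-tuning. The molecules, property labels and separator tokens below are purely illustrative and are not taken from the paper.

```python
import json

# Illustrative (molecule, property) pairs; the values are invented for this sketch.
records = [
    ("CCO", "high solubility"),
    ("c1ccccc1", "low solubility"),
]

# Forward task (classification or prediction): molecule -> property.
# Each example becomes one JSONL line in the prompt/completion format.
with open("forward_train.jsonl", "w") as f:
    for molecule, prop in records:
        f.write(json.dumps({
            "prompt": f"What is the solubility of {molecule}?###",
            "completion": f" {prop}@@@",  # trailing token acts as a stop sequence
        }) + "\n")

# Inverse design: swap the roles of each pair, prompting with the property
# and letting the model complete with a molecule.
with open("inverse_train.jsonl", "w") as f:
    for molecule, prop in records:
        f.write(json.dumps({
            "prompt": f"Give a molecule with {prop}###",
            "completion": f" {molecule}@@@",
        }) + "\n")
```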

Fig. 1: Utilizing LLMs such as GPT-3 to answer chemical questions.

A language model functions as a ‘black box’ that is trained on data consisting of pairs of molecules and their associated measurements. The model takes in prompts (left) and produces predictions (right). In classification or prediction tasks (red), prompts correspond to molecules and outputs are the predicted chemical properties, whereas in inverse design (blue) a prompt consists of a table of chemical properties, which yields a prediction of a new molecule.

What types of task can be accomplished in this manner? Among many examples, Jablonka et al.1 predict UV absorption peaks, gaps between the highest occupied (HOMO) and lowest unoccupied (LUMO) molecular orbitals, and Henry coefficients for adsorption into porous materials. They also reverse this process to predict molecules that have a particular property. The authors show, for example, that they can predict molecules with a particular HOMO-LUMO gap, and that they can extend the approach by restricting the training data to a small range of gaps and then predicting new molecules with larger gaps. Because this property can be computed independently, they even found that this procedure could be iterated to yield a set of molecules with a target feature well outside the range of the starting set. In a final example, the authors predict the adsorption properties of polymeric chemical dispersants whose sequence can be encoded as a custom string of characters. This demonstrates that the use cases are not restricted to problems involving small organic molecules with properties that follow in a relatively straightforward way from their shape and electronic structure.
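As a rough sketch of that iterative loop (not the authors' implementation), and assuming that fine-tuning, candidate generation and a computational property predictor are available as callables, the procedure might be organized as follows; all function names and defaults here are hypothetical.

```python
def iterative_extrapolation(seed_data, fine_tune, generate, predict_property,
                            target_gap, n_rounds=5, n_candidates=50):
    """Hypothetical sketch of an iterative inverse-design loop.

    seed_data:                   list of (molecule, gap) pairs in a narrow range
    fine_tune(data):             returns a model fine-tuned on property->molecule pairs
    generate(model, gap, n):     returns n candidate molecule strings for a target gap
    predict_property(molecule):  gap computed by an external method,
                                 for example a quantum-chemistry calculation
    """
    data = list(seed_data)
    best = max(data, key=lambda pair: pair[1])
    for _ in range(n_rounds):
        model = fine_tune(data)                      # retrain on the growing set
        for molecule in generate(model, target_gap, n_candidates):
            gap = predict_property(molecule)         # score with the simulator
            data.append((molecule, gap))             # feed the result back in
            if gap > best[1]:
                best = (molecule, gap)               # track the largest gap so far
    return best, data
```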

The generality and simplicity of the strategy provided by Jablonka et al.1, together with its low cost compared with extensive experimental or computational screens, suggest that it could become a common second or even first step in chemical design. The approach is also likely to find its way into the workflow of automated agents that design new chemistries. Combined with tools that enable researchers to parse the literature, it could facilitate the quick generation of hypotheses based on sparse experimental measurements. One possible domain of application is tuning the phase behaviour of complex mixtures of block co-polymers, polyelectrolytes or disordered proteins. If this approach proves popular, it would be a compelling reason for scientists to report and publish data in tabulated formats that are easily adaptable to such inference problems.

Overall, the study by Jablonka et al.1 raises clear questions about the fundamental properties of LLMs and their underlying transformer architecture. For a start, what is it about the training corpus that gives the model the ability to connect chemical structures with properties distinct from those on which it was originally trained? Second, is this connection similar to the reasoning of a human expert given a similar set of information? Moreover, what in the LLM constrains the chemical space for inverse design problems to realistic molecules, and how can training be adapted to constrain outputs to easily synthesizable derivatives of known molecules? Finally, could an even better model be built that expands on the already impressive ability of LLMs to perform chemical inference, for example by accepting multimodal inputs, such as an image of a chemical structure or a spectrum, and outputting numerical data? We know that ChatGPT can already solve our chemistry homework assignments, and the future appears bright for AI to accelerate molecular and materials discovery.