Large language models (LLMs) are a class of machine learning model trained to take input strings of text, or prompts, and to produce likely continuations of them as output. Successful models developed by commercial companies contain billions or even trillions of parameters and are trained on massive amounts of publicly available text. These models are typically intended to be general purpose, but they can be customized for specific applications by including key information at the beginning of a prompt string, or by minimally retraining the model on new sets of strings, a process termed fine-tuning. Writing in Nature Machine Intelligence, Jablonka et al.1 demonstrate that GPT-3 (an LLM from OpenAI)2 can easily be fine-tuned on small amounts of data to solve a wide range of chemical inference tasks, sometimes matching or even outperforming highly domain-specific machine learning approaches trained on much larger amounts of data.

LLMs are poised to have a major role in chemistry education and research3, and are already being leveraged to mine data from the chemical literature4 and to automate experiments5,6. LLMs have been shown to intrinsically contain chemical knowledge across a wide range of domains7. However, Jablonka et al.1 posed a different question: does the same training that gives LLMs the ability to predict future text from past text also give them the ability to take a new chemical dataset and extrapolate beyond it?

The GPT-3 model, fine-tuned on new data (or with the data presented as part of the prompt), is shown to perform many kinds of chemistry task at a high level (Fig. 1). To perform classification or prediction of specific chemical properties, data are presented to the model as a string of text containing pairs of molecular identifiers and measured features or categories. A new prediction then involves merely providing the beginning of a new pair as the prompt. Inverse design, although a harder task, is implemented in the same manner by simply swapping the order of each input pair. Results were surprisingly insensitive to the way in which the molecule was identified, with good performance using both standard IUPAC (International Union of Pure and Applied Chemistry) names and more structured representations such as SMILES strings, which specify all atoms and their bonded topology. The authors have produced open-source packages that make these tasks straightforward to implement.
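To make the pairing concrete, here is a minimal sketch of how such training data might be laid out, assuming the prompt/completion JSONL format used for legacy GPT-3 fine-tuning. The molecules, property labels and separator tokens below are purely illustrative and are not taken from the paper.

```python
import json

# Illustrative (molecule, property) pairs; the values are invented for this sketch.
records = [
    ("CCO", "high solubility"),
    ("c1ccccc1", "low solubility"),
]

# Forward task (classification or prediction): molecule -> property.
# Each example becomes one JSONL line in the prompt/completion format.
with open("forward_train.jsonl", "w") as f:
    for molecule, prop in records:
        f.write(json.dumps({
            "prompt": f"What is the solubility of {molecule}?###",
            "completion": f" {prop}@@@",  # trailing token acts as a stop sequence
        }) + "\n")

# Inverse design: swap the roles of each pair, prompting with the property
# and letting the model complete with a molecule.
with open("inverse_train.jsonl", "w") as f:
    for molecule, prop in records:
        f.write(json.dumps({
            "prompt": f"Give a molecule with {prop}###",
            "completion": f" {molecule}@@@",
        }) + "\n")
```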

Fig. 1: Utilizing LLMs such as GPT-3 to answer chemical questions.

A language model functions as a ‘black box’ that is trained on data consisting of pairs of molecules and their associated measurements. The model takes in prompts (left) and produces predictions (right). In classification or prediction tasks (red), prompts correspond to molecules and outputs are the predicted chemical properties, whereas in inverse design (blue) a prompt consists of a table of chemical properties, which yields a prediction of a new molecule.

What types of task can be accomplished in this manner? Among many examples, Jablonka et al.1 predict UV absorption peaks, gaps between the highest occupied (HOMO) and lowest unoccupied (LUMO) molecular orbitals, and Henry coefficients for adsorption into porous materials. They also reverse this process to predict molecules that have a particular property. The authors show, for example, that they can predict molecules with a particular HOMO-LUMO gap, and that they can extend the approach by restricting the training data to a small range of gaps and then predicting new molecules with larger gaps. Because this property can be computed independently, they even found that this procedure could be iterated to yield a set of molecules with a target feature well outside the range of the starting set. In a final example, the authors predict the adsorption properties of polymeric chemical dispersants whose sequence can be encoded as a custom string of characters. This demonstrates that the use cases are not restricted to problems involving small organic molecules with properties that follow in a relatively straightforward way from their shape and electronic structure.
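As a rough sketch of that iterative loop (not the authors' implementation), and assuming that fine-tuning, candidate generation and a computational property predictor are available as callables, the procedure might be organized as follows; all function names and defaults here are hypothetical.

```python
def iterative_extrapolation(seed_data, fine_tune, generate, predict_property,
                            target_gap, n_rounds=5, n_candidates=50):
    """Hypothetical sketch of an iterative inverse-design loop.

    seed_data:                   list of (molecule, gap) pairs in a narrow range
    fine_tune(data):             returns a model fine-tuned on property->molecule pairs
    generate(model, gap, n):     returns n candidate molecule strings for a target gap
    predict_property(molecule):  gap computed by an external method,
                                 for example a quantum-chemistry calculation
    """
    data = list(seed_data)
    best = max(data, key=lambda pair: pair[1])
    for _ in range(n_rounds):
        model = fine_tune(data)                      # retrain on the growing set
        for molecule in generate(model, target_gap, n_candidates):
            gap = predict_property(molecule)         # score with the simulator
            data.append((molecule, gap))             # feed the result back in
            if gap > best[1]:
                best = (molecule, gap)               # track the largest gap so far
    return best, data
```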

The generality and simplicity of the strategy provided by Jablonka et al.1, together with its low cost compared with extensive experimental or computational screens, suggest that it could become a common second or even first step in chemical design. The approach is also likely to find its way into the workflow of automated agents that design new chemistries. Combined with tools that enable researchers to parse the literature, it could facilitate the quick generation of hypotheses based on sparse experimental measurements. One possible domain of application is tuning the phase behaviour of complex mixtures of block co-polymers, polyelectrolytes or disordered proteins. If this approach proves popular, it would be a compelling reason for scientists to report and publish data in tabulated formats that are easily adaptable to such inference problems.

Overall, the study by Jablonka et al.1 raises clear questions about the fundamental properties of LLMs and their underlying transformer architecture. For a start, what is it about the training corpus that gives the model the ability to connect chemical structures with properties distinct from those on which it was originally trained? Second, is this connection similar to the reasoning of a human expert given a similar set of information? Moreover, what in the LLM constrains the chemical space for inverse design problems to realistic molecules, and how can training be adapted to constrain outputs to easily synthesizable derivatives of known molecules? Finally, could an even better model be built that expands on the already impressive ability of LLMs to perform chemical inference, for example by accepting multimodal inputs, such as an image of a chemical structure or a spectrum, and outputting numerical data? We know that ChatGPT can already solve our chemistry homework assignments, and the future appears bright for AI to accelerate molecular and materials discovery.