CREMP: Conformer-rotamer ensembles of macrocyclic peptides for machine learning

Grambow, Colin A.; Weir, Hayley; Cunningham, Christian N.; Biancalani, Tommaso; Chuang, Kangway V.

doi:10.1038/s41597-024-03698-y

Download PDF

Data Descriptor
Open access
Published: 09 August 2024

CREMP: Conformer-rotamer ensembles of macrocyclic peptides for machine learning

Scientific Data volume 11, Article number: 859 (2024) Cite this article

676 Accesses
9 Altmetric
Metrics details

Subjects

Abstract

Computational and machine learning approaches to model the conformational landscape of macrocyclic peptides have the potential to enable rational design and optimization. However, accurate, fast, and scalable methods for modeling macrocycle geometries remain elusive. Recent deep learning approaches have significantly accelerated protein structure prediction and the generation of small-molecule conformational ensembles, yet similar progress has not been made for macrocyclic peptides due to their unique properties. Here, we introduce CREMP, a resource generated for the rapid development and evaluation of machine learning models for macrocyclic peptides. CREMP contains 36,198 unique macrocyclic peptides and their high-quality structural ensembles generated using the Conformer-Rotamer Ensemble Sampling Tool (CREST). Altogether, this new dataset contains nearly 31.3 million unique macrocycle geometries, each annotated with energies derived from semi-empirical extended tight-binding (xTB) DFT calculations. Additionally, we include 3,258 macrocycles with reported passive permeability data to couple conformational ensembles to experiment. We anticipate that this dataset will enable the development of machine learning models that can improve peptide design and optimization for novel therapeutics.

Direct conformational sampling from peptide energy landscapes through hypernetwork-conditioned diffusion

Article 27 June 2024

Unified rational protein engineering with sequence-based deep representation learning

Article 21 October 2019

GEOM, energy-annotated molecular conformations for property prediction and molecular generation

Article Open access 21 April 2022

Background & Summary

Introduction

Macrocyclic peptides are an emerging class of therapeutics in drug discovery^1,2. Recent advances in affinity selection and display technologies have enabled ultra-large scale screening libraries that identify high-affinity and selective binders for challenging-to-drug proteins^3,4. Importantly, cyclic peptides occupy a unique biophysical space between small molecules and proteins, and cyclization imparts key properties such as increased proteolytic stability and conformational rigidity⁵. The resulting flexible-yet-constrained geometries enable macrocycles to bind to shallow protein surfaces and disrupt protein-protein interactions⁶. Intriguingly, the same dynamic conformational behavior that drives high-affinity binding also drives complex chameleonic properties such as permeability^7,8. Despite their therapeutic potential, this critical conformational behavior is intrinsically challenging to model—accurately modeling geometry and the immense conformational space is difficult to scale. Machine learning methods that can efficiently approximate high-level computational approaches have the potential to enable rational design, yet there currently exist no large-scale datasets of macrocycle structures.

Current Challenges and Approaches

Efficient and accurate conformer generation for macrocycles is challenging due to many factors, including their vast structural diversity, stereochemistry, number of rotatable bonds, and complex intramolecular interactions⁷. Furthermore, the challenges of modeling the vast conformational landscape is compounded by complex solvent-dependent effects⁹. Computational approaches to macrocycle conformer generation broadly leverage both heuristics- and physics-based algorithms to accurately model and sample molecular geometries. For example, both the open-source cheminformatics library RDKit^10,11,12,13 and commercial conformer generation program OpenEye OMEGA Macrocycle^14,15 combine distance geometry algorithms with molecular force fields¹⁶ for diverse macrocycle sampling. Alternatively, low-mode^17,18 and Monte-Carlo¹⁹ search methods have been found to be effective for sampling when combined with molecular dynamics as implemented in Schrödinger’s MacroModel²⁰ and Prime MCS²¹ packages. A critical limitation in the application of these methods to virtual screening is their high computational cost that limits their scalability. A single macrocyclic peptide, with its large number of rotatable bonds and cyclic constraints, readily takes 10³ − 10⁶ × more compute compared to a drug-like small molecule, with even more dramatic computation times when molecular dynamics approaches are used with explicit solvation^9,22.

Datasets to Enable Machine Learning

Recently, machine learning approaches have made significant progress in small-molecule energy prediction^{23,24,25,26,27}, conformer generation^{28,29,30,31,32,33}, and protein structure prediction^{34,35,36,37,38,39}. A critical and enabling aspect of these machine learning approaches has been access to large, high-quality training datasets. Experimentally, structural databases such as the Cambridge Structural Database (CSD)⁴⁰ and Protein Data Bank (PDB)⁴¹ provide 10⁵ to 10⁶ training examples. Similarly, computational datasets such as QM9^42,43 and GEOM⁴⁴ have inspired machine learning research to achieve DFT-level accuracy at a fraction of the computational cost^24,25,26. Unfortunately, extensive datasets on macrocycles remain scarce with only dozens to hundreds of available experimental structures. Computational datasets such as PEPCONF⁴⁵ and SPICE⁴⁶ datasets have recently begun to provide high-quality conformations at high levels of theory, but are rich in linear peptides; the total number of macrocycles remains insufficient for training machine learning models.

Here, we present a novel computational dataset of macrocyclic peptides, CREMP (Conformer-Rotamer Ensembles of Macrocyclic Peptides), generated with the Conformer-Rotamer Ensemble Sampling Tool (CREST) package⁴⁷. CREST leverages an iterative metadynamics algorithm with a genetic structure-crossing algorithm to explore diverse geometries, and leverages semiempirical quantum mechanical GFNn-xTB⁴⁸ calculations to provide better geometries and energy estimates than classical force fields. Initial explorations have demonstrated the potential for CREST to generate diverse macrocycle ensembles that recapitulate key intramolecular hydrogen bonds and the feasibility for ring interconversion, which is further supported by our Technical Validation below. Example conformers from a representative ensemble generated by CREST are illustrated in Fig. 1. We generate a unique set of 36,198 representative 4-, 5-, and 6-mer homodetic cyclic peptides and perform CREST simulations to generate extensive conformer ensembles in chloroform⁴⁹, providing nearly 31.3 million unique conformers with energy annotations. Chloroform was chosen as the implicit solvation model to approximate the nonpolar environment within cell membranes and facilitate modeling of membrane permeability. Although modeling aqueous environments is critical, especially for docking simulations in biological systems, achieving high-level accuracy requires more involved explicit solvent simulations, which are prohibitive for a dataset of this size within our computational allocation²². In total, this dataset constitutes 3.9 million CPU hours of compute. Summary statistics of the entire dataset are shown in Table 1.

Table 1 Dataset statistics for CREMP.

Full size table

Additionally, building upon the recently published CycPeptMPDB database⁵⁰, which contains experimental passive membrane permeability data compiled from a variety of literature sources but lacks structural information, we augment a selection of 3,258 6-, 7-, and 10-mer cyclic peptides with their CREST-computed conformer ensembles, yielding nearly 8.7 million additional unique conformers in this CREMP-CycPeptMPDB dataset. To our knowledge, this constitutes the first large-scale dataset containing both complete 3D-structural ensembles along with permeability annotations. We believe that such data will enable enhanced insight into the relationship between 3D structure and permeability. Summary statistics for these data are shown in Table 2.

Table 2 Dataset statistics for CREMP-CycPeptMPDB macrocyclic peptides.

Full size table

We make these datasets publicly available to provide the research community with a valuable data resource to train and develop machine learning models, and anticipate that it will find broad utility for learning models capable of predicting macrocyclic peptide properties, such as binding affinity, stability, and permeability, among others. We hope the insights gained from these models can be used for rational design of novel macrocycles with improved therapeutic potential. We anticipate that this work will not only accelerate the development of new macrocyclic therapeutics but also contribute to a deeper understanding of the factors governing their conformational behavior, paving the way for more efficient computational approaches in the future.

Methods

Overview

In this study, we used CREST to generate diverse conformer ensembles for a diverse set of macrocyclic peptides, with the aim of developing a representative dataset that captures the complexity of their conformational landscape. CREST was chosen due to its ability to efficiently explore the conformational space while providing a balance between computational cost and accuracy compared to force field-based methods.

Sequence Identities and Processing

Many factors contribute to the diverse conformational shapes of macrocyclic peptides, including ring size, side chains, stereochemistry, and chemical modifications. To gain coverage of a diverse set of molecules, we began with a previously enumerated set of cyclic tetrapeptides from Chan et al.⁵¹ and supplemented with additional tetra-, penta-, and hexapeptides, limiting our set to exclusively homodetic peptides. Macrocycle sequences were randomly sampled across three key parameters including 1) side chain - the canonical 20 side chains, 2) stereochemistry - we used both L- and D-amino acids, and 3) N-methylation. The resulting set covered a total of 36,198 unique sequences. All macrocycle sequences were converted to their corresponding canonical SMILES in RDKit to check for duplicates and simple dataset statistics were collected using RDKit (number of atoms, etc.).

Additionally, 3,258 sequences were selected from the CycPeptMPDB⁵⁰ database spanning both homo- and heterodetic hexa-, hepta-, and decapeptides.

RDKit Conformer Generation

CREST performs best when initialized with a structure that is sufficiently close to the lowest-energy conformer. To generate initial geometries, we used RDKit ETKDGv3^11,12 with macrocycle torsion preferences to generate conformers using EmbedMultipleConfs with up to 5,000 conformers (numConfs=5000) and random coordinate initialization (useRandomCoords=True), which has been shown to be beneficial for generating macrocycle geometries¹². Conformers were subsequently optimized using MMFF94¹⁶ as implemented in RDKit and sorted by energy. Finally, the sorted conformers were filtered based on heavy-atom RMSD with a threshold of 0.5 Å.

Optimization with xTB

CREST recommends initialization with a structure optimized at the same level of theory as used for the energy evaluations during the simulation. Therefore, for each molecule, we reoptimized the 1,000 lowest-energy conformers from RDKit/MMFF94 using GFN2-xTB⁴⁸ and the ALPB solvent model in chloroform⁴⁹. The resulting lowest-energy optimized structure was used as the single input to CREST.

CREST Simulation

The single xTB-optimized structure was used as input to CREST using solvation in chloroform and default arguments, which include an energy window of 6.0 kcal/mol, an RMSD threshold of 0.125 Å, and an energy threshold between conformers of 0.05 kcal/mol. Each calculation used 14 cores in order to efficiently parallelize across 14 metadynamics runs, and took an average of 2.3 hours for 4-mers, 5.8 hours for 5-mers, and 13.9 hours for 6-mers.

Graph Reidentification

It is possible for low-energy chemical reactions to occur during the CREST simulation, e.g., intramolecular proton transfers, which may produce ensembles that contain different constitutional isomers. We used OpenEye’s OEChem library to reidentify chemical graphs for each conformer in an ensemble and only retained ensembles where all conformers have the same chemical graph.

Data Records

We make the CREMP dataset available online⁵² at https://doi.org/10.5281/zenodo.7931444 with examples on how to load and analyze data in our public GitHub repository. We provide the data in two available formats, either as Python pickle files, which provide quick read access with RDKit version 2022.09.5 or later, and as text-based SDF files with associated metadata in JSON format. Each file is named based on its linearized amino acid sequence, with residues separated by periods, using standard one-letter codes with lowercase letters representing D-amino acids and Me prefixes representing N-methylated amino acids. The linearized sequences are in no particular order, e.g., C.R.E.M.P and R.E.M.P.C correspond to the same peptide macrocycle (no duplicated macrocycles are present in the dataset as outlined previously). The filename extensions are .pickle, .sdf, and .json.

Each file in pickle contains a Python dictionary with amino acid sequence, SMILES, CREST metadata (energy, entropy, etc.), and a single RDKit rdkit.Chem.Mol object containing all conformers. All conformers of each molecule were validated to be the same structural isomer, and all files were compressed into a single archive. In sdf_and_json, each individual SDF contains all conformers, each associated with its own JSON file that contains CREST metadata. Similarly, all are compressed into another single archive. A single summary CSV is also provided containing sequence, smiles, num_monomers, num_atoms, num_heavy_atoms, along with the CREST metadata totalconfs, uniqueconfs, lowestenergy, poplowestpct, temperature, ensembleenergy, ensembleentropy, and ensemblefreeenergy. The number of unique conformers with different 3D structures is given by uniqueconfs, while totalconfs includes the number of rotamers in addition.

We also make the CREMP-CycPeptMPDB dataset available⁵³ at https://doi.org/10.5281/zenodo.10798261, which contains the conformer ensembles of a subset of macrocyclic peptides available in the CycPeptMPDB database⁵⁰ containing cell membrane permeability annotations. The format of this data record is identical to that of CREMP.

Technical Validation

In addition to manually verifying the generated ensembles and designing the graph reidentification step so that all conformers in each ensemble have the same chemical graph, we analyzed the generated structures and performed additional characterization of CREST to validate its effectiveness in the space of macrocyclic peptides.

Analysis of Structures

We performed a systematic evaluation of the data generated to visualize and quantify the distributions of conformers within CREMP in order to characterize the conformational and structural diversity of the generated ensembles. The overall dataset statistics are shown in Table 1. The ensembles are diverse in size ranging from ones as small as a single low-energy conformer to a single ensemble with more than 12,000 conformers in the 6 kcal/mol energy window.

Figure 2 visualizes the distributions of ensemble size, energy, entropy, and the occupation probability of the highest-population conformer. The occupation probability distributions demonstrate that it is generally not sufficient to only consider the lowest-energy conformer and that including many low-lying states may be required.

Figure 3 illustrates the conformer ensembles of three different macrocyclic peptides using UMAP⁵⁴. Each UMAP plot is generated by computing the sine and cosine of each cycle torsion and projecting them into two dimensions across the entire conformer ensemble for each peptide. The distinct plots and several clusters within each plot demonstrate the conformational diversity of backbones within each macrocycle and across peptides. Additionally, each conformer is colored based on its number of internal hydrogen bonds, its radius of gyration, and its energy relative to the lowest-energy conformer in the ensemble. The different values of these properties for each cluster further illustrate the diversity of backbone conformations and the variation within each cluster. In particular, each backbone conformational cluster tends to be associated with a unique number of internal hydrogen bonds, which may stabilize a particular backbone conformation.

Structural Validation with NMR Ensembles

CREST has already been shown to recover relevant conformers of some macrocyclic peptides⁴⁷, but such assessment has been limited in scope due to low availability of experimental data on structural ensembles of macrocycles. Moreover, we expect both the experimental and computational methods to have a strong effect on the structures. In particular, the continuum solvent approximation ALPB⁴⁹ is likely not sufficient for strongly polar solvents, particularly when they are hydrogen-bond donors.

To investigate this, we performed a limited analysis on NMR ensembles available in the PDB. In order to characterize the quality of the CREST ensembles computed for the PDB structures, we compute Matching (MAT) scores³⁰. For each molecule:

$$\,{\rm{MAT}} \mbox{-} {\rm{RMSD}}\,\left({{\mathbb{S}}}_{{\rm{CREST}}},{{\mathbb{S}}}_{{\rm{PDB}}}\right)=\frac{1}{| {{\mathbb{S}}}_{{\rm{PDB}}}| }\sum _{{\bf{R}}{\prime} \in {{\mathbb{S}}}_{{\rm{PDB}}}}\mathop{\min }\limits_{{\bf{R}}\in {{\mathbb{S}}}_{{\rm{CREST}}}}\,{\rm{RMSD}}\,\left({\bf{R}},{\bf{R}}{\prime} \right)$$

(1)

where ${{\mathbb{S}}}_{{\rm{CREST}}}$ is the set of conformations generated by CREST and ${{\mathbb{S}}}_{{\rm{PDB}}}$ is the set of experimental NMR structures. R denotes a conformation and RMSD is computed using heavy atoms only. Intuitively, smaller MAT-RMSD scores correspond to CREST ensembles that contain more realistic conformers and are able to recover the relevant NMR conformers. Similarly, we define a MAT score for the ring torsion fingerprint deviation (rTFD):

$$\,{\rm{MAT}} \mbox{-} {\rm{rTFD}}\,\left({{\mathbb{S}}}_{{\rm{CREST}}},{{\mathbb{S}}}_{{\rm{PDB}}}\right)=\frac{1}{| {{\mathbb{S}}}_{{\rm{PDB}}}| }\sum _{{\bf{R}}{\prime} \in {{\mathbb{S}}}_{{\rm{PDB}}}}\mathop{\min }\limits_{{\bf{R}}\in {{\mathbb{S}}}_{{\rm{CREST}}}}\,{\rm{rTFD}}\,\left({\bf{R}},{\bf{R}}{\prime} \right)$$

(2)

where rTFD quantifies how well the torsion angles in the macrocycle match between two conformations R and ${\bf{R}}{\prime} $:¹²

$$\,{\rm{rTFD}}\,\left({\bf{R}},{\bf{R}}{\prime} \right)=\frac{1}{{N}_{{\rm{torsions}}}}\mathop{\sum }\limits_{i=1}^{{N}_{{\rm{torsions}}}}\frac{| w\left({\tau }_{i}\left({\bf{R}}\right)-{\tau }_{i}\left({\bf{R}}{\prime} \right)\right)| }{\pi }$$

(3)

Here, ${\tau }_{i}\left({\bf{R}}\right)$ computes the i-th torsion angle for conformation R and w(⋅) wraps its argument around the cyclic domain [−π, π) on which torsion angles are defined. The π in the denominator normalizes each deviation so that rTFD lies in [0, 1]. Smaller values of MAT-rTFD correspond to CREST ensembles that accurately reproduce the ring torsions of the reference NMR data. We also compute a coverage metric, which measures the fraction of correctly generated conformers within an RMSD threshold of δ_RMSD³⁰. In other words, the coverage metric quantifies how many experimental conformers are recovered with CREST. It is given by:

$$\,{\rm{Coverage}}\,\left({{\mathbb{S}}}_{{\rm{CREST}}},{{\mathbb{S}}}_{{\rm{PDB}}}\right)=\frac{1}{| {{\mathbb{S}}}_{{\rm{PDB}}}| }\left|\left\{{\bf{R}}{\prime} \in {{\mathbb{S}}}_{{\rm{PDB}}}:\exists {\bf{R}}\in {{\mathbb{S}}}_{{\rm{CREST}}},\,\,{\rm{RMSD}}\,\left({\bf{R}},{\bf{R}}{\prime} \right)\le {\delta }_{{\rm{RMSD}}}\right\}\right|$$

(4)

Figure 4 illustrates the range of MAT-RMSD and MAT-rTFD scores computed between CREST and NMR ensembles in various solvents as well as the coverage over a range of RMSD thresholds. While the performance of CREST is relatively poor in polar solvents, especially water, likely due to the use of implicit solvation, it excels at generating high-quality ensembles in chloroform with MAT-RMSD ~1 Å and MAT-rTFD ~0.1. Moreover, the coverage plot shows that all experimental conformers in chloroform can be recovered within a threshold of ~1.5 Å.

To provide more fine-grained insight into the ensembles in chloroform, we plotted the conformers from six available PDB structures along with the corresponding CREST ensembles on Ramachandran plots⁵⁵ in Fig. 5 and highlighted the CREST conformers closest to the PDB conformers based on rTFD, i.e., those that result from the minimization in Eq. (2). The high-density regions of the CREST ensemble overlap with the locations of the PDB conformers, thereby further demonstrating the ability of CREST to generate high-quality ensembles in chloroform.

Usage Notes

Data are provided in two main formats: binary pickle files and text-based SDF and JSON files. SDF and JSON are provided because they are human-readable, software-agnostic, and guarantee maximum interoperability. SDF files can also be read by every major cheminformatics program, which makes them especially suitable for users that are not as familiar with the Python programming environment. The pickle files are provided because they are much faster to load and directly yield RDKit molecules that can be processed in machine learning and cheminformatics workflows. However, pickle files have the downside that they require specific Python packages and sometimes specific versions of those. For the data published here, the Python environment at https://github.com/Genentech/cremp provides details about the required packages and their versions and provides instructions for loading the data. RDKit is a major dependency and version 2022.09.5 or newer is required in order to load the data successfully.

Code availability

Our workflow is built on several open-source toolkits including the RDKit¹⁰, xTB⁴⁸, and CREST⁴⁷. xTB and CREST are both freely available (https://github.com/grimme-lab/xtb/releases and https://github.com/grimme-lab/crest/releases). OpenEye Applications and Toolkits was used for graph reidentification. We provide code with instructions on how to load the data and perform simple CREST runs at https://github.com/Genentech/cremp.

References

Driggers, E. M., Hale, S. P., Lee, J. & Terrett, N. K. The exploration of macrocycles for drug discovery–an underexploited structural class. Nat. Rev. Drug Discov. 7, 608–624 (2008).
Article CAS PubMed Google Scholar
Muttenthaler, M., King, G. F., Adams, D. J. & Alewood, P. F. Trends in peptide drug discovery. Nat. Rev. Drug Discov. 20, 309–325 (2021).
Article CAS PubMed Google Scholar
Huang, Y., Wiedmann, M. M. & Suga, H. RNA display methods for the discovery of bioactive macrocycles. Chem. Rev. 119, 10360–10391 (2019).
Article CAS PubMed Google Scholar
Vinogradov, A. A., Yin, Y. & Suga, H. Macrocyclic peptides as drug candidates: Recent progress and remaining challenges. J. Am. Chem. Soc. 141, 4167–4181 (2019).
Article CAS PubMed Google Scholar
Shinbara, K., Liu, W., van Neer, R. H. P., Katoh, T. & Suga, H. Methodologies for backbone macrocyclic peptide synthesis compatible with screening technologies. Front. Chem. 8, 447 (2020).
Article ADS CAS PubMed PubMed Central Google Scholar
Villar, E. A. et al. How proteins bind macrocycles. Nat. Chem. Biol. 10, 723–731 (2014).
Article CAS PubMed PubMed Central Google Scholar
Whitty, A. et al. Quantifying the chameleonic properties of macrocycles and other high-molecular-weight drugs. Drug Discov. Today 21, 712–717 (2016).
Article CAS PubMed PubMed Central Google Scholar
Bhardwaj, G. et al. Accurate de novo design of membrane-traversing macrocycles. Cell 185, 3520–3532.e26 (2022).
Article PubMed PubMed Central Google Scholar
Linker, S. M. et al. Lessons for oral bioavailability: How conformationally flexible cyclic peptides enter and cross lipid membranes. J. Med. Chem. 66, 2773–2788 (2023).
Article CAS PubMed PubMed Central Google Scholar
Landrum, G. RDKit: Open-source cheminformatics (2006).
Riniker, S. & Landrum, G. A. Better informed distance geometry: Using what we know to improve conformation generation. J. Chem. Inf. Model. 55, 2562–2574 (2015).
Article CAS PubMed Google Scholar
Wang, S., Witek, J., Landrum, G. A. & Riniker, S. Improving conformer generation for small rings and macrocycles based on distance geometry and experimental torsional-angle preferences. J. Chem. Inf. Model. 60, 2044–2058 (2020).
Article CAS PubMed Google Scholar
Wang, S. et al. Incorporating NOE-Derived distances in conformer generation of cyclic peptides with distance geometry. J. Chem. Inf. Model. 62, 472–485 (2022).
Article CAS PubMed Google Scholar
Hawkins, P. C. D., Skillman, A. G., Warren, G. L., Ellingson, B. A. & Stahl, M. T. Conformer generation with OMEGA: algorithm and validation using high quality structures from the protein databank and cambridge structural database. J. Chem. Inf. Model. 50, 572–584 (2010).
Article CAS PubMed PubMed Central Google Scholar
Hawkins, P. C. D. & Nicholls, A. Conformer generation with OMEGA: learning from the data set and the analysis of failures. J. Chem. Inf. Model. 52, 2919–2936 (2012).
Article CAS PubMed Google Scholar
Halgren, T. A. Merck molecular force field. v. extension of MMFF94 using experimental data, additional computational data, and empirical rules. J. Comput. Chem. 17, 616–641 (1996).
Article CAS Google Scholar
Kolossváry, I. & Guida, W. C. Low mode search. an efficient, automated computational method for conformational analysis: Application to cyclic and acyclic alkanes and cyclic peptides. J. Am. Chem. Soc. 118, 5011–5019 (1996).
Article Google Scholar
Kolossváry, I. & Guida, W. C. Low-mode conformational search elucidated: Application to C39H80 and flexible docking of 9-deazaguanine inhibitors into PNP. J. Comput. Chem. 20, 1671–1684 (1999).
Article Google Scholar
Chang, G., Guida, W. C. & Still, W. C. An internal-coordinate monte carlo method for searching conformational space. J. Am. Chem. Soc. 111, 4379–4386 (1989).
Article CAS Google Scholar
Watts, K. S., Dalal, P., Tebben, A. J., Cheney, D. L. & Shelley, J. C. Macrocycle conformational sampling with MacroModel. J. Chem. Inf. Model. 54, 2680–2696 (2014).
Article CAS PubMed Google Scholar
Sindhikara, D. et al. Improving accuracy, diversity, and speed with prime macrocycle conformational sampling. J. Chem. Inf. Model. 57, 1881–1894 (2017).
Article CAS PubMed Google Scholar
Damjanovic, J., Miao, J., Huang, H. & Lin, Y.-S. Elucidating solution structures of cyclic peptides using molecular dynamics simulations. Chem. Rev. 121, 2292–2324 (2021).
Article CAS PubMed PubMed Central Google Scholar
Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O. & Dahl, G. E. Neural message passing for quantum chemistry. In Precup, D. & Teh, Y. W. (eds.) Proceedings of the 34th International Conference on Machine Learning, vol. 70 of Proceedings of Machine Learning Research, 1263–1272 (PMLR, 2017).
Smith, J. S., Isayev, O. & Roitberg, A. E. ANI-1: an extensible neural network potential with DFT accuracy at force field computational cost. Chem. Sci. 8, 3192–3203 (2017).
Article CAS PubMed PubMed Central Google Scholar
Schütt, K. et al. Schnet: A continuous-filter convolutional neural network for modeling quantum interactions. In Guyon, I. et al. (eds.) Advances in Neural Information Processing Systems, vol. 30 (Curran Associates, Inc., 2017).
Gasteiger, J., Groß, J. & Günnemann, S. Directional message passing for molecular graphs. In International Conference on Learning Representations (2020).
Liu, Y. et al. Spherical message passing for 3d molecular graphs. In International Conference on Learning Representations (2022).
Mansimov, E., Mahmood, O., Kang, S. & Cho, K. Molecular geometry prediction using a deep generative graph neural network. Sci. Rep. 9, 20381 (2019).
Article ADS PubMed PubMed Central Google Scholar
Simm, G. & Hernandez-Lobato, J. M. A generative model for molecular distance geometry. In III, H. D. & Singh, A. (eds.) Proceedings of the 37th International Conference on Machine Learning, vol. 119 of Proceedings of Machine Learning Research, 8949–8958 (PMLR, 2020).
Xu, M., Luo, S., Bengio, Y., Peng, J. & Tang, J. Learning neural generative dynamics for molecular conformation generation. In International Conference on Learning Representations (2021).
Xu, M. et al. Geodiff: A geometric diffusion model for molecular conformation generation. In International Conference on Learning Representations (2022).
Stärk, H., Ganea, O., Pattanaik, L., Barzilay, R. & Jaakkola, T. EquiBind: Geometric deep learning for drug binding structure prediction. In Chaudhuri, K. et al. (eds.) Proceedings of the 39th International Conference on Machine Learning, vol. 162 of Proceedings of Machine Learning Research, 20503–20521 (PMLR, 2022).
Jing, B., Corso, G., Chang, J., Barzilay, R. & Jaakkola, T. Torsional diffusion for molecular conformer generation. In Koyejo, S. et al. (eds.) Advances in Neural Information Processing Systems, vol. 35, 24240–24253 (Curran Associates, Inc., 2022).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Article ADS CAS PubMed PubMed Central Google Scholar
Baek, M. et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871–876 (2021).
Article ADS CAS PubMed PubMed Central Google Scholar
Wu, R. et al. High-resolution de novo structure prediction from primary sequence (2022).
Anand, N. & Achim, T. Protein structure and sequence generation with equivariant denoising diffusion probabilistic models (2022).
Yim, J. et al. SE(3) diffusion model with application to protein backbone generation (2023) 2302.02277.
Wu, K. E. et al. Protein structure generation via folding diffusion. Nat. Commun. 15, 1059 (2024).
Article ADS CAS PubMed PubMed Central Google Scholar
Groom, C. R., Bruno, I. J., Lightfoot, M. P. & Ward, S. C. The cambridge structural database. Acta Crystallogr B Struct Sci Cryst Eng Mater 72, 171–179 (2016).
Article ADS CAS PubMed PubMed Central Google Scholar
Berman, H. M. et al. The protein data bank. Nucleic Acids Res. 28, 235–242 (2000).
Article ADS CAS PubMed PubMed Central Google Scholar
Ruddigkeit, L., van Deursen, R., Blum, L. C. & Reymond, J.-L. Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. J. Chem. Inf. Model. 52, 2864–2875 (2012).
Article CAS PubMed Google Scholar
Ramakrishnan, R., Dral, P. O., Rupp, M. & von Lilienfeld, O. A. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data 1 (2014).
Axelrod, S. & Gómez-Bombarelli, R. GEOM, energy-annotated molecular conformations for property prediction and molecular generation. Sci Data 9, 185 (2022).
Prasad, V. K., Otero-de-la Roza, A. & DiLabio, G. A. PEPCONF, a diverse data set of peptide conformational energies. Sci Data 6, 180310 (2019).
Article CAS PubMed PubMed Central Google Scholar
Eastman, P. et al. SPICE, a dataset of drug-like molecules and peptides for training machine learning potentials. Sci Data 10, 11 (2023).
Article CAS PubMed PubMed Central Google Scholar
Pracht, P., Bohle, F. & Grimme, S. Automated exploration of the low-energy chemical space with fast quantum chemical methods. Phys. Chem. Chem. Phys. 22, 7169–7192 (2020).
Article CAS PubMed Google Scholar
Bannwarth, C., Ehlert, S. & Grimme, S. GFN2-xTB-an accurate and broadly parametrized self-consistent tight-binding quantum chemical method with multipole electrostatics and density-dependent dispersion contributions. J. Chem. Theory Comput. 15, 1652–1671 (2019).
Article CAS PubMed Google Scholar
Ehlert, S., Stahn, M., Spicher, S. & Grimme, S. Robust and efficient implicit solvation model for fast semiempirical methods. J. Chem. Theory Comput. 17, 4250–4261 (2021).
Article CAS PubMed Google Scholar
Li, J. et al. CycPeptMPDB: A Comprehensive Database of Membrane Permeability of Cyclic Peptides. Journal of Chemical Information and Modeling. https://doi.org/10.1021/acs.jcim.2c01573 (2023).
Chan, L., Morris, G. M. & Hutchison, G. R. Understanding conformational entropy in small molecules. J. Chem. Theory Comput. 17, 2099–2106 (2021).
Article CAS PubMed Google Scholar
Grambow, C. A., Weir, H., Cunningham, C. N., Biancalani, T. & Chuang, K. V. CREMP: Conformer-Rotamer Ensembles of Macrocyclic Peptides for Machine Learning. Zenodo https://doi.org/10.5281/zenodo.7931444 (2023).
Grambow, C. A., Weir, H., Cunningham, C. N., Biancalani, T. & Chuang, K. V. CREMP-CycPeptMPDB: Conformer-Rotamer Ensembles of Macrocyclic Peptides with Permeability Annotations. Zenodo https://doi.org/10.5281/zenodo.10798261 (2024).
McInnes, L., Healy, J. & Melville, J. Umap: Uniform manifold approximation and projection for dimension reduction (2020). 1802.03426.
Ramachandran, G. N. & Sasisekharan, V. Conformation of polypeptides and proteins. Adv. Protein Chem. 23, 283–438 (1968).
Article CAS PubMed Google Scholar

Download references

Acknowledgements

We gratefully acknowledge Benjamin Sellers for helpful discussions on earlier versions of dataset generation. We also thank members of the Departments of Biology Research and AI Development, Peptide Therapeutics, and Discovery Chemistry at Genentech Research and Early Development for insightful discussions.

Author information

Authors and Affiliations

Prescient Design, Genentech, 1 DNA Way, South San Francisco, CA, 94080, USA
Colin A. Grambow, Hayley Weir & Kangway V. Chuang
Department of Peptide Therapeutics, Genentech, 1 DNA Way, South San Francisco, CA, 94080, USA
Christian N. Cunningham
Biology Research | Development, Genentech, 1 DNA Way, South San Francisco, CA, 94080, USA
Tommaso Biancalani

Authors

Colin A. Grambow
View author publications
You can also search for this author in PubMed Google Scholar
Hayley Weir
View author publications
You can also search for this author in PubMed Google Scholar
Christian N. Cunningham
View author publications
You can also search for this author in PubMed Google Scholar
Tommaso Biancalani
View author publications
You can also search for this author in PubMed Google Scholar
Kangway V. Chuang
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

C.A.G., T.B., and K.V.C. conceived the project. C.A.G. and H.W. performed the calculations and computational analysis. C.C. provided technical and scientific input on macrocyclic peptides. All authors provided feedback on the analyses. C.A.G. and K.V.C. wrote the manuscript with input and revisions from all authors.

Corresponding authors

Correspondence to Colin A. Grambow or Kangway V. Chuang.

Ethics declarations

Competing interests

All authors are employees of Genentech, Inc. and shareholders of Roche.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Grambow, C.A., Weir, H., Cunningham, C.N. et al. CREMP: Conformer-rotamer ensembles of macrocyclic peptides for machine learning. Sci Data 11, 859 (2024). https://doi.org/10.1038/s41597-024-03698-y

Download citation

Received: 11 March 2024
Accepted: 29 July 2024
Published: 09 August 2024
DOI: https://doi.org/10.1038/s41597-024-03698-y