Background & Summary

Introduction

Macrocyclic peptides are an emerging class of therapeutics in drug discovery1,2. Recent advances in affinity selection and display technologies have enabled ultra-large scale screening libraries that identify high-affinity and selective binders for challenging-to-drug proteins3,4. Importantly, cyclic peptides occupy a unique biophysical space between small molecules and proteins, and cyclization imparts key properties such as increased proteolytic stability and conformational rigidity5. The resulting flexible-yet-constrained geometries enable macrocycles to bind to shallow protein surfaces and disrupt protein-protein interactions6. Intriguingly, the same dynamic conformational behavior that drives high-affinity binding also drives complex chameleonic properties such as permeability7,8. Despite their therapeutic potential, this critical conformational behavior is intrinsically challenging to model—accurately modeling geometry and the immense conformational space is difficult to scale. Machine learning methods that can efficiently approximate high-level computational approaches have the potential to enable rational design, yet there currently exist no large-scale datasets of macrocycle structures.

Current Challenges and Approaches

Efficient and accurate conformer generation for macrocycles is challenging due to many factors, including their vast structural diversity, stereochemistry, number of rotatable bonds, and complex intramolecular interactions7. Furthermore, the challenges of modeling the vast conformational landscape is compounded by complex solvent-dependent effects9. Computational approaches to macrocycle conformer generation broadly leverage both heuristics- and physics-based algorithms to accurately model and sample molecular geometries. For example, both the open-source cheminformatics library RDKit10,11,12,13 and commercial conformer generation program OpenEye OMEGA Macrocycle14,15 combine distance geometry algorithms with molecular force fields16 for diverse macrocycle sampling. Alternatively, low-mode17,18 and Monte-Carlo19 search methods have been found to be effective for sampling when combined with molecular dynamics as implemented in Schrödinger’s MacroModel20 and Prime MCS21 packages. A critical limitation in the application of these methods to virtual screening is their high computational cost that limits their scalability. A single macrocyclic peptide, with its large number of rotatable bonds and cyclic constraints, readily takes 103 − 106  × more compute compared to a drug-like small molecule, with even more dramatic computation times when molecular dynamics approaches are used with explicit solvation9,22.

Datasets to Enable Machine Learning

Recently, machine learning approaches have made significant progress in small-molecule energy prediction23,24,25,26,27, conformer generation28,29,30,31,32,33, and protein structure prediction34,35,36,37,38,39. A critical and enabling aspect of these machine learning approaches has been access to large, high-quality training datasets. Experimentally, structural databases such as the Cambridge Structural Database (CSD)40 and Protein Data Bank (PDB)41 provide 105 to 106 training examples. Similarly, computational datasets such as QM942,43 and GEOM44 have inspired machine learning research to achieve DFT-level accuracy at a fraction of the computational cost24,25,26. Unfortunately, extensive datasets on macrocycles remain scarce with only dozens to hundreds of available experimental structures. Computational datasets such as PEPCONF45 and SPICE46 datasets have recently begun to provide high-quality conformations at high levels of theory, but are rich in linear peptides; the total number of macrocycles remains insufficient for training machine learning models.

Here, we present a novel computational dataset of macrocyclic peptides, CREMP (Conformer-Rotamer Ensembles of Macrocyclic Peptides), generated with the Conformer-Rotamer Ensemble Sampling Tool (CREST) package47. CREST leverages an iterative metadynamics algorithm with a genetic structure-crossing algorithm to explore diverse geometries, and leverages semiempirical quantum mechanical GFNn-xTB48 calculations to provide better geometries and energy estimates than classical force fields. Initial explorations have demonstrated the potential for CREST to generate diverse macrocycle ensembles that recapitulate key intramolecular hydrogen bonds and the feasibility for ring interconversion, which is further supported by our Technical Validation below. Example conformers from a representative ensemble generated by CREST are illustrated in Fig. 1. We generate a unique set of 36,198 representative 4-, 5-, and 6-mer homodetic cyclic peptides and perform CREST simulations to generate extensive conformer ensembles in chloroform49, providing nearly 31.3 million unique conformers with energy annotations. Chloroform was chosen as the implicit solvation model to approximate the nonpolar environment within cell membranes and facilitate modeling of membrane permeability. Although modeling aqueous environments is critical, especially for docking simulations in biological systems, achieving high-level accuracy requires more involved explicit solvent simulations, which are prohibitive for a dataset of this size within our computational allocation22. In total, this dataset constitutes 3.9 million CPU hours of compute. Summary statistics of the entire dataset are shown in Table 1.

Fig. 1
figure 1

A small macrocyclic peptide ensemble generated with CREST.

Table 1 Dataset statistics for CREMP.

Additionally, building upon the recently published CycPeptMPDB database50, which contains experimental passive membrane permeability data compiled from a variety of literature sources but lacks structural information, we augment a selection of 3,258 6-, 7-, and 10-mer cyclic peptides with their CREST-computed conformer ensembles, yielding nearly 8.7 million additional unique conformers in this CREMP-CycPeptMPDB dataset. To our knowledge, this constitutes the first large-scale dataset containing both complete 3D-structural ensembles along with permeability annotations. We believe that such data will enable enhanced insight into the relationship between 3D structure and permeability. Summary statistics for these data are shown in Table 2.

Table 2 Dataset statistics for CREMP-CycPeptMPDB macrocyclic peptides.

We make these datasets publicly available to provide the research community with a valuable data resource to train and develop machine learning models, and anticipate that it will find broad utility for learning models capable of predicting macrocyclic peptide properties, such as binding affinity, stability, and permeability, among others. We hope the insights gained from these models can be used for rational design of novel macrocycles with improved therapeutic potential. We anticipate that this work will not only accelerate the development of new macrocyclic therapeutics but also contribute to a deeper understanding of the factors governing their conformational behavior, paving the way for more efficient computational approaches in the future.

Methods

Overview

In this study, we used CREST to generate diverse conformer ensembles for a diverse set of macrocyclic peptides, with the aim of developing a representative dataset that captures the complexity of their conformational landscape. CREST was chosen due to its ability to efficiently explore the conformational space while providing a balance between computational cost and accuracy compared to force field-based methods.

Sequence Identities and Processing

Many factors contribute to the diverse conformational shapes of macrocyclic peptides, including ring size, side chains, stereochemistry, and chemical modifications. To gain coverage of a diverse set of molecules, we began with a previously enumerated set of cyclic tetrapeptides from Chan et al.51 and supplemented with additional tetra-, penta-, and hexapeptides, limiting our set to exclusively homodetic peptides. Macrocycle sequences were randomly sampled across three key parameters including 1) side chain - the canonical 20 side chains, 2) stereochemistry - we used both L- and D-amino acids, and 3) N-methylation. The resulting set covered a total of 36,198 unique sequences. All macrocycle sequences were converted to their corresponding canonical SMILES in RDKit to check for duplicates and simple dataset statistics were collected using RDKit (number of atoms, etc.).

Additionally, 3,258 sequences were selected from the CycPeptMPDB50 database spanning both homo- and heterodetic hexa-, hepta-, and decapeptides.

RDKit Conformer Generation

CREST performs best when initialized with a structure that is sufficiently close to the lowest-energy conformer. To generate initial geometries, we used RDKit ETKDGv311,12 with macrocycle torsion preferences to generate conformers using EmbedMultipleConfs with up to 5,000 conformers (numConfs=5000) and random coordinate initialization (useRandomCoords=True), which has been shown to be beneficial for generating macrocycle geometries12. Conformers were subsequently optimized using MMFF9416 as implemented in RDKit and sorted by energy. Finally, the sorted conformers were filtered based on heavy-atom RMSD with a threshold of 0.5 Å.

Optimization with xTB

CREST recommends initialization with a structure optimized at the same level of theory as used for the energy evaluations during the simulation. Therefore, for each molecule, we reoptimized the 1,000 lowest-energy conformers from RDKit/MMFF94 using GFN2-xTB48 and the ALPB solvent model in chloroform49. The resulting lowest-energy optimized structure was used as the single input to CREST.

CREST Simulation

The single xTB-optimized structure was used as input to CREST using solvation in chloroform and default arguments, which include an energy window of 6.0 kcal/mol, an RMSD threshold of 0.125 Å, and an energy threshold between conformers of 0.05 kcal/mol. Each calculation used 14 cores in order to efficiently parallelize across 14 metadynamics runs, and took an average of 2.3 hours for 4-mers, 5.8 hours for 5-mers, and 13.9 hours for 6-mers.

Graph Reidentification

It is possible for low-energy chemical reactions to occur during the CREST simulation, e.g., intramolecular proton transfers, which may produce ensembles that contain different constitutional isomers. We used OpenEye’s OEChem library to reidentify chemical graphs for each conformer in an ensemble and only retained ensembles where all conformers have the same chemical graph.

Data Records

We make the CREMP dataset available online52 at https://doi.org/10.5281/zenodo.7931444 with examples on how to load and analyze data in our public GitHub repository. We provide the data in two available formats, either as Python pickle files, which provide quick read access with RDKit version 2022.09.5 or later, and as text-based SDF files with associated metadata in JSON format. Each file is named based on its linearized amino acid sequence, with residues separated by periods, using standard one-letter codes with lowercase letters representing D-amino acids and Me prefixes representing N-methylated amino acids. The linearized sequences are in no particular order, e.g., C.R.E.M.P and R.E.M.P.C correspond to the same peptide macrocycle (no duplicated macrocycles are present in the dataset as outlined previously). The filename extensions are .pickle, .sdf, and .json.

Each file in pickle contains a Python dictionary with amino acid sequence, SMILES, CREST metadata (energy, entropy, etc.), and a single RDKit rdkit.Chem.Mol object containing all conformers. All conformers of each molecule were validated to be the same structural isomer, and all files were compressed into a single archive. In sdf_and_json, each individual SDF contains all conformers, each associated with its own JSON file that contains CREST metadata. Similarly, all are compressed into another single archive. A single summary CSV is also provided containing sequence, smiles, num_monomers, num_atoms, num_heavy_atoms, along with the CREST metadata totalconfs, uniqueconfs, lowestenergy, poplowestpct, temperature, ensembleenergy, ensembleentropy, and ensemblefreeenergy. The number of unique conformers with different 3D structures is given by uniqueconfs, while totalconfs includes the number of rotamers in addition.

We also make the CREMP-CycPeptMPDB dataset available53 at https://doi.org/10.5281/zenodo.10798261, which contains the conformer ensembles of a subset of macrocyclic peptides available in the CycPeptMPDB database50 containing cell membrane permeability annotations. The format of this data record is identical to that of CREMP.

Technical Validation

In addition to manually verifying the generated ensembles and designing the graph reidentification step so that all conformers in each ensemble have the same chemical graph, we analyzed the generated structures and performed additional characterization of CREST to validate its effectiveness in the space of macrocyclic peptides.

Analysis of Structures

We performed a systematic evaluation of the data generated to visualize and quantify the distributions of conformers within CREMP in order to characterize the conformational and structural diversity of the generated ensembles. The overall dataset statistics are shown in Table 1. The ensembles are diverse in size ranging from ones as small as a single low-energy conformer to a single ensemble with more than 12,000 conformers in the 6 kcal/mol energy window.

Figure 2 visualizes the distributions of ensemble size, energy, entropy, and the occupation probability of the highest-population conformer. The occupation probability distributions demonstrate that it is generally not sufficient to only consider the lowest-energy conformer and that including many low-lying states may be required.

Fig. 2
figure 2

Distributions of CREST-ensemble quantities.

Figure 3 illustrates the conformer ensembles of three different macrocyclic peptides using UMAP54. Each UMAP plot is generated by computing the sine and cosine of each cycle torsion and projecting them into two dimensions across the entire conformer ensemble for each peptide. The distinct plots and several clusters within each plot demonstrate the conformational diversity of backbones within each macrocycle and across peptides. Additionally, each conformer is colored based on its number of internal hydrogen bonds, its radius of gyration, and its energy relative to the lowest-energy conformer in the ensemble. The different values of these properties for each cluster further illustrate the diversity of backbone conformations and the variation within each cluster. In particular, each backbone conformational cluster tends to be associated with a unique number of internal hydrogen bonds, which may stabilize a particular backbone conformation.

Fig. 3
figure 3

Two-dimensional projections of sine- and cosine-transformed cycle torsions generated using UMAP and colored by 3D properties. Each point represents a unique conformer. Each peptide macrocycle has a distinct conformer ensemble characterized by several backbone conformations corresponding to the clusters on the UMAP plots.

Structural Validation with NMR Ensembles

CREST has already been shown to recover relevant conformers of some macrocyclic peptides47, but such assessment has been limited in scope due to low availability of experimental data on structural ensembles of macrocycles. Moreover, we expect both the experimental and computational methods to have a strong effect on the structures. In particular, the continuum solvent approximation ALPB49 is likely not sufficient for strongly polar solvents, particularly when they are hydrogen-bond donors.

To investigate this, we performed a limited analysis on NMR ensembles available in the PDB. In order to characterize the quality of the CREST ensembles computed for the PDB structures, we compute Matching (MAT) scores30. For each molecule:

$$\,{\rm{MAT}} \mbox{-} {\rm{RMSD}}\,\left({{\mathbb{S}}}_{{\rm{CREST}}},{{\mathbb{S}}}_{{\rm{PDB}}}\right)=\frac{1}{| {{\mathbb{S}}}_{{\rm{PDB}}}| }\sum _{{\bf{R}}{\prime} \in {{\mathbb{S}}}_{{\rm{PDB}}}}\mathop{\min }\limits_{{\bf{R}}\in {{\mathbb{S}}}_{{\rm{CREST}}}}\,{\rm{RMSD}}\,\left({\bf{R}},{\bf{R}}{\prime} \right)$$
(1)

where \({{\mathbb{S}}}_{{\rm{CREST}}}\) is the set of conformations generated by CREST and \({{\mathbb{S}}}_{{\rm{PDB}}}\) is the set of experimental NMR structures. R denotes a conformation and RMSD is computed using heavy atoms only. Intuitively, smaller MAT-RMSD scores correspond to CREST ensembles that contain more realistic conformers and are able to recover the relevant NMR conformers. Similarly, we define a MAT score for the ring torsion fingerprint deviation (rTFD):

$$\,{\rm{MAT}} \mbox{-} {\rm{rTFD}}\,\left({{\mathbb{S}}}_{{\rm{CREST}}},{{\mathbb{S}}}_{{\rm{PDB}}}\right)=\frac{1}{| {{\mathbb{S}}}_{{\rm{PDB}}}| }\sum _{{\bf{R}}{\prime} \in {{\mathbb{S}}}_{{\rm{PDB}}}}\mathop{\min }\limits_{{\bf{R}}\in {{\mathbb{S}}}_{{\rm{CREST}}}}\,{\rm{rTFD}}\,\left({\bf{R}},{\bf{R}}{\prime} \right)$$
(2)

where rTFD quantifies how well the torsion angles in the macrocycle match between two conformations R and \({\bf{R}}{\prime} \):12

$$\,{\rm{rTFD}}\,\left({\bf{R}},{\bf{R}}{\prime} \right)=\frac{1}{{N}_{{\rm{torsions}}}}\mathop{\sum }\limits_{i=1}^{{N}_{{\rm{torsions}}}}\frac{| w\left({\tau }_{i}\left({\bf{R}}\right)-{\tau }_{i}\left({\bf{R}}{\prime} \right)\right)| }{\pi }$$
(3)

Here, \({\tau }_{i}\left({\bf{R}}\right)\) computes the i-th torsion angle for conformation R and w() wraps its argument around the cyclic domain [−π, π) on which torsion angles are defined. The π in the denominator normalizes each deviation so that rTFD lies in [0, 1]. Smaller values of MAT-rTFD correspond to CREST ensembles that accurately reproduce the ring torsions of the reference NMR data. We also compute a coverage metric, which measures the fraction of correctly generated conformers within an RMSD threshold of δRMSD30. In other words, the coverage metric quantifies how many experimental conformers are recovered with CREST. It is given by:

$$\,{\rm{Coverage}}\,\left({{\mathbb{S}}}_{{\rm{CREST}}},{{\mathbb{S}}}_{{\rm{PDB}}}\right)=\frac{1}{| {{\mathbb{S}}}_{{\rm{PDB}}}| }\left|\left\{{\bf{R}}{\prime} \in {{\mathbb{S}}}_{{\rm{PDB}}}:\exists {\bf{R}}\in {{\mathbb{S}}}_{{\rm{CREST}}},\,\,{\rm{RMSD}}\,\left({\bf{R}},{\bf{R}}{\prime} \right)\le {\delta }_{{\rm{RMSD}}}\right\}\right|$$
(4)

Figure 4 illustrates the range of MAT-RMSD and MAT-rTFD scores computed between CREST and NMR ensembles in various solvents as well as the coverage over a range of RMSD thresholds. While the performance of CREST is relatively poor in polar solvents, especially water, likely due to the use of implicit solvation, it excels at generating high-quality ensembles in chloroform with MAT-RMSD ~1 Å and MAT-rTFD ~0.1. Moreover, the coverage plot shows that all experimental conformers in chloroform can be recovered within a threshold of ~1.5 Å.

Fig. 4
figure 4

Matching (MAT) scores for RMSD (left), mean coverage at different RMSD thresholds (middle), and MAT for rTFD (right) between CREST ensembles and NMR ensembles of macrocyclic peptides obtained from the PDB in different solvents.

To provide more fine-grained insight into the ensembles in chloroform, we plotted the conformers from six available PDB structures along with the corresponding CREST ensembles on Ramachandran plots55 in Fig. 5 and highlighted the CREST conformers closest to the PDB conformers based on rTFD, i.e., those that result from the minimization in Eq. (2). The high-density regions of the CREST ensemble overlap with the locations of the PDB conformers, thereby further demonstrating the ability of CREST to generate high-quality ensembles in chloroform.

Fig. 5
figure 5

Ramachandran plots comparing CREST ensembles to NMR ensembles of macrocyclic peptides obtained from the PDB.

Usage Notes

Data are provided in two main formats: binary pickle files and text-based SDF and JSON files. SDF and JSON are provided because they are human-readable, software-agnostic, and guarantee maximum interoperability. SDF files can also be read by every major cheminformatics program, which makes them especially suitable for users that are not as familiar with the Python programming environment. The pickle files are provided because they are much faster to load and directly yield RDKit molecules that can be processed in machine learning and cheminformatics workflows. However, pickle files have the downside that they require specific Python packages and sometimes specific versions of those. For the data published here, the Python environment at https://github.com/Genentech/cremp provides details about the required packages and their versions and provides instructions for loading the data. RDKit is a major dependency and version 2022.09.5 or newer is required in order to load the data successfully.