Main

Annotating the complexity of protein variation is important for understanding biological processes, identifying disease states and developing effective therapeutics. Proteoform diversity refers to the vast array of protein variations that can exist because of differences in transcription, translation and post-translational modifications (PTMs), which can occur through enzymatic (such as phosphorylation) and non-enzymatic (for example, spontaneous deamidation) processes5. These variations occur independently and in combination with each other on single protein molecules, creating a ‘PTM code’ that has unique and specific roles in driving biological processes6,7,8. The ability to sequence single protein molecules in their natural, full-length state could elucidate this proteoform diversity and its underlying code. However, current methods, including Edman degradation and mass spectrometry, have difficulty analysing full-length proteins from complex samples and face problems with detection sensitivity, dynamic range, analytical throughput and instrumentation cost1,2. To address these challenges, complementary or potentially disruptive platforms for next-generation protein analysis and sequencing have been proposed, including single-molecule fluorescence labelling and affinity-based approaches9,10,11,12,13. However, these emerging techniques also have limitations compared with nanopore technology14, which has the potential to achieve direct, label-free, full-length protein sequencing15.

Nanopore technology consists of a nanometre-sized pore in an insulating membrane that separates two electrolyte-filled wells16. A voltage applied across the membrane drives ionic current flow through the nanopore sensor. When individual analyte molecules pass through the pore, they can generate a detectable change in the signal. This change can provide insight into the molecular nature of the analyte. Although originally viewed, and now commercialized, as a technique for sequencing nucleic acid strands, nanopore sensing has great potential for protein analysis16,17. It has been used for the discrimination of peptides and proteins18,19,20,21,22,23,24,25,26,27,28, the real-time measurement of protein–protein29 and protein–ligand interactions30, and aptamer-mediated protein detection30,31. Protein nanopores have also shown promise in identifying amino acids and PTMs, such as those involved in phosphorylation and glycosylation32,33,34,35, that serve as important biomarkers of cell states and diseases. Previous studies have demonstrated some ability to read DNA-conjugated peptide strands using DNA-processive molecular motors, such as a helicase or a polymerase36,37,38. Furthermore, rereading of peptide fragments using this strategy has made it possible to distinguish among a small subset of single amino acid substitutions with high accuracy37. Despite this progress, using nanopores to obtain sequence information from intact, full-length proteins has been hindered by the difficulty of driving long protein strands through the sensor, owing to the neutrally charged polypeptide backbone, varying charge states of amino acid side chains, and stable tertiary structures39.

To overcome the problems of reading full-length proteins, here we have developed a technique to reversibly thread long protein strands into a CsgG pore40 using electrophoresis, and then enzymatically pull them back out of the pore using the protein unfoldase and translocase activity of ClpX41. Unlike the rapid initial stage of threading the protein into the pore using electrophoretic force, the unfoldase-mediated translocation of proteins back out of the pore leads to slow, reproducible ionic current signals. This method has resulted in the processive translocation of long proteins, enabling the detection of single amino acid substitutions and PTMs across protein strands up to hundreds of amino acids in length. We have also developed an approach to rereading the same protein strand multiple times. Furthermore, this method enables the unfolding and translocation of a model folded protein domain for linear, end-to-end analysis.

ClpX-mediated protein reading on an array

We and others have previously developed approaches to unfoldase-mediated protein translocation through nanopores42,43,44, but these methods required complex experimental set-ups and did not demonstrate the single amino acid sensitivity required for sequencing. In our previous methodology, the unfoldase and its cofactors were located in the trans-side solution, opposite to the location of the protein substrate addition42,43. This set-up rendered the method incompatible with commercial high-throughput nanopore sensor-array devices, such as MinION from Oxford Nanopore Technologies, which do not allow access to the trans compartment solution. To overcome the need for trans motor addition, we designed a more streamlined two-step process. First, the protein substrate is threaded into the nanopore by electrophoretic force (cis-to-trans). Then, ClpX is added to the cis solution to steadily pull the substrate protein back out of the pore (trans-to-cis) (Fig. 1a).

Fig. 1: Nanopore protein reading using an unfoldase.

a, Schematic of the cis-based unfoldase approach on the MinION platform. The roman numerals correspond to the ionic current states in b. b, Example trace of protein P1. Deep spikes in the capture state are hypothesized to be transient structural fluctuations of the Smt3 domain in the pore. State iii can be discerned from a transient drop in current when the ClpX solution is initially loaded into the flow cell. c, Ensemble traces for protein P1 (blue, n = 34) and mutants P2 (purple, n = 17), P3 (orange, n = 21) and P4 (red, n = 12). Protein sequences are oriented from C to N, with all mutation regions shown in colour.

We first synthesized a protein to evaluate this method. It comprised an unstructured, negatively charged N-terminal sequence of 42 amino acids rich in glycine, serine and aspartic acid, to facilitate electrophoretic capture in the pore, attached to a stably folded domain (Smt3). This was followed by a short, positively charged sequence (an RGS repeat) and a ClpX-binding ssrA tag at the C-terminal end (protein P1; Supplementary Fig. 1). The RGS and the folded domain were included to inhibit complete translocation of the protein through the pore, thereby preserving the accessibility of the ssrA tag in the cis compartment. After introducing P1 into a MinION R9.4.1 flow cell incorporating a CsgG pore variant (Oxford Nanopore Technologies)3 and applying a voltage of −180 mV, we observed current blockades associated with the capture of the negatively charged protein tail in the pores. To test the ability of ClpX to extract the captured protein from the nanopore, we then introduced a buffer solution containing ClpX and ATP into the flow cell. Under these conditions, we observed deep ionic current blockades, characteristic of capture of the substrate protein in the nanopore, returning to the open channel state in a stepwise manner some time after the addition of ClpX. We also determined that these events were ATP dependent and occurred at a slower rate in the presence of ATPγS (Extended Data Fig. 1), an ATP analogue that is more difficult for ClpX to hydrolyse45. These results are consistent with our model that ClpX was binding to the ssrA tag and translocating the captured protein out of the nanopore with C-to-N-terminal directionality.

If this were true, we reasoned that mutations in the tail domain of the protein would induce alterations in the ionic current states observed during ClpX-mediated translocation of the protein through the nanopore. To test this, we synthesized three new proteins (P2, P3 and P4), each containing several tyrosine mutations at distinct positions of the polyGSD sequence (Supplementary Fig. 1). To compare the signal profiles of the four protein sequences directly, we created ensemble ionic current traces for each of these proteins, as shown in Fig. 1c (the individual traces are in Supplementary Fig. 2). This revealed that the main differences across the translocation signals corresponded with the positions of the tyrosine mutations along the protein strands. Moreover, comparing all-versus-all signal dynamic time warping (DTW) distances revealed that the sets of translocation signals generated by each unique protein sequence formed distinct clusters, differentiating them from every other protein. This was statistically supported by permutational multivariate analysis of variance (PERMANOVA) with P < 1 × 10⁻⁶ for each comparison after applying a Bonferroni correction.
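As a rough illustration of the all-versus-all DTW comparison, the following minimal sketch computes a textbook DTW distance and a pairwise distance matrix in plain Python. The toy traces, the absolute-difference local cost and the function names are assumptions made for illustration; the analysis in this study may use different normalization, constraints or software.

```python
def dtw_distance(a, b):
    """Textbook dynamic time warping distance between two 1-D sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local cost
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b held
                cost[i][j - 1],      # b advances, a held
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def pairwise_dtw(traces):
    """Symmetric all-versus-all DTW distance matrix."""
    k = len(traces)
    dist = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            dist[i][j] = dist[j][i] = dtw_distance(traces[i], traces[j])
    return dist


# Hypothetical toy traces: two reads with the same shape but different timing,
# and a third read with a shallower dip.
read_a = [1.0, 1.0, 0.2, 0.2, 1.0]
read_b = [1.0, 0.2, 0.2, 0.2, 1.0, 1.0]
read_c = [1.0, 1.0, 1.0, 0.6, 1.0]

distances = pairwise_dtw([read_a, read_b, read_c])
```

Because DTW absorbs differences in timing, the first two toy reads are at distance zero, whereas the read with the shallower dip remains separated from both. A matrix of this kind is the usual input to clustering and to distance-based tests such as PERMANOVA.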

Resolving ClpX steps and substitutions

After establishing a cis-based ClpX approach, we next sought to investigate the sensitivity of this method to single amino acids as a first step towards developing a long-read protein analysis method. To do this, we designed protein constructs with five repeating sequence blocks, each containing 59 amino acids. These blocks were built with a base sequence of glycine, serine, aspartic acid and glutamic acid (Fig. 2a and Supplementary Fig. 1). We introduced a single unique amino acid mutation at the central position in each block and demarcated the blocks with a double tyrosine mutation at each end. This spacing was intended to avoid entanglement of the signal contributions from single amino acid mutations by preventing them from occupying the pore at the same time. This hypothesis was grounded on prior observations indicating that around 20 amino acids can occupy the CsgG sensing region when in a stretched conformation46. We termed these strategically designed protein constructs ‘proteins for amino acid sequencing through optimized regions’ (PASTORs). We synthesized a total of eight different PASTOR variants, each containing a different sequence of mutations. The PASTOR design allowed us to analyse up to five different mutations in a single nanopore read, and the total set of eight PASTORs (C to N: HDKER, GNQST, FYWCP, AVLIM, VGDNY, TWAFH, PRMQE and KSILC) enabled us to investigate each of the 20 amino acids in two different PASTOR sequence contexts.

Fig. 2: Characterizing single amino acid substitutions and ClpX stepping in PASTORs.

a, PASTOR sequence composition. b, Filtered nanopore current trace of PASTOR–HDKER. Colour boundaries are defined by automated YY segmentation. c, Top, an example PASTOR trace. The red boxes show the manually segmented YY dips. Bottom, the black horizontal lines denote the mean of individual steps. d, Distribution of the mean number of residues per step in each of the YY dips; n = 776 YY dips. e, Step dwell-time distribution. f, Average signal trace for the transformed VRs of each amino acid after Euclidean alignment of all the VRs equidistantly stretched to the same length. The VRs of a charged amino acid are shown as a dashed line (n of VRs and experiments are shown in Extended Data Table 1).

ClpX-mediated analysis of the PASTOR proteins manifested ionic current traces containing repetitive patterns that resulted from the seven YY mutations preceding the return to the open channel state, seen as seven repeated dips in the signal (PASTOR–HDKER is shown in Fig. 2b and the other PASTORs are shown in Supplementary Fig. 3). Between these dips, distinctive and reproducible variations in the ionic current signals were observed, corresponding to the variable amino acid mutation in each block. Using the consistent, substantial effect of the YY mutations, we segregated the current signals into regions termed ‘YY dips’ and ‘variable regions’ (VRs) (Supplementary Fig. 4), and used these patterns to scale and segment our signals (Methods and Extended Data Fig. 2).

A close examination of the YY-dip signals revealed rapid, stepwise changes in the current level, which we reasoned must be caused by single ClpX substrate translocation steps (Fig. 2c). Previous estimates of the step size of these movements vary: single-molecule tweezer experiments suggested that ClpX translocates 5–8 amino acids per step45,46, whereas structural studies on ClpX-related protein-remodelling machines imply 2 amino acids per step47,48. To determine the step size of ClpX in our experiments, we analysed these YY-dip regions using a segmentation algorithm (examples are in Supplementary Figs. 5 and 6), filtering out dips with back-stepping or excessive noise. By dividing the number of residues contributing to the YY dips by the number of steps identified per YY dip (Methods and Extended Data Fig. 3a,b), we determined that ClpX translocates an average of approximately 1.96 residues per step (s.d. = 0.25; Fig. 2d). This was confirmed by a secondary segmentation algorithm yielding a similar mean of around 1.89 residues per step (s.d. = 0.28; Extended Data Fig. 3c–e). The dwell time of each step, capturing the duration ClpX pauses between pulling events, had a mean of 28.6 ms (s.d. = 32.3 ms; Fig. 2e). These results are in strong agreement with the step size of two amino acids hypothesized from the structural studies, and indicate that the tweezer experiments lacked the spatio-temporal resolution to resolve individual ClpX steps.

After establishing the two-residue stepping behaviour of ClpX, we shifted our focus to the VRs to explore the ionic signatures of individual amino acid mutations. Our analysis revealed that in VRs with a neutral amino acid mutation, there was a negative correlation between the ionic current levels and the volume of the amino acid (Fig. 2f and Extended Data Fig. 4a). This observation supports a volume-exclusion model in which larger amino acids block more current than their smaller counterparts do. Interestingly, the VRs containing positively charged residues (K and R) decreased the current level below the baseline sequence, whereas negatively charged residues increased it, diverging from the volume-exclusion model. This effect was greater for negatively charged residues than for positively charged ones. One possible explanation for this is that the negatively charged residues resist translocation to the negatively charged cis compartment, causing the protein strand to stretch and thereby decrease the total volume of protein in the pore. Conversely, a positively charged residue would be attracted to the cis compartment and could introduce upstream kinks in the protein strand, adding to the protein volume in the pore. The impact on signal levels could also be attributed to variations in solvation states and the mobility of ions near the charged amino acids. We hypothesize that some of the variability between VRs of the same amino acid can be attributed to non-enzymatic PTMs, such as asparagine deamidation, which is discussed in Extended Data Fig. 4b–h, Supplementary Note 1 and Supplementary Fig. 7. Collectively, these results show that this method is sensitive to single amino acid residues.

Sequence to nanopore signal modelling

Considering the relationship between the volume and the charge of individual amino acids, and their effect on nanopore signals, we developed a biophysical model designed to simulate nanopore signals from the amino acid sequence of a protein directly. This model, which builds on previous findings46, determines a summation of the volume and charge of amino acids in a moving 20-residue window, applying a centrally positioned negative parabolic weight (Extended Data Fig. 5a–d). Extended Data Fig. 5e shows the signal generated by our model for the PASTOR–TWAFH protein sequence aligned to an actual nanopore trace of the same protein. Model signals for all proteins in this study are shown in Supplementary Fig. 8. To evaluate our model quantitatively, we computed the DTW distance between the actual ionic current traces and the model, and compared that with the distribution of distances between the actual ionic current trace and the model of random sequences. We found that, on average, the model of the true sequence ranked in the top 0.3% of the best matches (Extended Data Fig. 5f), which confirms that the signal agreement observed in Extended Data Fig. 5e was not caused by artefacts from DTW alignment. This reinforces the assertion that our model has the capacity to simulate these current traces accurately in these sequence contexts.
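The windowed summation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the fitted model from this study: the residue volumes are commonly tabulated approximate values, the two coefficients are arbitrary placeholders, the weight shape is one possible reading of a centrally positioned negative parabola, and `simulate_signal` and `parabolic_weights` are hypothetical names.

```python
# Approximate residue volumes in cubic angstroms (illustrative inputs only).
VOLUME = {
    "G": 60.1, "A": 88.6, "S": 89.0, "C": 108.5, "D": 111.1, "P": 112.7,
    "N": 114.1, "T": 116.1, "E": 138.4, "V": 140.0, "Q": 143.8, "H": 153.2,
    "M": 162.9, "I": 166.7, "L": 166.7, "K": 168.6, "R": 173.4, "F": 189.9,
    "Y": 193.6, "W": 227.8,
}
# Side-chain charge near neutral pH; all other residues are treated as neutral.
CHARGE = {"D": -1, "E": -1, "K": 1, "R": 1}

WINDOW = 20  # residues assumed to contribute to the signal at any one time


def parabolic_weights(window=WINDOW):
    """Downward-opening parabola peaked at the window centre, normalized to 1."""
    centre = (window - 1) / 2
    raw = [1 - ((k - centre) / (centre + 1)) ** 2 for k in range(window)]
    total = sum(raw)
    return [w / total for w in raw]


def simulate_signal(sequence, volume_coeff=-1.0, charge_coeff=-80.0):
    """Return one model level (arbitrary units) per window position.

    With the placeholder coefficients, larger volume lowers the level,
    negative charge raises it and positive charge lowers it.
    """
    weights = parabolic_weights()
    signal = []
    for start in range(len(sequence) - WINDOW + 1):
        level = 0.0
        for w, aa in zip(weights, sequence[start:start + WINDOW]):
            level += w * (volume_coeff * VOLUME[aa] + charge_coeff * CHARGE.get(aa, 0))
        signal.append(level)
    return signal


# Hypothetical background with a single substitution at position 30.
background = "GS" * 30
with_tyrosine = background[:30] + "Y" + background[31:]
with_aspartate = background[:30] + "D" + background[31:]
```

In this sketch a single tyrosine produces a dip and a single aspartate produces a rise relative to the background, mirroring the qualitative trends described for neutral and negatively charged residues. A symmetric charge coefficient does not reproduce the larger effect observed for negatively charged residues, which would need separate terms.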

Building a single-molecule aminocaller

Sequencing synthetic protein constructs such as PASTORs could serve diverse technological applications, including protein barcoding46,49. We addressed this by initially training machine learning models to identify the single mutation present in a VR. This process consisted of filtering and scaling each of the raw signal traces, followed by segmentation of the VR signal regions (Fig. 3a). To featurize the VR signals, we used a combination of manually curated features and DTW-distance features (Methods). We next explored several classical and deep machine learning models and found that random forests most frequently achieved the highest accuracy. All classification analyses were then done with a hyperparameter-tuned random forest evaluated on a fixed held-out test set, unless otherwise specified. We first evaluated the discrimination of all pairs of amino acids (Fig. 3b and Extended Data Fig. 6a). Pairs of amino acids with dissimilar volumes, or in which one was negatively charged, achieved the highest accuracy classifications. For example, tyrosine versus aspartate exhibited 100% discrimination accuracy. Some pairs of amino acids, such as leucine and isoleucine, proved to be more challenging, owing to their inherent physico-chemical similarities. Amino acids with high variance signals, such as cysteine, were also more difficult to distinguish from others. We then moved to training models to classify among particular sets of three amino acids (for example, G, Y and D) in which the model achieved 95% single-read accuracy. Expanding this to 5-way classifications (for example, G, V, W, R and D), the model maintained high performance, achieving an accuracy rate of 86% (Extended Data Table 2 and Extended Data Fig. 6b). In the most challenging task, a 20-way amino acid classification, our top-performing model substantially outperformed a dummy classifier, obtaining an accuracy of 28% compared with just 5.5% for the dummy. When we considered top-N accuracy measurements, our model attained 67% accuracy for top-5 and 81% for top-8 accuracy in the 20-way classification task (Extended Data Fig. 6c).
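To make the top-N metric concrete, the sketch below counts a read as correct when its true class is among the N highest-scoring classes. The function name, the three-class example and the probability values are hypothetical; in practice the per-class scores would come from a trained classifier such as a random forest.

```python
def top_k_accuracy(probabilities, true_labels, classes, k):
    """Fraction of reads whose true class is among the k highest-scoring classes."""
    hits = 0
    for probs, truth in zip(probabilities, true_labels):
        # Rank class indices from highest to lowest score.
        ranked = sorted(range(len(classes)), key=lambda i: probs[i], reverse=True)
        top = {classes[i] for i in ranked[:k]}
        hits += truth in top
    return hits / len(true_labels)


# Hypothetical per-read class probabilities for a 3-way task.
classes = ["G", "Y", "D"]
probabilities = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.3, 0.3, 0.4],
    [0.5, 0.1, 0.4],
]
true_labels = ["G", "G", "D", "D"]

top1 = top_k_accuracy(probabilities, true_labels, classes, k=1)
top2 = top_k_accuracy(probabilities, true_labels, classes, k=2)
```

For context, a uniformly random ranking over 20 classes would give top-5 and top-8 accuracies of 25% and 40%, respectively.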

Fig. 3: Single-molecule nanopore sequencing of single amino acid mutations.

a, Pipeline for PASTOR VR classification with machine learning (ML) models. b, Heatmap showing test accuracies in discriminating between all pairs of amino acid VR mutations, averaged over five random forests (n for VRs and experiments is shown in Extended Data Table 1). c, Example sequencing traces in the test set, for PASTOR-HDKER. Transformed ionic current traces are plotted with a box around the variable regions defined by the YY segmenter. The colour intensity of the boxes represents the ranking of the true class in the aminocaller’s prediction for each VR. For the 5-way classification task (top box shading), the classes are the five mutations found in that protein, whereas the 20-way classification task (bottom box shading) considers all possible amino acid classes. In each box, the letter corresponds to the model’s top prediction. Darker shades denote a more-accurate prediction, indicating that the correct label ranked high in the model’s predictions.

Building on these results, we integrated our classifiers downstream of the PASTOR segmenter to develop an end-to-end PASTOR ‘aminocaller’. We then amino-called a set of PASTOR reads from the classification test set (Fig. 3c and Supplementary Fig. 9). Overall sequencing accuracy per read averaged about 62% and 42% for the HDKER sequence, and roughly 51% and 21% for the AVLIM sequence, using 5-way and 20-way classification models respectively.

Unfoldase slip sequence enables rereading

After developing the aminocaller, we aimed to improve the accuracy of our single-molecule sequencing approach by developing a method to reread single protein molecules. A previous study suggested that ClpX may have difficulty gripping particular polypeptide sequences, such as polyproline, on which the ClpXP complex showed slow degradation rates41. This prompted us to hypothesize that incorporating a ‘slippery’ amino acid sequence near the N terminus of a PASTOR would induce ClpX to momentarily lose its grip of the strand (Fig. 4a). Consequently, the substrate protein would be free to rethread into the pore by electrophoresis. Rethreading would stop enzyme-mediated translocation, and it would resume once ClpX regains its grip on the substrate. To test this strategy, we constructed a new PASTOR (PASTOR-reread) with two important sequence features: first, a proline-rich ‘slip’ sequence repeat (EPPPP)₅ positioned near the N terminus; and second, VRs separated by an increasing number of tyrosine residues, ranging from two to five (Supplementary Fig. 1). We reasoned that the distinct current levels that our signal-to-sequence model predicted for each of the repeats would enable us to estimate the slip distance (Supplementary Fig. 8). Indeed, nanopore signals produced by PASTOR-reread generally exhibited repeated signal patterns that closely aligned with our model’s prediction before returning to the open channel state (Fig. 4b and Supplementary Fig. 10). By using the tyrosine repeat regions as a measure of slip length, we observed that slipping distances were usually either short ranges (50–100 amino acids) or extended across the entire PASTOR unstructured region (more than 300 amino acids), accounting for roughly 40% and 30% of all rereads, respectively (Extended Data Fig. 7a). The effect of ClpX concentration on slipping distance and frequency is discussed in Supplementary Note 2 and Extended Data Fig. 7b,c.

Fig. 4: Rereading single protein molecules multiple times with an unfoldase slip sequence.

a, Working model of rereading. b, Top box shows example trace of PASTOR-reread showing three almost-complete reread events (blue trace). Our model’s predicted signal for the PASTOR-reread sequence (pink trace) was aligned to each reread. The fourth VR contains an asparagine mutation, but the corresponding signal level consistently resembles aspartate in all three instances of this PASTOR-reread trace. The modelled sequence was changed to contain an aspartate to reflect the putative PTM. Bottom box shows the approximate region of the strand that is in the nanopore over time.

We next investigated the potential of single-molecule rereading for increasing sequencing accuracy. Using a simulation discussed in Supplementary Note 3, the accuracy for the 20-way amino acid classification task improved from 28% to 61% (compared with a 5% random baseline) with 10 rereads (Extended Data Fig. 7d). Likewise, the accuracy for a 7-way classification task improved from 66% to 99% (against a 14% random baseline).
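One simple way to see why rereads can raise accuracy is to combine the classifier output from each reread of the same VR into a consensus call. The sketch below sums log-probabilities across rereads, which treats the rereads as independent observations. This is an assumed combination rule chosen for illustration and is not necessarily the procedure used in Supplementary Note 3; the function name and the probability values are hypothetical.

```python
import math


def consensus_call(reread_probs, classes, floor=1e-6):
    """Combine per-reread class probabilities by summing their logarithms.

    `floor` guards against log(0) when a reread assigns zero probability.
    """
    totals = [0.0] * len(classes)
    for probs in reread_probs:
        for i, p in enumerate(probs):
            totals[i] += math.log(max(p, floor))
    best = max(range(len(classes)), key=lambda i: totals[i])
    return classes[best]


# Hypothetical classifier outputs for three rereads of one variable region.
classes = ["G", "V", "W"]
rereads = [
    [0.40, 0.45, 0.15],  # on its own, this reread would be called V
    [0.50, 0.30, 0.20],
    [0.45, 0.35, 0.20],
]

single_read_call = consensus_call(rereads[:1], classes)
three_read_call = consensus_call(rereads, classes)
```

In this toy case the first read alone is called V, whereas the consensus over three rereads is G, showing how weak but consistent evidence accumulates across passes.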

Simulating protein barcode sequencing

Having determined the capacity for high-accuracy sequencing through PASTOR rereading and the ability to design PASTOR proteins with customizable VR sequences, we simulated the PASTOR VR sequence space with varying constraints, with a view to applications in protein barcoding. On the basis of the accuracy rates of our models (Extended Data Fig. 7d), we computed the number of distinct barcodes that could be generated at a given accuracy level. This calculation considered varying degrees of rereading and two different VR segment numbers per protein barcode (five and ten VRs). For example, our findings indicate that with ten VRs and ten rereads, it is feasible to generate libraries of more than 1 million or 1 billion unique PASTOR barcodes that are decodable with a single-molecule accuracy of more than 95% or more than 81%, respectively (Extended Data Fig. 7e and Supplementary Table 1).
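The scale of these libraries can be checked with simple counting. The sketch below assumes that every combination of a k-letter amino acid alphabet across n VRs is a valid barcode and that decoding errors are independent between VRs; both are simplifying assumptions made here, and the function names are hypothetical. The actual alphabets and accuracies are those reported in Extended Data Fig. 7e and Supplementary Table 1.

```python
def barcode_count(alphabet_size, n_vrs):
    """Number of distinct barcodes if every VR can take any letter."""
    return alphabet_size ** n_vrs


def min_alphabet(target, n_vrs):
    """Smallest alphabet size whose barcode count exceeds `target`."""
    k = 1
    while barcode_count(k, n_vrs) <= target:
        k += 1
    return k


def required_per_vr_accuracy(barcode_accuracy, n_vrs):
    """Per-VR accuracy needed if VR errors are independent (assumption)."""
    return barcode_accuracy ** (1 / n_vrs)


million_alphabet = min_alphabet(10**6, n_vrs=10)   # 4, since 4**10 = 1,048,576
billion_alphabet = min_alphabet(10**9, n_vrs=10)   # 8, since 8**10 = 1,073,741,824
per_vr_for_95 = required_per_vr_accuracy(0.95, 10)
per_vr_for_81 = required_per_vr_accuracy(0.81, 10)
```

Under these assumptions, ten VRs need only a 4-letter alphabet to exceed 1 million barcodes and an 8-letter alphabet to exceed 1 billion, with per-VR accuracies of roughly 99.5% and 97.9% needed to reach whole-barcode accuracies of 95% and 81%.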

Monitoring and mapping enzymatic PTMs

Demonstrating the ability to detect and map phosphorylations across long protein strands would be an important step towards developing a technology capable of identifying distinct full-length proteoforms. To do this, we focused on two serine/threonine protein kinases with distinct recognition motifs: protein kinase A (PKA), which recognizes the canonical motif RRXS, and casein kinase II (CKII), which targets the sequence SXXD/E50. To see if we could effectively characterize the differential activity of these two kinases using our nanopore reading approach, we designed a new substrate, PASTOR-phos (Fig. 5a and Supplementary Figs. 1 and 11). In this design, we inserted the substrate peptide LRRASLG (‘kemptide’) of PKA into one of the VRs, making it specific for recognizing the kinase51. To investigate CKII in PASTOR-phos, we used the original 29-amino acid linker sequences from the PASTOR, which inherently contain a CKII motif, while reducing the total number of VRs.

Fig. 5: Single-molecule mapping of kinase phosphorylation activity.

a, Traces of PASTOR-phos, in which each section (C- and N-terminal linkers, VR V, VR A and VR GLSARRL) is aligned to the lowest DTW-distance phosphorylation state model (Supplementary Fig. 12). Phosphorylation is indicated by the letter ‘P’ in a red circle. YY dips (denoted by pale grey boxes) are aligned to the model of a YY dip; n for traces and experiments is provided in Extended Data Table 3. b, Ensemble traces of VR GLSARRL for each condition. c, Relative frequency for each condition of molecules best matching each proteoform (Supplementary Table 3). CKII conditions are stacked. The phosphorylation count of proteoforms is shown above the bars (proteoform ID1 contains no phosphorylations).

After incubating PASTOR-phos with PKA for 1 h, we performed nanopore analysis, which found a substantial increase in ionic current in 91 of the 92 reads (98.9%) in the kemptide VR compared with the baseline (Fig. 5b and Extended Data Fig. 8a). This increase in ionic current is consistent with expectations for the negatively charged phosphoserine, which carries a charge of −2, to enhance ionic flow. Conversely, 361 of the 368 non-kemptide VRs and linker sequences (98.1%) showed no substantial signal changes. These results are consistent with PKA activity being specific to the RRXS motif.

When the same substrate (PASTOR-phos) was treated with CKII for 1 h, we observed high read-to-read variability manifested by large increases in current levels that were found to be concentrated in the eight linker sequences containing the CKII phosphorylation motif (Fig. 5a). The maximum peak transformed current of the VRs and linker sequences incubated in CKII showed a significant increase compared with that of the PKA and no-kinase (blank) incubation conditions (Extended Data Fig. 8a). This indicates that the method can discern site-specific phosphorylation events, demonstrating specificity to CKII. Interestingly, a small portion of linkers had signal increases much higher than the others, indicating that they were being phosphorylated to a greater extent (Fig. 5a and Extended Data Fig. 8a). Analysis of the phosphorylated linker sequence revealed that phosphorylation at the initial motif induces the formation of a secondary CKII motif, SXXpS, which has been described previously50. We hypothesized that the linkers with much higher signal levels were phosphorylated at both serine positions. To test this, we reasoned that extending the incubation time of CKII with the substrate should increase the frequency of both single and, consequently, double phosphorylation events. Supporting this hypothesis, data from PASTOR-phos after a 26-h incubation revealed increased occurrences of both putative single and double phosphorylations in the linker sequences (Extended Data Fig. 8b,c).

Fig. 6: Processive reading of folded protein domains.

a, Working model of ClpX-mediated processing of folded proteins. The roman numerals correspond to the ionic current states in b. b, Example trace of PASTOR-titin. c, Example traces of titin translocation (state vii), with black horizontal lines denoting the mean of individual putative ClpX steps, found with the Bayesian segmentation algorithm (Methods).

Given the abundance of potential CKII phosphorylation sites in the PASTOR-phos sequence, numerous combinatorially unique proteoforms are possible (a total of 13,122). To map our reads to these various modified forms, we integrated phosphoserine into our sequence-to-signal model (Methods). This approach allowed us to align nanopore traces with the predicted sequence-to-signal profiles for each phosphorylation state across all VRs and linkers, as shown in Supplementary Fig. 12. Consequently, we identified and quantitatively assessed more than 100 distinct full-length proteoforms of PASTOR-phos, across reads obtained from the baseline, PKA and CKII experiments (Fig. 5c, Supplementary Fig. 13 and Supplementary Tables 2 and 3). For example, the 26-h CKII incubation resulted in single molecules containing as many as nine phosphorylated residues.
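One decomposition that reproduces the stated total of 13,122 is sketched below. It assumes that each of the eight CKII-motif linkers can carry zero, one or two phosphates and that the kemptide serine is either unmodified or phosphorylated. This breakdown is an inference from the counts in the text rather than a stated definition, and the constant names are hypothetical.

```python
from itertools import product

N_LINKERS = 8              # linkers containing the CKII motif (from the text)
LINKER_STATES = (0, 1, 2)  # phosphates per linker: none, single, double (assumed)
KEMPTIDE_STATES = (0, 1)   # kemptide serine unmodified or phosphorylated (assumed)

# Each proteoform is a tuple: (kemptide state, linker 1 state, ..., linker 8 state).
proteoforms = list(product(KEMPTIDE_STATES, *[LINKER_STATES] * N_LINKERS))

n_proteoforms = len(proteoforms)                        # 2 * 3**8
max_phosphates = max(sum(form) for form in proteoforms)  # 1 + 2 * 8
print(n_proteoforms, max_phosphates)
```

Under this assumed breakdown a fully modified molecule would carry 17 phosphates.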

Processive reading of folded domains

Progressing beyond synthetic, unstructured sequences, we next evaluated the effectiveness of our cis-based unfoldase method on protein sequences that contain a folded domain. For this purpose, we analysed a PASTOR protein with the titin I27V15P domain, which consists of 89 amino acids arranged into 8 β-strands52, inserted into the third VR position (PASTOR-titin; Supplementary Fig. 1). Unlike unstructured proteins, nanopore traces of PASTOR-titin yielded an initial two-step electrophoretic nanopore capture state, indicating that the folded titin domain was first captured on the nanopore at state ii and then electrically unfolded at transition state iii to produce the typical PASTOR capture signal at state iv, manifested by the Smt3 domain on the pore (Fig. 6a and Extended Data Fig. 9a). After adding ClpX to the cis compartment, we observed a translocation signal corresponding to the leading VRs and YY regions, which are tethered to the C terminus of the titin domain (state v). Subsequently, we observed a distinct and deep blockade state that we interpreted as ClpX attempting to unfold the titin domain (state vi), which presumably refolds in the trans compartment after the initial translocation. This deep blockade state often reverted to the previous state, indicating that the ClpX unfolding attempt was unsuccessful and that ClpX slipped back on the protein strand52,53. After a successful unfolding attempt, we observed putative translocation of the titin domain (state vii; Fig. 6b and Supplementary Fig. 14). In the titin translocation signal, individual ClpX steps could be seen (Fig. 6c). After translocation of the titin domain, we observed characteristic signal features corresponding to the downstream VRs and YY regions (state viii) before transitioning back to an open-pore state (state ix).

To confirm our understanding of the unfolding and translocation states, we performed experiments using a variant of titin I27 (PASTOR-D-titin) with a destabilized tertiary structure, introduced through double-point mutations (C47E and C63E) on two buried cysteines43,52. Comparing PASTOR-D-titin (Extended Data Fig. 9b) with PASTOR-titin allowed us to explore the effect of the tertiary structure of titin on the resulting current signals. This generated two notable differences: first, PASTOR-D-titin displayed unique signal features at the putative unfolding state vi, indicating structural disparities between the two variants; and second, states v and vi were observed only once in PASTOR-D-titin before the presumptive translocation state vii, in contrast to PASTOR-titin in which they were typically observed multiple times, leading to a substantial difference between the distribution of the unfolding times of PASTOR-titin and PASTOR-D-titin (Extended Data Fig. 9c). These differences can be attributed to the more stable, unfolding-resistant titin domain of PASTOR-titin compared with that of PASTOR-D-titin. In PASTOR-titin, repeated observations of states v and vi, which were not present for PASTOR-D-titin, support the conclusion that they result from unsuccessful unfolding attempts and ClpX back-slipping events triggered by the stable titin domain. Also, despite their dissimilar structural stabilities, PASTOR-titin and PASTOR-D-titin demonstrated similar signals during the putative translocation state vii (Extended Data Fig. 9d–f). This similarity reflects their nearly identical primary amino acid sequences and indicates that the primary sequence is the main determinant of the translocation signal through the nanopore, whereas structural variations have greater influence on the preceding unfolding state.

We next tested PASTOR constructs with the amyloid-β protein 1–42 (PASTOR-Aβ42) and its shorter derivative 1–15 (PASTOR-Aβ15), which have distinct amino acid sequences and lengths from the titin domain. We reasoned that Aβ42 and Aβ15 would generate brief unfolding states because they are partly but not fully structured in their monomeric forms54. As expected, after nanopore analysis they yielded ionic current traces similar to PASTOR-D-titin overall but with distinct features in unfolding state vi (Extended Data Fig. 9g and Supplementary Fig. 14). Furthermore, comparing their putative translocation states (state vii) using DTW distance, we observed that the signals generated by PASTOR-Aβ42 and PASTOR-Aβ15 share similarities to, but are distinct from, signals generated by PASTOR-titin and PASTOR-D-titin (Extended Data Fig. 9d–f), reflective of the translocation state being dependent on protein primary sequence. Overall, the dwell times of these different states also correspond well with their respective sequence lengths across all the PASTOR proteins, indicating a translocation rate of around 63 amino acids per second (an average dwell time of about 16 ms per amino acid) (Extended Data Fig. 9h). This is close to previous estimates of ClpX translocation speed and the observation that the rate of ClpX-mediated protein translocation is relatively constant regardless of protein sequence43,55.
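As a quick consistency check on these figures, the translocation rate and the per-residue dwell time are reciprocals of each other, and the rate gives a rough expected dwell time for a segment of known length. The sketch below assumes a constant rate, as the observations above suggest; the function names are illustrative and not part of the analysis pipeline.

```python
# Back-of-the-envelope conversion between translocation rate and dwell time.
# Assumes a constant rate of ~63 amino acids per second, as estimated above.

RATE_AA_PER_S = 63.0


def dwell_ms_per_residue(rate_aa_per_s: float) -> float:
    """Average time (ms) spent per amino acid at a given rate."""
    return 1000.0 / rate_aa_per_s


def expected_dwell_s(n_residues: int, rate_aa_per_s: float = RATE_AA_PER_S) -> float:
    """Expected translocation dwell time (s) for a segment of n_residues."""
    return n_residues / rate_aa_per_s


print(round(dwell_ms_per_residue(RATE_AA_PER_S), 1))  # 15.9
print(round(expected_dwell_s(89), 2))                 # 1.41 (89-residue titin domain)
```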

Finally, we assessed our predictive model using these proteins. Because the model does not factor in the signal features linked with unfolding, we analysed the signal segment after the unfolding state until the completion of the translocation (states vii–viii). Using the same comparison technique as previously implemented for the PASTOR protein models, we found that the average model of the PASTORs containing folded domains ranked in the top 0.04% of the best matches (Extended Data Fig. 9i). This evaluation is evidence that our model can adequately simulate these current traces in the specified sequence contexts.

Discussion

We have introduced a new approach for single-molecule reading of long protein strands using nanopores and an unfoldase motor protein. This method achieves single-amino-acid sensitivity and demonstrates the capability to reread and sequence amino acid substitutions in defined sequence contexts across long protein strands. This could immediately advance protein barcoding technology, as we project the ability to design libraries of synthetic peptide sequences (more than 1 billion). Moreover, we applied our method to detect and map the activities of distinct kinases, achieving site-specific detection of enzymatic PTMs along extended protein sequences and the relative quantification of more than 100 putative proteoforms of a single synthetic protein substrate. The implications and challenges of PTM detection are outlined in Supplementary Discussion 1.

We also established that ClpX translocates proteins through the nanopore in a stepwise manner, in agreement with structural studies that suggested a fundamental step size of two amino acids. We then laid the groundwork for a biophysical model capable of simulating nanopore signals that are generated when individual protein sequences are pulled through the nanopore by the unfoldase. This result could eventually enable a ‘lookup table’ approach reminiscent of mass spectrometry, facilitating full-length, single-molecule protein identification and fingerprinting (Supplementary Discussion 2). However, further work is needed to explore more native protein sequences. Furthermore, directly determining amino acid sequences from the nanopore signal (de novo sequencing) will be even more difficult and will require extensive datasets for training de novo amino acid callers.

Finally, we have demonstrated the full-length reading of a model folded protein domain, which is an important result as we move towards reading natural protein molecules. In the present system, electrophoretic protein unfolding (cis to trans; state iii in Fig. 6a) and ClpX-mediated protein unfolding (trans to cis; state vi in Fig. 6a) are key to achieving full-length folded-domain analysis. It is likely that some protein domains will exhibit greater resistance to unfolding than the substrates explored in this study. In such cases, additional strategies could be used to facilitate unfolding, such as the use of denaturants and electro-osmotic flow56. As we begin to turn our attention to natural proteins, this methodology will require synthetic N- and C-terminal sequences that can be appended using existing terminus-specific chemical-conjugation techniques39,57,58 (Supplementary Discussion 3).

One factor that currently impedes data collection throughput is the two-step flow cell loading process. Looking ahead, we can imagine a system that operates continuously, with the unfoldase prebound to the protein analyte but prevented from initiating unfolding activity until the protein strand is captured by the pore. This method, which mirrors strategies developed for the nanopore sequencing of DNA59, has the potential to greatly increase throughput. There are challenges with experimental yield (Extended Data Fig. 10 and Supplementary Discussion 4) and read-out accuracy, particularly for domains with secondary structure (Supplementary Discussion 5), and these can be addressed in future work. In conclusion, this work serves as a stepping stone towards full-length protein identification, capable of achieving the highest level of proteoform resolution. Furthermore, it promises immediate advances, particularly in the context of protein barcoding and PTM-monitoring applications.

Methods

Expression and purification of proteins

Plasmids for analyte proteins were constructed using gBlocks (Integrated DNA Technologies) inserted into the pET–49b(+) plasmid (Novagen), with a dihydrofolate reductase domain, a polyhistidine tag and a TEV cleavage site upstream of the sequence encoding an analyte protein. The NEBuilder HiFi DNA assembly and Q5 site-directed mutagenesis kits (New England Biolabs) were used for plasmid construction. Cloning was done using NEB 5-α-competent Escherichia coli cells. Plasmid sequences were verified by Sanger sequencing through Genewiz. Protein expression was induced overnight at 30 °C with BL21 (DE3) E. coli cells in Overnight Express Instant TB medium (Novagen). Proteins were purified by immobilized metal affinity chromatography (IMAC) with TALON metal affinity cobalt resin and its associated buffer set (Takara), following the manufacturer’s instructions. Proteins were cleaved with TEV protease (New England Biolabs) and further purified by reverse IMAC. Purified proteins were concentrated using ultracentrifugal filters with a 10 kDa cutoff (Amicon) and stored in the short term at 4 °C or in the long term at −80 °C until use.

A covalently linked hexamer of an N-terminal truncated ClpX variant (ClpX-ΔN6)60 was prepared using the BLR E. coli strain as described previously43. In brief, cells were grown to an optical density at 600 nm (OD600) of around 0.6 in LB medium and then incubated in the presence of 0.5 mM isopropyl β-D-1-thiogalactopyranoside (IPTG) at 23 °C for about 3 h to induce ClpX expression. ClpX was purified by IMAC and anion-exchange chromatography. Purified ClpX was stored at −80 °C in small aliquots until use. ClpP expression was induced at an OD600 of around 0.6 with 0.5 mM IPTG at 30 °C for about 3 h43. ClpP was purified by IMAC and stored at −80 °C until use.

PTM assays

For asparagine deamidation, protein (around 1 mg ml−1) was incubated overnight in 100 mM sodium bicarbonate buffer (pH 9.6) at 25 °C to catalyse deamidation. For protein phosphorylation with kinase, protein was incubated with either 50,000 units per ml PKA (New England Biolabs) or 10,000 units per ml CKII (New England Biolabs) in a protein kinase buffer (10 mM MgCl2, 0.1 mM EDTA, 2 mM DTT, 0.01% Brij 35, 260 µM ATP and 50 mM Tris-HCl, pH 7.5) at 30 °C. The protein solution was used for nanopore analysis immediately after the incubation without purification.

MinION experiments

All the experiments were done on the MinION platform using R9.4.1 flow cells. Run conditions were set with a custom MinKNOW script (available from Oxford Nanopore Technologies) at a temperature of 30 °C and a constant voltage of −140 mV with a 3 kHz sampling frequency, except for initial proteins P1–P4, for which runs were performed at a constant voltage of −180 mV with a 10 kHz sampling frequency. Using the priming port, flow cells were first washed with 1 ml cis running buffer (200 mM KCl, 5 mM MgCl2, 10% glycerol and 25 mM HEPES–KOH, pH 7.6) and then loaded with 200 μl protein analyte in cis running buffer at a final concentration of 500 nM, unless otherwise specified. Following the observation of protein captures in the pores, flow cells were washed with 1 ml cis running buffer to remove uncaptured proteins and subsequently loaded with 75 μl cis running buffer supplemented with 4 mM ATP and 200 nM ClpX-ΔN6 unless otherwise specified. The flow cell was washed about 4 min after analyte loading in the initial method, and around 6 min and 2 min after analyte loading at concentrations of 5 nM and 500 nM, respectively, in the optimized method (Extended Data Fig. 10a). For MinION runs in the high-salt condition (Extended Data Fig. 6b), a buffer containing 400 mM KCl, 5 mM MgCl2 and 25 mM HEPES–KOH (pH 7.6) was used instead of standard cis running buffer to see if it would improve the signal-to-noise ratio.

Bulk degradation assays

The time-course degradation assay of the PASTOR-HDKER protein was performed in cis running buffer with 6 μM PASTOR-HDKER, 150 nM ClpX-ΔN6, 300 nM ClpP14 and an ATP-regeneration mix (4 mM ATP, 16 mM creatine phosphate and 7 units per ml creatine phosphokinase) at 30 °C. Incubation was stopped by denaturing samples in Laemmli buffer at 95 °C for 5 min. Samples were run on SDS–PAGE and stained with Coomassie blue to quantify the protein bands using the ImageJ software.

Nanopore signal analysis

Preprocessing

To help identify ClpX-mediated protein translocations, we established detection thresholds using specific statistical parameters (standard deviation, median value, standard deviation of the mean of windows, and the ratio of values relative to the open pore value) indicative of translocation to ionic current blockades preceding a return to the open channel state. This analysis was used to assist the process of manually checking traces for translocations, and translocations with particularly high noise or disruptions were discarded. PASTOR proteins were auto-segmented as described below, with the exception of those containing folded domains and PASTOR-rereads, which were segmented manually. PASTOR-reread rereads with a complete Y2–Y3–Y4–Y5–Y2 signal were assumed to be full-length reads with a back-slipping distance of 310 amino acids. Partial rereads missing the signal(s) of the C-terminal Y2, Y3, Y4 and Y5 were assigned to have back-slipping distances of 250, 188, 125 and 61 amino acids, respectively. All figures with raw traces (those shown in pA) had a low-pass Bessel filter applied using SciPy with N = 10 and Wn = 0.025, except for those showing stepping analysis (Figs. 2c and 6c, Extended Data Fig. 3 and Supplementary Figs. 5 and 6), which had Wn = 0.7. Before use in data analysis, traces were smoothed by applying a low-pass Bessel filter with N = 10 and Wn = 0.03 with SciPy, and by applying average downsampling by a factor of 50 for proteins P1–4, 20 for the 8 PASTORs and 10 for the other proteins. Then, to scale, the segment was split into tenths, and the median of the minima of each tenth and the median of the maxima of each tenth were used as the min and max, respectively, to perform min–max scaling (Extended Data Fig. 2b). For PASTOR-phos, the signals were iteratively scaled. We first used this approach, then DTW-aligned traces to two canonical presegmented traces and selected the alignment with the lowest DTW distance. 
The maximum value of the N-terminal VR was multiplied by 1.4 and the maximum value of VR GLSARRL was multiplied by 1.2; the smaller of these two scaled maxima was used as the maximum for min–max scaling. This was repeated after realigning to the canonical traces and segmenting the VRs. Unless otherwise specified, ‘normalized’ refers to z-score normalization, as in ‘normalized current’ when comparing a model signal with experimental signals.
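The downsampling and tenth-based min–max scaling described above can be sketched as follows. This is a minimal illustration rather than the analysis code itself: the low-pass Bessel filtering step is omitted, the function names are hypothetical, and the handling of a trailing partial block and of segment boundaries are assumptions.

```python
from statistics import median


def average_downsample(signal, factor):
    """Average consecutive, non-overlapping blocks of `factor` samples.

    A trailing partial block is dropped (an assumption, not stated in the text).
    """
    n_blocks = len(signal) // factor
    return [sum(signal[i * factor:(i + 1) * factor]) / factor
            for i in range(n_blocks)]


def tenths_min_max_scale(signal):
    """Min-max scale using the median of per-tenth minima and maxima.

    Taking medians across tenths makes the scaling bounds less sensitive
    to a single spike than the global minimum and maximum would be.
    """
    n = len(signal)
    if n < 10:
        raise ValueError("need at least 10 samples to split into tenths")
    bounds = [k * n // 10 for k in range(11)]
    chunks = [signal[bounds[k]:bounds[k + 1]] for k in range(10)]
    lo = median(min(c) for c in chunks)
    hi = median(max(c) for c in chunks)
    if hi == lo:
        raise ValueError("flat signal cannot be scaled")
    return [(x - lo) / (hi - lo) for x in signal]
```

Because the bounds are medians rather than extremes, scaled values can fall outside the interval from 0 to 1 where a tenth contains an outlier.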

Signal alignment

To align signals, we used DTW61 and normalized the DTW distances by dividing by the sum of the lengths of the two signals. To describe the similarity of a set of traces, we computed the DTW distance between all pairs of traces. In t-distributed stochastic neighbor embedding (t-SNE) plots, we then clustered each trace on the vector of its DTW distances to all other traces. To create ensemble traces, we first identified the trace with the lowest mean DTW distance to all other traces and stretched it to create $T_{\mathrm{medoid}} = [t_1, t_2, \ldots, t_n]$, where $n$ is the mean length of all traces. We then DTW-aligned every other trace to $T_{\mathrm{medoid}}$ and created $T_{\mathrm{consensus}} = [\mathrm{median}(\text{alignments to } t_1), \mathrm{median}(\text{alignments to } t_2), \ldots, \mathrm{median}(\text{alignments to } t_n)]$. Ensemble traces in Fig. 1c, Fig. 5b and Extended Data Fig. 9d show all traces aligned to $T_{\mathrm{consensus}}$, but do not plot $T_{\mathrm{consensus}}$ itself.
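The length-normalized DTW distance and the medoid selection described above can be illustrated with a small pure-Python sketch. The absolute-difference local cost and the unconstrained warping path are assumptions, and the function names are hypothetical; the implementation cited in the text may differ in these details.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW with an absolute-difference cost."""
    if not a or not b:
        raise ValueError("both signals must be non-empty")
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)          # row 0 of the cost matrix
    for x in a:
        curr = [inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            curr[j] = abs(x - y) + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[len(b)]


def normalized_dtw(a, b):
    """DTW distance divided by the sum of the two signal lengths."""
    return dtw_distance(a, b) / (len(a) + len(b))


def medoid_index(traces):
    """Index of the trace with the lowest mean normalized DTW distance
    to all other traces."""
    def mean_distance(i):
        others = [normalized_dtw(traces[i], t)
                  for k, t in enumerate(traces) if k != i]
        return sum(others) / len(others)
    return min(range(len(traces)), key=mean_distance)


traces = [[0, 0, 0], [1, 1, 1], [5, 5, 5]]
print(normalized_dtw(traces[0], traces[1]))  # 0.5
print(medoid_index(traces))                  # 1
```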

Protein sequence-to-signal model

To describe the amino acids, we used their volumes62 and their charges at pH 7.6, at which the histidine residue is assumed to be neutral. The volume of phosphoserine was estimated as 126.6 cm³ mol⁻¹, on the basis of a linear regression of molecular weight to volume of the other residues. The model signal, $S = [S_1, S_2, \ldots, S_{n-19}]$, of amino acid sequence $[aa_1, aa_2, \ldots, aa_n]$ is calculated by computing the signal for each of the $n - 19$ windows of width 20 (Extended Data Fig. 5a–d). The vector $X_i$ describes the window starting at index $i$ in the sequence. The $j$-th index in $X_i$ is $1 + V_c \times \mathrm{volume}(aa_{i+j}) + P_c \times \mathrm{PositiveCharge}(aa_{i+j}) + N_c \times \mathrm{NegativeCharge}(aa_{i+j})$, for $0 \le j < 20$, where the functions PositiveCharge and NegativeCharge return 1 if the residue has a positive or negative charge, respectively, and 0 otherwise. The constants representing weights between charge and volume, $V_c = -3.9 \times 10^{-3}$, $N_c = 4.08 \times 10^{-1}$ and $P_c = -8.16 \times 10^{-2}$, were determined empirically to minimize the average post-DTW distance of a training subset of protein traces to the model of their sequences. To weight the values in $X_i$, we use a vector PW (parabolic weight) of length 20 containing values representing a negative, centrally positioned parabolic curve. The $i$-th index in $S$ is then finally computed as the dot product of $X_i$ and PW.
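A minimal sketch of this sliding-window calculation is given below. Several details are assumptions made for illustration only: the exact shape and normalization of the parabolic weight vector, the assignment of K and R as positive and D and E as negative residues, and the volume table, which must be supplied by the caller. The toy volumes in the usage lines are placeholders and are not the values from ref. 62.

```python
VC, NC, PC = -3.9e-3, 4.08e-1, -8.16e-2   # weights quoted in the text
POSITIVE = set("KR")   # assumption; histidine is treated as neutral at pH 7.6
NEGATIVE = set("DE")   # assumption; phosphoresidues would need their own symbol


def parabolic_weights(width=20):
    """Downward-opening parabola centred on the window, scaled to sum to 1.

    The exact curve is not specified in the text; this form is an assumption.
    """
    centre = (width - 1) / 2
    half = width / 2
    raw = [1.0 - ((j - centre) / half) ** 2 for j in range(width)]
    total = sum(raw)
    return [w / total for w in raw]


def model_signal(sequence, volume, width=20):
    """Return one model value per window of `width` residues.

    `volume` maps one-letter residue codes to volumes (cm^3 per mol).
    """
    pw = parabolic_weights(width)
    signal = []
    for i in range(len(sequence) - width + 1):
        window = sequence[i:i + width]
        x = [1.0
             + VC * volume[aa]
             + PC * float(aa in POSITIVE)
             + NC * float(aa in NEGATIVE)
             for aa in window]
        signal.append(sum(xj * wj for xj, wj in zip(x, pw)))
    return signal


# Placeholder volumes for illustration only (NOT the values from ref. 62).
toy_volume = {"G": 50.0, "A": 100.0, "D": 100.0, "K": 100.0}
signal = model_signal("GAGADKGAGA" * 3, toy_volume)
print(len(signal))  # 11 windows for a 30-residue sequence
```

With equal volumes, a window of negatively charged residues gives a higher model value and a window of positively charged residues gives a lower one, which follows directly from the signs of $N_c$ and $P_c$.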

ClpX step identification

For this analysis, the signals were not scaled or downsampled; they were filtered with a low-pass Bessel filter with N = 10 and Wn = 0.7. YY dips were extracted manually, including portions of the signal that would otherwise be considered part of the VR in this study, to best capture the entire portion for which the double tyrosines contribute to the signal. The number of residues per YY dip was calculated as pw/d, where p is the mean proportion of the total translocation dwell time spent in these regions (0.318; Extended Data Fig. 3a), w is the total number of reading windows in the sequence (359; Extended Data Fig. 1) and d is the number of YY dips per read (6). We primarily used a Bayesian-based algorithm63 to identify steps, unless otherwise noted. When applying this algorithm, a minimum length of 10 observations and a threshold of 18 were used. A total of 776 YY-dip regions were analysed, comprising 45% of all the YY dips in the dataset, omitting dips affected by potential backstepping (non-monotonic steps) or excessive noise. This selection was made by excluding YY dips in which the means of the segmented steps did not decrease monotonically to the minimum and then increase monotonically. A secondary t-test-based algorithm64, which was used in a different study of ClpX stepping behaviour65, was also applied to confirm the stepping-rate results. When using the t-test-based algorithm, a minimum window length of 10 observations and a threshold P-value of 5 × 10⁻⁵ were used, and a total of 456 dips were analysed.
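The residues-per-dip estimate and the monotonicity criterion used to select dips can be sketched as follows. The function names are hypothetical, and two details are assumptions: the comparisons are strict, and the minimum is required to lie in the interior of the dip.

```python
def residues_per_dip(p, w, d):
    """Residues per YY dip, computed as p * w / d."""
    return p * w / d


def is_clean_dip(step_means):
    """True if the step means fall monotonically to a single interior
    minimum and then rise monotonically (no apparent backstepping)."""
    if len(step_means) < 3:
        return False
    k = step_means.index(min(step_means))
    if k == 0 or k == len(step_means) - 1:
        return False
    falling = all(a > b for a, b in zip(step_means[:k], step_means[1:k + 1]))
    rising = all(a < b for a, b in zip(step_means[k:], step_means[k + 1:]))
    return falling and rising


print(round(residues_per_dip(0.318, 359, 6), 1))  # 19.0
print(is_clean_dip([5, 3, 1, 2, 4]))              # True
print(is_clean_dip([5, 3, 4, 1, 2]))              # False (step back up before the minimum)
```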

YY segmentation

To identify the YY dips and VRs, a single PASTOR trace was segmented manually into each coloured section in Fig. 2a, and the remainder of the traces were aligned to it with DTW. The corresponding regions were assigned the label from the one manually segmented trace (Supplementary Fig. 4). For PASTOR-phos, two canonical traces were segmented manually, and the rest of the traces were aligned to both, and then labels were assigned according to the canonical trace with the lowest DTW distance.
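The label-transfer step can be illustrated by aligning a query trace to a manually segmented reference with DTW and copying the reference label along the warping path. This is a sketch under stated assumptions: the local cost is the absolute difference, ties in the backtracking are broken arbitrarily but deterministically, and when several reference samples align to one query sample the label of the last one is kept. The function names and the toy labels are illustrative.

```python
def dtw_path(ref, query):
    """Return the optimal DTW warping path as (ref_index, query_index) pairs."""
    n, m = len(ref), len(query)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = abs(ref[i - 1] - query[j - 1]) + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end of both signals to the start.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return path


def transfer_labels(ref, ref_labels, query):
    """Assign each query sample the label of its aligned reference sample."""
    labels = [None] * len(query)
    for i, j in dtw_path(ref, query):
        labels[j] = ref_labels[i]
    return labels


ref = [0, 0, 5, 5, 0, 0]
ref_labels = ["VR", "VR", "YY", "YY", "VR", "VR"]
query = [0, 0, 0, 5, 5, 5, 0, 0]
print(transfer_labels(ref, ref_labels, query))
# ['VR', 'VR', 'VR', 'YY', 'YY', 'YY', 'VR', 'VR']
```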

VR classification

We used scikit-learn to develop and test classical machine learning models and PyTorch to develop and test convolutional neural-network models. The test set was composed of all current traces from a given set of experiments to create an out-of-sample test set. The set of test experiments was selected using linear programming (Python package PuLP) to ensure at least 12 VRs with each amino acid in the test set while minimizing the test set size. We decided to use 12 because it gave the closest to an 80–20 train–test split: 79.6% of the VRs were in the training set and 20.4% were in the testing set (full counts are shown in Extended Data Table 1a). In classification tasks for which only VRs corresponding to a subset of amino acids were used, the test set was composed of a subset of this test set. We performed hyperparameter tuning with scikit-optimize on the training set using 5-fold cross-validation. The optimal parameters were: n_estimators = 250, min_samples_leaf = 2, max_features = ‘log2’, max_depth = 20, ccp_alpha = 0.0001, class_weight = ‘balanced_subsample’ and criterion = ‘gini’. All the results in Fig. 3b,c, Extended Data Fig. 6, Extended Data Table 2 and Supplementary Fig. 9 are from models evaluated on the test set. All the VRs containing an asparagine with a maximum transformed value above 1.3 had their labels changed to aspartate. In training all classical models, we upsampled minority classes, such that there was an equal representation of all classes in the training set. When training the convolutional neural network (CNN) in Extended Data Fig. 6c, we weighted the loss in inverse proportion to each label’s class representation in the training set. To featurize each VR, we performed principal component analysis on the vector of its DTW distances to all VRs in the training set to reduce the size of the vector to 64.
We also used the median, max, middle, mean, dip, mean absolute value of the derivative and median absolute value of the derivative of the transformed signals, as well as the standard deviation of the raw (unfiltered, unscaled) signal. The CNN had the transformed signal as input. It was trained with a stochastic gradient descent optimizer with a learning rate of 0.01, had four convolutional layers followed by a gated recurrent unit (GRU) and then a fully connected layer, and was initialized with Kaiming initialization. Max pooling and a ReLU activation function were applied after each convolutional layer. The dummy classifier was implemented with the scikit-learn dummy classifier with default parameters.
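The two class-balancing strategies mentioned above (upsampling minority classes for the classical models and inverse-frequency loss weights for the CNN) can be sketched in plain Python. The function names are hypothetical, sampling with replacement is an assumption, and the normalization of the weights follows the common n / (k × count) convention, which may differ from the one actually used.

```python
import random
from collections import Counter, defaultdict


def upsample_minority(samples, labels, seed=0):
    """Resample with replacement so that every class reaches the size
    of the largest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_samples.extend(items + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels


def inverse_frequency_weights(labels):
    """Per-class loss weights inversely proportional to class frequency."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * count) for label, count in counts.items()}


labels = ["N"] * 6 + ["D"] * 2
samples = list(range(len(labels)))
_, balanced = upsample_minority(samples, labels)
print(Counter(balanced))                  # both classes now have 6 samples
print(inverse_frequency_weights(labels))  # rarer class gets the larger weight
```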

Reread simulation

To collect the results shown in Extended Data Fig. 7d,e, we used a random forest without hyperparameter tuning and used 100 randomly selected 80–20 train–test splits. This was necessary to estimate the accuracy well enough with a large number of rereads, given the data limitation and the need to group samples in the test set.

Barcode error correction

To calculate the accuracy of barcode identification when using linear error-correcting codes, we started with our accuracy, $p_{\mathrm{VR}}$, of identifying a VR given an alphabet size, $a$, of 2, 4, 8 or 16. For a given $a$ and number of VRs, $L$, we calculated the number of bits, $n = L \times \log_2(a)$, that could be encoded in a protein. We simulated the accuracy with error correction, $p'$, when $n - k$ of the bits were allocated to linear error-correcting codes, for all integers $k = 1$ to $n$. We did this by conducting 50,000 trials of: first, encoding a random integer from 0 to $2^k$ with a generator matrix into a message of $n$ bits; second, randomly and independently, with probability $1 - p_{\mathrm{VR}}$, changing each of the $n/\log_2(a)$ consecutive sets of $\log_2(a)$ bits in the encoded message (to a different set of bits of the same length) to simulate misclassifying one VR; and third, decoding the number with syndrome decoding. We calculated $p'$ to be the percentage of trials in which the decoded number was the same as the original random number.
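To make the simulation procedure concrete, the sketch below runs it for one specific linear code, the Hamming(7,4) code, with a binary alphabet (a = 2, so each of L = 7 VRs encodes one bit, n = 7 and k = 4). The choice of code, the parity matrix and the function names are assumptions made for illustration; the analysis described above sweeps all values of k for each n and does not specify which generator matrices were used.

```python
import random

# Parity part P of a systematic Hamming(7,4) generator matrix G = [I | P].
P = [(1, 1, 0), (1, 0, 1), (0, 1, 1), (1, 1, 1)]
K, N = 4, 7


def encode(msg):
    """Encode K message bits into an N-bit codeword (message + parity)."""
    parity = [sum(msg[i] * P[i][j] for i in range(K)) % 2 for j in range(N - K)]
    return list(msg) + parity


def syndrome(word):
    """Syndrome of a received word under H = [P^T | I]."""
    return tuple((sum(word[i] * P[i][j] for i in range(K)) + word[K + j]) % 2
                 for j in range(N - K))


# Syndrome table: each single-bit error has a distinct non-zero syndrome.
SYNDROME_TO_POS = {}
for pos in range(N):
    error = [0] * N
    error[pos] = 1
    SYNDROME_TO_POS[syndrome(error)] = pos


def decode(word):
    """Correct at most one bit by syndrome lookup, then return the message."""
    word = list(word)
    s = syndrome(word)
    if s in SYNDROME_TO_POS:
        word[SYNDROME_TO_POS[s]] ^= 1
    return word[:K]


def simulate(p_vr, trials=50_000, seed=0):
    """Fraction of trials decoded correctly when each VR (one bit here)
    is misclassified independently with probability 1 - p_vr."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        msg = [rng.randint(0, 1) for _ in range(K)]
        received = [bit ^ int(rng.random() > p_vr) for bit in encode(msg)]
        correct += decode(received) == msg
    return correct / trials
```

For this code, a barcode is recovered whenever at most one of the seven VRs is misclassified, so with a per-VR accuracy of 0.95 the expected barcode accuracy is about 0.956, compared with roughly 0.81 for four unprotected bits (0.95 to the fourth power).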

Phosphorylation detection

Each section (C-terminal linker, VR V, VR GLSARRL, VR A and N-terminal linker) was extracted with YY segmentation. For each section, the transformed current was aligned to the models of all possible phosphorylation states, shown in Supplementary Fig. 12. We took the number of phosphorylations in each section to be the number in the phosphorylation-state model (Supplementary Table 2) that best matched the actual trace (lowest DTW distance). When describing the signal increase in VR GLSARRL caused by PKA (Extended Data Fig. 8a), only the portion of the section up to the (n/3)-th index, where n is the length of the YY-segmented VR GLSARRL, was used because that is where PKA causes the signal to increase, as seen in Fig. 6b.

Null-hypothesis tests

All PERMANOVA tests were done on the DTW distance matrix of signals using scikit-bio and $10^6$ permutations, unless we used a Bonferroni correction, in which case $n \times 10^6$ permutations were used, where $n$ is the number of comparisons performed. Kruskal–Wallis tests, t-tests and Mann–Whitney U tests were performed using SciPy. Reported P values were multiplied by $n$ if we noted that we used a Bonferroni correction. All tests were two-sided unless stated otherwise, and P values were considered significant if P < 0.05.
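The Bonferroni adjustment described above amounts to the following small calculation. Capping adjusted values at 1 is a common convention and an assumption here, as the text states only that P values were multiplied by n; the function names are illustrative.

```python
def bonferroni(p_values):
    """Multiply each P value by the number of comparisons, capped at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]


def permutations_needed(n_comparisons, base=10**6):
    """Permutation count scaled with the number of comparisons, so that the
    smallest attainable adjusted P value stays the same."""
    return n_comparisons * base


adjusted = bonferroni([0.01, 0.04, 0.5])
print([p < 0.05 for p in adjusted])  # [True, False, False]
print(permutations_needed(3))        # 3000000
```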

Materials availability

Protein expression plasmids are available at Addgene.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.