
Biophysically interpretable inference of cell types from multimodal sequencing data


Multimodal, single-cell genomics technologies enable simultaneous measurement of multiple facets of DNA and RNA processing in the cell. This creates opportunities for transcriptome-wide, mechanistic studies of cellular processing in heterogeneous cell populations, such as regulation of cell fate by transcriptional stochasticity or tumor proliferation through aberrant splicing dynamics. However, current methods for determining cell types or ‘clusters’ in multimodal data often rely on ad hoc approaches to balance or integrate measurements, and on assumptions that ignore inherent properties of the data. To enable interpretable and consistent cell cluster determination, we present meK-means (mechanistic K-means), which integrates modalities through a unifying model of transcription to learn underlying, shared biophysical states. With meK-means we can cluster cells using nascent and mature mRNA measurements, exploiting the causal, physical relationships between these modalities. This identifies shared transcription dynamics across cells, which give rise to the observed molecule counts, and provides an alternative definition of ‘clusters’ in terms of the parameters governing cellular processes.
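To make the idea of likelihood-based clustering on nascent and mature counts concrete, the following is a minimal sketch of a generic hard-assignment (K-means-style) loop. It is not the authors' implementation: it assumes independent Poisson likelihoods per gene and modality as a stand-in for the biophysical model of transcription, and the names `fit_rates`, `log_lik`, `hard_em_cluster` and `init_labels` are hypothetical and not part of Monod.

```python
import math
import random

def fit_rates(cells, labels, n_clusters):
    """Per-cluster mean count of every feature (the Poisson maximum-likelihood rate)."""
    n_feat = len(cells[0])
    rates = []
    for c in range(n_clusters):
        # An empty cluster falls back to all cells so that its rates stay defined.
        members = [x for x, lab in zip(cells, labels) if lab == c] or cells
        rates.append([sum(x[j] for x in members) / len(members) + 1e-6
                      for j in range(n_feat)])
    return rates

def log_lik(x, rate):
    """Poisson log-likelihood of one cell, dropping log(x!) (constant across clusters)."""
    return sum(xi * math.log(r) - r for xi, r in zip(x, rate))

def hard_em_cluster(U, S, n_clusters, n_iter=50, seed=0, init_labels=None):
    """Cluster cells from nascent (U) and mature (S) count lists, one list per cell.

    Assumption: Poisson stand-in likelihood, not a mechanistic transcription model.
    """
    rng = random.Random(seed)
    cells = [u + s for u, s in zip(U, S)]  # concatenate both modalities per cell
    labels = list(init_labels) if init_labels is not None else [
        rng.randrange(n_clusters) for _ in cells]
    for _ in range(n_iter):
        rates = fit_rates(cells, labels, n_clusters)           # re-fit parameters
        new_labels = [max(range(n_clusters), key=lambda c: log_lik(x, rates[c]))
                      for x in cells]                           # hard assignment
        if new_labels == labels:
            break
        labels = new_labels
    return labels, fit_rates(cells, labels, n_clusters)

if __name__ == "__main__":
    U = [[1, 0], [0, 1], [1, 1], [10, 12], [11, 9], [9, 10]]
    S = [[2, 1], [1, 2], [2, 2], [20, 22], [21, 19], [19, 20]]
    labels, rates = hard_em_cluster(U, S, n_clusters=2)
    print(labels)
```

In this sketch the two modalities are simply concatenated and treated as independent, which is exactly the kind of simplification a shared mechanistic model is meant to avoid; swapping `log_lik` for a joint likelihood over nascent and mature counts would be the natural next step.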

Fig. 1: Standard clustering results across possible count matrix inputs.
Fig. 2: Mechanistic K-means inference and simulation performance.
Fig. 3: Mechanistic K-means benchmark performance.
Fig. 4: Mechanistic K-means for biological discovery.

Data availability

Raw FASTQ files or count matrices from publicly available datasets were used for analyses. The links to accession codes for these raw files are in Supplementary Table 1. All processed versions of the publicly available datasets used for analysis are available on CaltechData with the accession codes provided in Supplementary Table 1. Alternatively, all benchmarking and simulated datasets can be downloaded in a combined, compressed format from CaltechData (ref. 70). The mm10 and GRCh38 (2020-A version) reference genomes used for pseudoalignment were downloaded from 10x Genomics. Source data are provided with this paper.

Code availability

All of the code used to generate the figures and results in the paper, as well as a Google Colaboratory notebook with example usage of meK-means, is available at and on Zenodo (ref. 71). Mechanistic K-means is incorporated as a part of the pip-installable Monod package (ref. 65) for single-cell, CME-based parameter inference, whose documentation can be found at


Acknowledgements

We thank M. Fang, P. Bhat, C. Felce and L. Luebbert for their helpful feedback on the manuscript and visualizations, and Á. Gálvez-Merchán for their feedback on gene selection in PBMC (blood cell) data. T.C., G.G. and L.P. were funded, in part, by NIH (grant no. 5UM1HG012077-02). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Author information

Authors and Affiliations



Contributions

T.C. and G.G. conceived the idea for biophysical, multimodal clustering. T.C. developed the algorithm, performed the computations and generated the results and figures. T.C., G.G. and L.P. contributed to interpretation of the results. L.P. supervised the project. All authors discussed the results and contributed to writing and editing the manuscript.

Corresponding author

Correspondence to Lior Pachter.

Ethics declarations

Competing interests

G.G. is an employee of Fauna Bio. The other authors declare no competing interests.

Peer review

Peer review information

Nature Computational Science thanks the anonymous reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editor: Jie Pan, in collaboration with the Nature Computational Science team.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Length-Bias Model in meK-means.

a) High-level diagram of the input and output of meK-means (from multimodal data to a cluster × gene × parameter matrix). meK-means fits data to the Length-Bias Model of transcription, with transcription rate k, mRNA burst size b, splicing rate β and mRNA degradation rate γ. b) Detailed outline of the Length-Bias CME Model, with rates denoted per gene g. The model includes length-dependent technical sampling (C, λ, p) of the biological molecules (N_u, N_s) produced by the transcription processes; this sampling occurs during the sequencing process. Length-dependent capture produces the counts M_u, M_s, and the final sequencing-based sampling produces the observed counts U, S, which form the final cell × gene count matrices. Created with
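The biological part of this model can be explored with a short stochastic simulation. The sketch below is a Gillespie simulation of one gene in one cell under stated assumptions: k is treated as the rate at which bursts arrive, burst sizes are geometric with mean b, nascent molecules are spliced at rate β and mature molecules are degraded at rate γ. Technical sampling is reduced to a single Bernoulli thinning step with a hypothetical capture probability; the length dependence and the parameters C, λ and p from the caption are not modelled. The function names are illustrative only.

```python
import random

def geometric_burst(b, rng):
    """Draw a burst size from a geometric distribution on {0, 1, 2, ...} with mean b."""
    p = 1.0 / (1.0 + b)
    n = 0
    while rng.random() > p:
        n += 1
    return n

def simulate_cell(k, b, beta, gamma, t_end, rng):
    """Gillespie simulation; returns (nascent, mature) molecule counts at time t_end."""
    t, n_u, n_s = 0.0, 0, 0
    while True:
        # Propensities: burst arrival, splicing, degradation.
        rates = (k, beta * n_u, gamma * n_s)
        total = sum(rates)
        t += rng.expovariate(total)       # waiting time to the next reaction
        if t > t_end:
            return n_u, n_s
        r = rng.random() * total          # choose which reaction fires
        if r < rates[0]:
            n_u += geometric_burst(b, rng)
        elif r < rates[0] + rates[1]:
            n_u -= 1
            n_s += 1
        else:
            n_s -= 1

def capture(n, p, rng):
    """Bernoulli thinning: each of n molecules is observed with probability p (assumed)."""
    return sum(rng.random() < p for _ in range(n))

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(5):
        n_u, n_s = simulate_cell(k=2.0, b=5.0, beta=1.0, gamma=0.5, t_end=20.0, rng=rng)
        print(n_u, n_s, capture(n_u, 0.3, rng), capture(n_s, 0.3, rng))
```

Under these assumptions the steady-state means are kb/β for nascent and kb/γ for mature molecules, which gives a quick sanity check on simulated populations before any sampling step is applied.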

Supplementary information

Supplementary Information

Algorithm 1, Table 1, Figs. 1–9 and Note 1.

Reporting Summary

Peer Review File

Source data

Source Data Fig. 2

Raw values (CSV files) for each plot in Fig. 2.

Source Data Fig. 3

Raw values (CSV files) for each plot in Fig. 3.

Source Data Fig. 4

Raw values (CSV files) for each plot in Fig. 4.

