Abstract
Multimodal, single-cell genomics technologies enable simultaneous measurement of multiple facets of DNA and RNA processing in the cell. This creates opportunities for transcriptome-wide, mechanistic studies of cellular processing in heterogeneous cell populations, such as regulation of cell fate by transcriptional stochasticity or tumor proliferation through aberrant splicing dynamics. However, current methods for determining cell types or ‘clusters’ in multimodal data often rely on ad hoc approaches to balance or integrate measurements, and on assumptions that ignore inherent properties of the data. To enable interpretable and consistent cell cluster determination, we present meK-means (mechanistic K-means), which integrates modalities through a unifying model of transcription to learn underlying, shared biophysical states. With meK-means we can cluster cells using nascent and mature mRNA measurements, exploiting the causal, physical relationships between these modalities. This identifies shared transcription dynamics across cells, which induce the observed molecule counts, and provides an alternative definition of ‘clusters’ through the governing parameters of cellular processes.
Data availability
Raw FASTQ files or count matrices from publicly available datasets were used for analyses. The links to accession codes for these raw files are in Supplementary Table 1. All processed versions of the publicly available datasets used for analysis are available on CaltechData with the accession codes provided in Supplementary Table 1. Alternatively, all benchmarking and simulated datasets can be downloaded in a combined, compressed format from CaltechData (ref. 70). The mm10 (https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-mm10-2020-A.tar.gz) and GRCh38 (https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2020-A.tar.gz) reference genomes (2020-A version) used for pseudoalignment were downloaded from 10x Genomics. Source data are provided with this paper.
Code availability
All of the code used to generate the figures and results in the paper, as well as a Google Colaboratory notebook with example usage of meK-means, is available at https://github.com/pachterlab/CGP_2023 and on Zenodo (ref. 71). Mechanistic K-means is incorporated as part of the pip-installable Monod package (ref. 65) for single-cell, CME-based parameter inference, whose documentation can be found at https://monod-examples.readthedocs.io/en/latest/.
References
1. La Manno, G. et al. Molecular architecture of the developing mouse brain. Nature 596, 92–96 (2021).
2. Chari, T. et al. Whole-animal multiplexed single-cell RNA-seq reveals transcriptional shifts across Clytia medusa cell types. Sci. Adv. 7, eabh1683 (2021).
3. Chamberlin, J. T., Lee, Y., Marth, G. T. & Quinlan, A. R. Differences in molecular sampling and data processing explain variation among single-cell and single-nucleus RNA-seq experiments. Genome Res. 34, 179–188 (2024).
4. Reyes, M., Billman, K., Hacohen, N. & Blainey, P. C. Simultaneous profiling of gene expression and chromatin accessibility in single cells. Adv. Biosyst. 3, 1900065 (2019).
5. Xie, H. & Ding, X. The intriguing landscape of single-cell protein analysis. Adv. Sci. 9, e2105932 (2022).
6. Rabani, M. et al. Metabolic labeling of RNA uncovers principles of RNA production and degradation dynamics in mammalian cells. Nat. Biotechnol. 29, 436–442 (2011).
7. Munsky, B., Fox, Z. & Neuert, G. Integrating single-molecule experiments and discrete stochastic models to understand heterogeneous gene transcription dynamics. Methods 85, 12–21 (2015).
8. Xu, Z., Sziraki, A., Lee, J., Zhou, W. & Cao, J. Dissecting key regulators of transcriptome kinetics through scalable single-cell RNA profiling of pooled CRISPR screens. Nat. Biotechnol. 42, 1218–1223 (2023).
9. Chen, P.-T., Zoller, B., Levo, M. & Gregor, T. Gene activity fully predicts transcriptional bursting dynamics. Preprint at https://arxiv.org/abs/2304.08770 (2023).
10. Zeng, H. What is a cell type and how to define it? Cell 185, 2739–2755 (2022).
11. Domcke, S. & Shendure, J. A reference cell tree will serve science better than a reference cell atlas. Cell 186, 1103–1114 (2023).
12. De Meo, P., Ferrara, E., Fiumara, G. & Provetti, A. Generalized Louvain method for community detection in large networks. In 2011 11th International Conference on Intelligent Systems Design and Applications 88–93 (IEEE, 2011).
13. Traag, V. A., Waltman, L. & Eck, N. J. From Louvain to Leiden: guaranteeing well-connected communities. Sci. Rep. 9, 5233 (2019).
14. Yao, Z. et al. A transcriptomic and epigenomic cell atlas of the mouse primary motor cortex. Nature 598, 103–110 (2021).
15. Chen, S. et al. Dissecting heterogeneous cell populations across drug and disease conditions with PopAlign. Proc. Natl Acad. Sci. USA 117, 28784–28794 (2020).
16. Cai, B., Zhang, J. & Sun, W. W. Jointly modeling and clustering tensors in high dimensions. Preprint at https://arxiv.org/abs/2104.07773 (2021).
17. Aibar, S. et al. SCENIC: single-cell regulatory network inference and clustering. Nat. Methods 14, 1083–1086 (2017).
18. Kiselev, V. Y., Andrews, T. S. & Hemberg, M. Publisher correction: challenges in unsupervised clustering of single-cell RNA-seq data. Nat. Rev. Genet. 20, 310 (2019).
19. McInnes, L., Healy, J. & Melville, J. UMAP: Uniform Manifold Approximation and Projection for dimension reduction. Preprint at https://arxiv.org/abs/1802.03426 (2018).
20. Becht, E. et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. 37, 38–44 (2018).
21. You, Y. et al. Benchmarking UMI-based single-cell RNA-seq preprocessing workflows. Genome Biol. 22, 339 (2021).
22. Tabula Muris Consortium. Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris. Nature 562, 367–372 (2018).
23. Han, J. et al. Human serous cavity macrophages and dendritic cells possess counterparts in the mouse with a distinct distribution between species. Nat. Immunol. 25, 155–165 (2024).
24. Sun, G. et al. A single-cell transcriptomic atlas of the lungs of patients with pulmonary tuberculosis. Preprint at Research Square https://doi.org/10.21203/rs.3.rs-2752256/v1 (2024).
25. Hjörleifsson, K. et al. Accurate quantification of single-nucleus and single-cell RNA-seq transcripts. Preprint at bioRxiv https://doi.org/10.1101/2022.12.02.518832 (2022).
26. Sullivan, D. K. et al. kallisto, bustools, and kb-python for quantifying bulk, single-cell, and single-nucleus RNA-seq. Preprint at bioRxiv https://doi.org/10.1101/2023.11.21.568164 (2024).
27. Bhat, P. et al. Genome organization around nuclear speckles drives mRNA splicing efficiency. Nature 629, 1165–1173 (2024).
28. Mayère, C. et al. Single-cell transcriptomics reveal temporal dynamics of critical regulators of germ cell fate during mouse sex determination. FASEB J. 35, e21452 (2021).
29. Xiao, C., Chen, Y., Meng, Q., Wei, L. & Zhang, X. Benchmarking multi-omics integration algorithms across single-cell RNA and ATAC data. Brief. Bioinform. 25, bbae095 (2024).
30. Heumos, L. et al. Best practices for single-cell analysis across modalities. Nat. Rev. Genet. 24, 550–572 (2023).
31. Hao, Y. et al. Integrated analysis of multimodal single-cell data. Cell 184, 3573–3587 (2021).
32. Gayoso, A. et al. Joint probabilistic modeling of single-cell multi-omic data with totalVI. Nat. Methods 18, 272–282 (2021).
33. Lin, X., Tian, T., Wei, Z. & Hakonarson, H. Clustering of single-cell multi-omics data with a multimodal deep learning method. Nat. Commun. 13, 7705 (2022).
34. Gupta, R. & Claassen, M. Factorial state-space modelling for kinetic clustering and lineage inference. Preprint at bioRxiv https://doi.org/10.1101/2023.08.21.554135 (2023).
35. Gorin, G., Fang, M., Chari, T. & Pachter, L. RNA velocity unraveled. PLoS Comput. Biol. 18, e1010492 (2022).
36. Bokes, P., King, J. R., Wood, A. T. A. & Loose, M. Exact and approximate distributions of protein and mRNA levels in the low-copy regime of gene expression. J. Math. Biol. 64, 829–854 (2012).
37. Singh, A. & Bokes, P. Consequences of mRNA transport on stochastic variability in protein levels. Biophys. J. 103, 1087–1096 (2012).
38. Gorin, G. & Pachter, L. Length biases in single-cell RNA sequencing of pre-mRNA. Biophys. Rep. 3, 100097 (2023).
39. Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).
40. MacQueen, J. et al. Some methods for classification and analysis of multivariate observations. In Proc. 5th Berkeley Symposium on Mathematical Statistics and Probability 281–297 (Univ. California, Berkeley, 1967).
41. Cao, Z.-J. & Gao, G. Multi-omics single-cell data integration and regulatory inference with graph-linked embedding. Nat. Biotechnol. 40, 1458–1466 (2022).
42. Argelaguet, R. et al. MOFA+: a statistical framework for comprehensive integration of multi-modal single-cell data. Genome Biol. 21, 111 (2020).
43. Anders, S. & Huber, W. Differential expression analysis for sequence count data. Genome Biol. 11, R106 (2010).
44. Xiong, Y. et al. A comparison of mRNA sequencing with random primed and 3′-directed libraries. Sci. Rep. 7, 14626 (2017).
45. Amarasinghe, S. L. et al. Opportunities and challenges in long-read sequencing data analysis. Genome Biol. 21, 30 (2020).
46. Andrews, G. L. & Mastick, G. S. R-cadherin is a Pax6-regulated, growth-promoting cue for pioneer axons. J. Neurosci. 23, 9873–9880 (2003).
47. Kogo, H. et al. HORMAD2 is essential for synapsis surveillance during meiotic prophase via the recruitment of ATR activity. Genes Cells 17, 897–912 (2012).
48. Liang, J., Shi, J., Wang, N., Zhao, H. & Sun, J. Tuning the protein phosphorylation by receptor type protein tyrosine phosphatase epsilon (PTPRE) in normal and cancer cells. J. Cancer 10, 105–111 (2019).
49. Koedoot, E., Wolters, L., van de Water, B. & Le Dévédec, S. E. Splicing regulatory factors in breast cancer hallmarks and disease progression. Oncotarget 10, 6021–6037 (2019).
50. Amodio, N. et al. MALAT1: a druggable long non-coding RNA for targeted anti-cancer approaches. J. Hematol. Oncol. 11, 63 (2018).
51. Yeo, S. K. et al. Single-cell RNA-sequencing reveals distinct patterns of cell state heterogeneity in mouse models of breast cancer. eLife 9, e58810 (2020).
52. Gökmen-Polar, Y. et al. Splicing factor ESRP1 controls ER-positive breast cancer by altering metabolic pathways. EMBO Rep. 20, e46078 (2019).
53. Qiao, F.-H., Tu, M. & Liu, H.-Y. Role of MALAT1 in gynecological cancers: pathologic and therapeutic aspects. Oncol. Lett. 21, 333 (2021).
54. Chen, Q., Zhu, C. & Jin, Y. The oncogenic and tumor suppressive functions of the long noncoding RNA MALAT1: an emerging controversy. Front. Genet. 11, 93 (2020).
55. Dumitrascu, B., Villar, S., Mixon, D. G. & Engelhardt, B. E. Optimal marker gene selection for cell type discrimination in single cell analyses. Nat. Commun. 12, 1186 (2021).
56. Chen, X., Chen, S. & Thomson, M. Minimal gene set discovery in single-cell mRNA-seq datasets with ActiveSVM. Nat. Comput. Sci. 2, 387–398 (2022).
57. Kreutz, C. et al. Encyclopedia of Systems 1576–1579 (Springer, 2013).
58. Fox, Z. R., Neuert, G. & Munsky, B. Optimal design of single-cell experiments within temporally fluctuating environments. Complexity https://doi.org/10.1155/2020/8536365 (2020).
59. Carilli, M., Gorin, G., Choi, Y., Chari, T. & Pachter, L. Biophysical modeling with variational autoencoders for bimodal, single-cell RNA sequencing data. Nat. Methods 21, 1466–1469 (2024).
60. Sukys, A., Öcal, K. & Grima, R. Approximating solutions of the Chemical Master equation using neural networks. iScience 25, 105010 (2022).
61. Gorin, G., Carilli, M., Chari, T. & Pachter, L. Spectral neural approximations for models of transcriptional dynamics. Biophys. J. 123, 2892–2901 (2024).
62. Gorin, G., Vastola, J. J., Fang, M. & Pachter, L. Interpretable and tractable models of transcriptional noise for the rational design of single-molecule quantification experiments. Nat. Commun. 13, 7620 (2022).
63. Felce, C., Gorin, G. & Pachter, L. A biophysical model for ATAC-seq data analysis. Preprint at bioRxiv https://doi.org/10.1101/2024.01.25.577262 (2024).
64. Friedman, N., Cai, L. & Xie, X. S. Stochasticity in gene expression as observed by single-molecule experiments in live cells. Israel J. Chem. 49, 333–342 (2009).
65. Gorin, G. & Pachter, L. Monod: mechanistic analysis of single-cell RNA sequencing count data. Preprint at bioRxiv https://doi.org/10.1101/2022.06.11.495771 (2022).
66. Larsson, A. J. M. et al. Genomic encoding of transcriptional burst kinetics. Nature 565, 251–254 (2019).
67. Bray, N. L., Pimentel, H., Melsted, P. & Pachter, L. Erratum: near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol. 34, 888 (2016).
68. Melsted, P. et al. Modular, efficient and constant-memory single-cell RNA-seq preprocessing. Nat. Biotechnol. 39, 813–818 (2021).
69. Jiang, S. et al. Cell Taxonomy: a curated repository of cell types with multifaceted characterization. Nucleic Acids Res. 51, D853–D860 (2023).
70. Chari, T. meK-means all benchmark and simulation datasets. CaltechDATA https://doi.org/10.22002/v4gg9-qsr24 (2024).
71. Chari, T. & Pachter, L. pachterlab/CGP_2023: meK-means repo DOI (v1.0.0). Zenodo https://doi.org/10.5281/zenodo.13253144 (2024).
Acknowledgements
We thank M. Fang, P. Bhat, C. Felce and L. Luebbert for their helpful feedback on the manuscript and visualizations, and Á. Gálvez-Merchán for their feedback on gene selection in PBMC (blood cell) data. T.C., G.G. and L.P. were funded, in part, by NIH (grant no. 5UM1HG012077-02). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.
Author information
Contributions
T.C. and G.G. conceived the idea for biophysical, multimodal clustering. T.C. developed the algorithm, performed the computations and generated the results and figures. T.C., G.G. and L.P. contributed to interpretation of the results. L.P. supervised the project. All authors discussed the results and contributed to writing and editing the manuscript.
Ethics declarations
Competing interests
G.G. is an employee of Fauna Bio. The other authors declare no competing interests.
Peer review
Peer review information
Nature Computational Science thanks the anonymous reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editor: Jie Pan, in collaboration with the Nature Computational Science team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Length-Bias Model in meK-means.
a) High-level diagram of the input and output of meK-means (from multimodal data to a cluster × gene × parameter matrix). meK-means fits data to the Length-Bias Model of transcription, with transcription rate k, mRNA burst size b, splicing rate β and mRNA degradation rate γ. b) Detailed outline of the Length-Bias CME Model, with rates denoted per gene g. The model includes length-dependent technical sampling (C, λ, p) of the biological molecules (N_u, N_s) produced by the transcription processes; this sampling occurs during the sequencing process. Length-dependent capture produces the counts M_u, M_s, and the final sequencing-based sampling produces the observed counts U, S, which make up the final cell × gene count matrices. Created with BioRender.com.
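The biological part of the model in this caption (bursts of transcription at rate k with burst size b, splicing at rate β, degradation at rate γ) can be illustrated with a small stochastic simulation. The sketch below is not taken from the paper or from the Monod package. It is a generic Gillespie simulation for one gene in one cell, and it makes two assumptions the caption does not state: burst sizes are geometrically distributed with mean b, and the length-dependent technical sampling step (C, λ, p) is left out entirely. The function name and parameter values are hypothetical.

```python
import numpy as np


def simulate_bursty_transcription(k, b, beta, gamma, t_end, seed=0):
    """Gillespie simulation of bursty transcription for one gene in one cell.

    Returns the time-averaged nascent and mature molecule counts.
    Assumption: geometric burst sizes with mean b; no technical sampling.
    """
    rng = np.random.default_rng(seed)
    t, n_u, n_s = 0.0, 0, 0          # time, nascent count, mature count
    area_u, area_s = 0.0, 0.0        # time integrals of the two counts
    while True:
        rate_splice = beta * n_u     # each nascent molecule splices at rate beta
        rate_degrade = gamma * n_s   # each mature molecule degrades at rate gamma
        total = k + rate_splice + rate_degrade
        dt = rng.exponential(1.0 / total)  # waiting time to the next reaction
        if t + dt >= t_end:
            # Close the final interval at t_end and stop.
            area_u += n_u * (t_end - t)
            area_s += n_s * (t_end - t)
            break
        area_u += n_u * dt
        area_s += n_s * dt
        t += dt
        u = rng.random() * total     # choose which reaction fires
        if u < k:
            # Burst: geometric on {0, 1, 2, ...} with mean b.
            n_u += rng.geometric(1.0 / (1.0 + b)) - 1
        elif u < k + rate_splice:
            n_u -= 1                 # splicing converts nascent to mature
            n_s += 1
        else:
            n_s -= 1                 # degradation removes a mature molecule
    return area_u / t_end, area_s / t_end


if __name__ == "__main__":
    k, b, beta, gamma = 1.0, 5.0, 2.0, 1.0
    mean_u, mean_s = simulate_bursty_transcription(k, b, beta, gamma, t_end=5000.0)
    print(f"nascent: simulated {mean_u:.2f}, steady-state theory {k * b / beta:.2f}")
    print(f"mature:  simulated {mean_s:.2f}, steady-state theory {k * b / gamma:.2f}")
```

Under these assumptions the steady-state means are k·b/β for nascent and k·b/γ for mature molecules, so the simulated averages should approach those values as t_end grows. Reproducing the observed counts U, S would additionally require the capture and sequencing steps that this sketch omits.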
Supplementary information
Supplementary Information
Algorithm 1, Table 1, Figs. 1–9 and Note 1.
Source data
Source Data Fig. 2
Raw values (csvs) for each plot in Fig. 2.
Source Data Fig. 3
Raw values (csvs) for each plot in Fig. 3.
Source Data Fig. 4
Raw values (csvs) for each plot in Fig. 4.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Chari, T., Gorin, G. & Pachter, L. Biophysically interpretable inference of cell types from multimodal sequencing data. Nat Comput Sci 4, 677–689 (2024). https://doi.org/10.1038/s43588-024-00689-2
DOI: https://doi.org/10.1038/s43588-024-00689-2