Skip to main content

Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

  • Article
  • Published:

Biophysically interpretable inference of cell types from multimodal sequencing data

Abstract

Multimodal, single-cell genomics technologies enable simultaneous measurement of multiple facets of DNA and RNA processing in the cell. This creates opportunities for transcriptome-wide, mechanistic studies of cellular processing in heterogeneous cell populations, such as regulation of cell fate by transcriptional stochasticity or tumor proliferation through aberrant splicing dynamics. However, current methods for determining cell types or ‘clusters’ in multimodal data often rely on ad hoc approaches to balance or integrate measurements, and assumptions ignoring inherent properties of the data. To enable interpretable and consistent cell cluster determination, we present meK-means (mechanistic K-means) which integrates modalities through a unifying model of transcription to learn underlying, shared biophysical states. With meK-means we can cluster cells with nascent and mature mRNA measurements, utilizing the causal, physical relationships between these modalities. This identifies shared transcription dynamics across cells, which induce the observed molecule counts, and provides an alternative definition for ‘clusters’ through the governing parameters of cellular processes.

This is a preview of subscription content, access via your institution

Access options

Buy this article

Prices may be subject to local taxes which are calculated during checkout

Fig. 1: Standard clustering results across possible count matrix inputs.
Fig. 2: Mechanistic K-means inference and simulation performance.
Fig. 3: Mechanistic K-means benchmark performance.
Fig. 4: Mechanistic K-means for biological discovery.

Similar content being viewed by others

Data availability

Raw FASTQ files or count matrices from publicly available datasets were used for analyses. The links to accession codes for these raw files are in Supplementary Table 1. All processed versions of the publicly available datasets used for analysis are available on CaltechData with the accession codes provided in Supplementary Table 1. Alternatively, all benchmarking and simulated datasets can be downloaded in a combined, compressed format from CaltechData70. The mm10 https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-mm10-2020-A.tar.gz and GRCh38 https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2020-A.tar.gz (2020-A version) reference genomes used for pseudoalignment were downloaded from 10× Genomics. Source data are provided with this paper.

Code availability

All of the code used to generate the figures and results in the paper, as well as a Google Colaboratory notebook with example usage of meK-means, is available at https://github.com/pachterlab/CGP_2023 and on Zenodo71. Mechanistic K-means is incorporated as a part of the pip installable Monod package65 for single-cell, CME-based parameter inference, whose documentation can be found at https://monod-examples.readthedocs.io/en/latest/.

References

  1. La Manno, G. et al. Molecular architecture of the developing mouse brain. Nature 596, 92–96 (2021).

  2. Chari, T. et al. Whole-animal multiplexed single-cell RNA-seq reveals transcriptional shifts across Clytia medusa cell types. Sci Adv 7, eabh1683 (2021).

    Article  Google Scholar 

  3. Chamberlin, J. T., Lee, Y., Marth, G. T. & Quinlan, A. R. Differences in molecular sampling and data processing explain variation among single-cell and single-nucleus RNA-seq experiments. Genome Res. 34, 179–188 (2024).

    Article  Google Scholar 

  4. Reyes, M., Billman, K., Hacohen, N. & Blainey, P. C. Simultaneous profiling of gene expression and chromatin accessibility in single cells. Adv Biosyst 3, 1900065 (2019).

    Article  Google Scholar 

  5. Xie, H. & Ding, X. The intriguing landscape of single-cell protein analysis. Adv. Sci. 9, e2105932 (2022).

    Article  Google Scholar 

  6. Rabani, M. et al. Metabolic labeling of RNA uncovers principles of RNA production and degradation dynamics in mammalian cells. Nat. Biotechnol. 29, 436–442 (2011).

    Article  Google Scholar 

  7. Munsky, B., Fox, Z. & Neuert, G. Integrating single-molecule experiments and discrete stochastic models to understand heterogeneous gene transcription dynamics. Methods 85, 12–21 (2015).

    Article  Google Scholar 

  8. Xu, Z., Sziraki, A., Lee, J., Zhou, W. & Cao, J. Dissecting key regulators of transcriptome kinetics through scalable single-cell RNA profiling of pooled CRISPR screens. Nat. Biotechnol. 42, 1218–1223 (2023).

  9. Chen, P.-T., Zoller, B., Levo, M. & Gregor, T. Gene activity fully predicts transcriptional bursting dynamics. Preprint at https://arxiv.org/abs/2304.08770 (2023).

  10. Zeng, H. What is a cell type and how to define it? Cell 185, 2739–2755 (2022).

    Article  Google Scholar 

  11. Domcke, S. & Shendure, J. A reference cell tree will serve science better than a reference cell atlas. Cell 186, 1103–1114 (2023).

    Article  Google Scholar 

  12. De Meo, P., Ferrara, E., Fiumara, G. & Provetti, A. Generalized Louvain method for community detection in large networks. In 2011 11th International Conference on Intelligent Systems Design and Applications 88–93 (IEEE, 2011).

  13. Traag, V. A., Waltman, L. & Eck, N. J. From Louvain to Leiden: guaranteeing well-connected communities. Sci. Rep. 9, 5233 (2019).

    Article  Google Scholar 

  14. Yao, Z. et al. A transcriptomic and epigenomic cell atlas of the mouse primary motor cortex. Nature 598, 103–110 (2021).

    Article  Google Scholar 

  15. Chen, S. et al. Dissecting heterogeneous cell populations across drug and disease conditions with PopAlign. Proc. Natl Acad. Sci. USA 117, 28784–28794 (2020).

  16. Cai, B., Zhang, J. & Sun, W. W. Jointly modeling and clustering tensors in high dimensions. Preprint at https://arxiv.org/abs/2104.07773 (2021).

  17. Aibar, S. et al. SCENIC: single-cell regulatory network inference and clustering. Nat. Methods 14, 1083–1086 (2017).

    Article  Google Scholar 

  18. Kiselev, V. Y., Andrews, T. S. & Hemberg, M. Publisher correction: challenges in unsupervised clustering of single-cell RNA-seq data. Nat. Rev. Genet. 20, 310 (2019).

    Article  Google Scholar 

  19. McInnes, L., Healy, J. & Melville, J. UMAP: Uniform Manifold Approximation and Projection for dimension reduction. Preprint at https://arxiv.org/abs/1802.03426 (2018).

  20. Becht, E. et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. 37, 38–44 (2018).

  21. You, Y. et al. Benchmarking UMI-based single-cell RNA-seq preprocessing workflows. Genome Biol. 22, 339 (2021).

    Article  Google Scholar 

  22. Tabula Muris Consortium. Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris. Nature 562, 367–372 (2018).

    Article  Google Scholar 

  23. Han, J. et al. Human serous cavity macrophages and dendritic cells possess counterparts in the mouse with a distinct distribution between species. Nat. Immunol. 25, 155–165 (2024).

    Article  Google Scholar 

  24. Sun, G. et al. A single-cell transcriptomic atlas of the lungs of patients with pulmonary tuberculosis. Preprint at Research Square https://doi.org/10.21203/rs.3.rs-2752256/v1 (2024).

  25. Hjörleifsson, K. et al. Accurate quantification of single-nucleus and single-cell RNA-seq transcripts. Preprint at bioRxiv https://doi.org/10.1101/2022.12.02.518832 (2022).

  26. Sullivan, D. K. et al. kallisto, bustools, and kb-python for quantifying bulk, single-cell, and single-nucleus RNA-seq. Preprint at bioRxiv https://doi.org/10.1101/2023.11.21.568164 (2024).

  27. Bhat, P. et al. Genome organization around nuclear speckles drives mRNA splicing efficiency. Nature 629, 1165–1173 (2024).

  28. Mayère, C. et al. Single-cell transcriptomics reveal temporal dynamics of critical regulators of germ cell fate during mouse sex determination. FASEB J. 35, e21452 (2021).

  29. Xiao, C., Chen, Y., Meng, Q., Wei, L. & Zhang, X. Benchmarking multi-omics integration algorithms across single-cell RNA and ATAC data. Brief. Bioinform. 25, bbae095 (2024).

  30. Heumos, L. et al. Best practices for single-cell analysis across modalities. Nat. Rev. Genet. 24, 550–572 (2023).

    Article  Google Scholar 

  31. Hao, Y. et al. Integrated analysis of multimodal single-cell data. Cell 184, 3573–3587 (2021).

  32. Gayoso, A. et al. Joint probabilistic modeling of single-cell multi-omic data with totalVI. Nat. Methods 18, 272–282 (2021).

    Article  Google Scholar 

  33. Lin, X., Tian, T., Wei, Z. & Hakonarson, H. Clustering of single-cell multi-omics data with a multimodal deep learning method. Nat. Commun. 13, 7705 (2022).

    Article  Google Scholar 

  34. Gupta, R. & Claassen, M. Factorial state-space modelling for kinetic clustering and lineage inference. Preprint at bioRxiv https://doi.org/10.1101/2023.08.21.554135 (2023).

  35. Gorin, G., Fang, M., Chari, T. & Pachter, L. RNA velocity unraveled. PLoS Comput. Biol. 18, e1010492 (2022).

    Article  Google Scholar 

  36. Bokes, P., King, J. R., Wood, A. T. A. & Loose, M. Exact and approximate distributions of protein and mRNA levels in the low-copy regime of gene expression. J. Math. Biol. 64, 829–854 (2012).

  37. Singh, A. & Bokes, P. Consequences of mRNA transport on stochastic variability in protein levels. Biophys. J. 103, 1087–1096 (2012).

    Article  Google Scholar 

  38. Gorin, G. & Pachter, L. Length biases in single-cell RNA sequencing of pre-mRNA. Biophys. Rep. 3, 100097 (2023).

    Google Scholar 

  39. Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).

    Article  Google Scholar 

  40. MacQueen, J. et al. Some methods for classification and analysis of multivariate observations. In Proc. 5th Berkeley Symposium on Mathematical Statistics and Probability 281–297 (Univ. California, Berkeley, 1967).

  41. Cao, Z.-J. & Gao, G. Multi-omics single-cell data integration and regulatory inference with graph-linked embedding. Nat. Biotechnol. 40, 1458–1466 (2022).

  42. Argelaguet, R. et al. MOFA+: a statistical framework for comprehensive integration of multi-modal single-cell data. Genome Biol. 21, 111 (2020).

    Article  Google Scholar 

  43. Anders, S. & Huber, W. Differential expression analysis for sequence count data. Genome Biol. 11, R106 (2010).

    Article  Google Scholar 

  44. Xiong, Y. et al. A comparison of mRNA sequencing with random primed and 3′-directed libraries. Sci. Rep. 7, 14626 (2017).

  45. Amarasinghe, S. L. et al. Opportunities and challenges in long-read sequencing data analysis. Genome Biol. 21, 30 (2020).

    Article  Google Scholar 

  46. Andrews, G. L. & Mastick, G. S. R-cadherin is a Pax6-regulated, growth-promoting cue for pioneer axons. J. Neurosci. 23, 9873–9880 (2003).

    Article  Google Scholar 

  47. Kogo, H. et al. HORMAD2 is essential for synapsis surveillance during meiotic prophase via the recruitment of ATR activity. Genes Cells 17, 897–912 (2012).

    Article  Google Scholar 

  48. Liang, J., Shi, J., Wang, N., Zhao, H. & Sun, J. Tuning the protein phosphorylation by receptor type protein tyrosine phosphatase epsilon (PTPRE) in normal and cancer cells. J. Cancer 10, 105–111 (2019).

    Article  Google Scholar 

  49. Koedoot, E., Wolters, L., van de Water, B. & Le Dévédec, S. E. Splicing regulatory factors in breast cancer hallmarks and disease progression. Oncotarget 10, 6021–6037 (2019).

  50. Amodio, N. et al. MALAT1: a druggable long non-coding RNA for targeted anti-cancer approaches. J. Hematol. Oncol. 11, 63 (2018).

    Article  Google Scholar 

  51. Yeo, S. K. et al. Single-cell RNA-sequencing reveals distinct patterns of cell state heterogeneity in mouse models of breast cancer. eLife 9, e58810(2020).

  52. Gökmen-Polar, Y. et al. Splicing factor ESRP1 controls ER-positive breast cancer by altering metabolic pathways. EMBO Rep. 20, e46078 (2019).

    Article  Google Scholar 

  53. Qiao, F.-H., Tu, M. & Liu, H.-Y. Role of MALAT1 in gynecological cancers: pathologic and therapeutic aspects. Oncol. Lett. 21, 333 (2021).

  54. Chen, Q., Zhu, C. & Jin, Y. The oncogenic and tumor suppressive functions of the long noncoding RNA MALAT1: an emerging controversy. Front. Genet. 11, 93 (2020).

    Article  Google Scholar 

  55. Dumitrascu, B., Villar, S., Mixon, D. G. & Engelhardt, B. E. Optimal marker gene selection for cell type discrimination in single cell analyses. Nat. Commun. 12, 1186 (2021).

    Article  Google Scholar 

  56. Chen, X., Chen, S. & Thomson, M. Minimal gene set discovery in single-cell mRNA-seq datasets with ActiveSVM. Nat. Comput. Sci. 2, 387–398 (2022).

    Article  Google Scholar 

  57. Kreutz, C. et al. Encyclopedia of Systems 1576–1579 (Springer, 2013).

  58. Fox, Z. R., Neuert, G. & Munsky, B. Optimal design of single-cell experiments within temporally fluctuating environments. Complexity https://doi.org/10.1155/2020/8536365 (2020).

  59. Carilli, M., Gorin, G., Choi, Y., Chari, T. & Pachter, L. Biophysical modeling with variational autoencoders for bimodal, single-cell RNA sequencing data. Nat. Methods, 21, 1466–1469 (2024).

  60. Sukys, A., Öcal, K. & Grima, R. Approximating solutions of the Chemical Master equation using neural networks. iScience 25, 105010 (2022).

    Article  Google Scholar 

  61. Gorin, G., Carilli, M., Chari, T. & Pachter, L. Spectral neural approximations for models of transcriptional dynamics. Biophys. J. 123, 2892–2901 (2024).

  62. Gorin, G., Vastola, J. J., Fang, M. & Pachter, L. Interpretable and tractable models of transcriptional noise for the rational design of single-molecule quantification experiments. Nat. Commun. 13, 7620 (2022).

    Article  Google Scholar 

  63. Felce, C., Gorin, G. & Pachter, L. A Biophysical model for ATAC-seq data analysis. Preprint at bioRxiv https://doi.org/10.1101/2024.01.25.577262 (2024).

  64. Friedman, N., Cai, L. & Xie, X. S. Stochasticity in gene expression as observed by single-molecule experiments in live cells. Israel J. Chem. 49, 333–342 (2009).

  65. Gorin, G. & Pachter, L. Monod: mechanistic analysis of single-cell RNA sequencing count data. Preprint at bioRxiv https://doi.org/10.1101/2022.06.11.495771 (2022).

  66. Larsson, A. J. M. et al. Genomic encoding of transcriptional burst kinetics. Nature 565, 251–254 (2019).

    Article  Google Scholar 

  67. Bray, N. L., Pimentel, H., Melsted, P. & Pachter, L. Erratum: near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol. 34, 888 (2016).

  68. Melsted, P. et al. Modular, efficient and constant-memory single-cell RNA-seq preprocessing. Nat Biotechnol. 39, 813–818 (2021).

  69. Jiang, S. et al. Cell Taxonomy: a curated repository of cell types with multifaceted characterization. Nucleic Acids Res. 51, D853–D860 (2023).

    Article  Google Scholar 

  70. Chari, T. meK-means all benchmark and simulation datasets. CaltechDATA https://doi.org/10.22002/v4gg9-qsr24 (2024).

  71. Chari, T. & Pachter, L. pachterlab/CGP_2023: meK-means repo DOI (v1.0.0). Zenodo https://doi.org/10.5281/zenodo.13253144 (2024).

Download references

Acknowledgements

We thank M. Fang, P. Bhat, C. Felce and L. Luebbert for their helpful feedback on the manuscript and visualizations, and Á. Gálvez-Merchán for their feedback on gene selection in PBMC (blood cell) data. T.C., G.G. and L.P. were funded, in part, by NIH (grant no. 5UM1HG012077-02). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Author information

Authors and Affiliations

Authors

Contributions

T.C. and G.G. conceived the idea for biophysical, multimodal clustering. T.C. developed the algorithm, performed the computations and generated the results and figures. T.C., G.G. and L.P. contributed to interpretation of the results. L.P. supervised the project. All authors discussed the results and contributed to writing and editing the manuscript.

Corresponding author

Correspondence to Lior Pachter.

Ethics declarations

Competing interests

G.G. is an employee of Fauna Bio. The other authors declare no competing interests.

Peer review

Peer review information

Nature Computational Science thanks the anonymous reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editor: Jie Pan, in collaboration with the Nature Computational Science team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Length-Bias Model in meK-means.

a) High-level diagram of Input and Output of meK-means (from multimodal data to a matrix of cluster x gene x parameters). meK-means fits data to the Length-Bias Model of transcription, with transcription rate k, mRNA burst size b, splicing rate β, and degradation of mRNA γ. b) Detailed outline of the Length-Bias CME Model. Rates per gene g denoted. Model includes length-dependent technical sampling (C, λ, p) of the biological molecules (Nu, Ns) produced by the transcription processes, which occurs during the sequencing process. Length-dependent capture produces the counts Mu, Ms and the final sequencing-based sampling produces the observed counts U, S, representing the final cell x gene count matrices. Created with BioRender.com.

Supplementary information

Supplementary Information

Algorithm 1, Table 1, Figs. 1–9 and Note 1.

Reporting Summary

Peer Review File

Source data

Source Data Fig. 2

Raw values (csvs) for each plot in Fig. 2.

Source Data Fig. 3

Raw values (csvs) for each plot in Fig. 3.

Source Data Fig. 4

Raw values (csvs) for each plot in Fig. 4.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Chari, T., Gorin, G. & Pachter, L. Biophysically interpretable inference of cell types from multimodal sequencing data. Nat Comput Sci 4, 677–689 (2024). https://doi.org/10.1038/s43588-024-00689-2

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1038/s43588-024-00689-2

Search

Quick links

Nature Briefing

Sign up for the Nature Briefing newsletter — what matters in science, free to your inbox daily.

Get the most important science stories of the day, free in your inbox. Sign up for Nature Briefing