Differential abundance testing on single-cell data using k-nearest neighbor graphs

Dann, Emma; Henderson, Neil C.; Teichmann, Sarah A.; Morgan, Michael D.; Marioni, John C.

doi:10.1038/s41587-021-01033-z

Published: 30 September 2021

Nature Biotechnology volume 40, pages 245–253 (2022)

Abstract

Current computational workflows for comparative analyses of single-cell datasets typically use discrete clusters as input when testing for differential abundance among experimental conditions. However, clusters do not always provide the appropriate resolution and cannot capture continuous trajectories. Here we present Milo, a scalable statistical framework that performs differential abundance testing by assigning cells to partially overlapping neighborhoods on a k-nearest neighbor graph. Using simulations and single-cell RNA sequencing (scRNA-seq) data, we show that Milo can identify perturbations that are obscured by discretizing cells into clusters, that it maintains false discovery rate control across batch effects and that it outperforms alternative differential abundance testing strategies. Milo identifies the decline of a fate-biased epithelial precursor in the aging mouse thymus and identifies perturbations to multiple lineages in human cirrhotic liver. As Milo is based on a cell–cell similarity structure, it might also be applicable to single-cell data other than scRNA-seq. Milo is provided as an open-source R software package at https://github.com/MarioniLab/miloR.
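The neighborhood-counting idea described in the abstract can be illustrated with a minimal sketch. It assumes, for illustration only, that a neighborhood is an index cell plus its k nearest neighbors found by brute-force Euclidean search; the function names `knn_indices` and `neighborhood_counts` and the toy data are hypothetical and are not part of miloR. The sketch stops at the per-sample count table and does not reproduce Milo's neighborhood sampling or its statistical test.

```python
import numpy as np

def knn_indices(X, k):
    """For each row of X, return the indices of its k nearest other rows."""
    # Pairwise squared Euclidean distances (brute force, fine for toy data).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)  # a cell is not its own neighbor
    return np.argsort(d2, axis=1)[:, :k]

def neighborhood_counts(X, sample_ids, index_cells, k):
    """Count cells per sample in each neighborhood (index cell + its k neighbors)."""
    nn = knn_indices(X, k)
    samples = np.unique(sample_ids)
    counts = np.zeros((len(index_cells), len(samples)), dtype=int)
    for row, i in enumerate(index_cells):
        members = np.append(nn[i], i)  # neighborhoods may overlap between index cells
        for col, s in enumerate(samples):
            counts[row, col] = np.sum(sample_ids[members] == s)
    return samples, counts

# Toy data: six cells on a line, drawn from two hypothetical samples "A" and "B".
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
sample_ids = np.array(["A", "A", "B", "B", "B", "A"])
samples, counts = neighborhood_counts(X, sample_ids, index_cells=[1, 4], k=2)
print(samples.tolist())  # column order of the count table
print(counts.tolist())   # one row per neighborhood
```

A table of this shape (neighborhoods by samples) is the kind of input on which a count-based differential abundance test across conditions could then be run.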

**Fig. 1: Detecting perturbed cell states as differentially abundant graph neighborhoods.**

**Fig. 2: Milo outperforms alternative differential abundance testing approaches and controls for false discoveries in the presence of batch effects.**

**Fig. 3: Milo efficiently scales to large datasets.**

**Fig. 4: Milo identifies the decline of a fate-biased precursor in the aging mouse thymus.**

**Fig. 5: Milo identifies the compositional disorder in cirrhotic liver.**

Code availability

Milo is implemented as an open-source package in R (https://github.com/MarioniLab/miloR) and is installable from Bioconductor (≥3.13; http://www.bioconductor.org/packages/release/bioc/html/miloR.html). Code used to generate figures and perform analyses can be found at https://github.com/MarioniLab/milo_analysis_2020.

Acknowledgements

We thank S. Ghazanfar for feedback on the method; N. Kumasaka for comments on the manuscript; C. Suo, V. Kedlian, R. Elmentaite, J. P. Pett, K. Tuong and B. Stewart for feedback on the software package; and D. Burkhardt, M. Luecken and W. Lewis for discussions on benchmarking. J.C.M. acknowledges core funding from the European Molecular Biology Laboratory and core funding from Cancer Research UK (C9545/A29580), which supports M.D.M. E.D. and S.A.T. acknowledge Wellcome Sanger core funding (WT206194). N.C.H. is supported by a Wellcome Trust Senior Research Fellowship in Clinical Science (ref. 219542/Z/19/Z), the Medical Research Council and a Chan Zuckerberg Initiative Seed Network Grant.

Author information

Authors and Affiliations

Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
Emma Dann, Sarah A. Teichmann & John C. Marioni
Centre for Inflammation Research, The Queen’s Medical Research Institute, University of Edinburgh, Edinburgh, UK
Neil C. Henderson
MRC Human Genetics Unit, Institute of Genetics and Molecular Medicine, University of Edinburgh, Edinburgh, UK
Neil C. Henderson
Theory of Condensed Matter Group, The Cavendish Laboratory, University of Cambridge, Cambridge, UK
Sarah A. Teichmann
European Molecular Biology Laboratory European Bioinformatics Institute, Hinxton, Cambridge, UK
Michael D. Morgan & John C. Marioni
Cancer Research UK Cambridge Institute, Li Ka Shing Centre, University of Cambridge, Cambridge, UK
Michael D. Morgan & John C. Marioni

Authors

Emma Dann
Neil C. Henderson
Sarah A. Teichmann
Michael D. Morgan
John C. Marioni

Contributions

E.D., M.D.M. and J.C.M. conceived the method idea. E.D. and M.D.M. developed the method, wrote the code and performed analyses. E.D., M.D.M., S.A.T. and N.C.H. interpreted the results. E.D., M.D.M., S.A.T., N.C.H. and J.C.M. wrote and approved the manuscript. M.D.M. and J.C.M. oversaw the project.

Corresponding authors

Correspondence to Michael D. Morgan or John C. Marioni.

Ethics declarations

Competing interests

In the last three years, S.A.T. has received remuneration for consulting and Scientific Advisory Board membership from Genentech, Roche, Biogen, ForesiteLabs and Qiagen. All other authors have no competing interests to declare.

Additional information

Peer review information Nature Biotechnology thanks Dana Pe’er, Michael Love and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Benchmarking DA methods on simulated data.

DA analysis performance on KNN graphs from simulated datasets of different topologies: (a) discrete clusters (2700 cells, 3 populations); (b) 1-D linear trajectory (7500 cells, 7 populations); (c) branching trajectory (7500 cells, 10 populations). Boxplots show the median with interquartile ranges (25–75%); whiskers extend to the most extreme value within 1.5x the interquartile range from the box, with outlier data points shown beyond this range.
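The boxplot convention used in these captions can be made concrete with a short sketch. It assumes quartiles computed by NumPy's default linear interpolation, which may differ slightly from the plotting software used for the figures; the function name `box_stats` and the example values are hypothetical.

```python
import numpy as np

def box_stats(values):
    """Median, quartiles, whisker ends and outliers under the 1.5x IQR rule."""
    v = np.asarray(values, dtype=float)
    q1, med, q3 = np.percentile(v, [25, 50, 75])
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = v[(v >= low_fence) & (v <= high_fence)]
    outside = v[(v < low_fence) | (v > high_fence)]
    return {
        "median": float(med),
        "q1": float(q1),
        "q3": float(q3),
        # Whiskers end at the most extreme observed values inside the fences.
        "whisker_low": float(inside.min()),
        "whisker_high": float(inside.max()),
        "outliers": sorted(outside.tolist()),
    }

stats = box_stats([1, 2, 3, 4, 100])
print(stats)
```

In this example the value 100 lies beyond the upper fence (4 + 1.5 × 2 = 7), so it is reported as an outlier and the upper whisker stops at 4.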

Extended Data Fig. 2 Sensitivity of DA methods to low fold change in abundance.

(a) True positive rate (TPR, top) and false positive rate (FPR, bottom) of DA methods calculated on cells in different bins of P(C1) used to generate condition labels (bin size = 0.05, the number on the x-axis indicates the lower value in the bin). The results for 36 simulations on 2 representative populations (colors) are shown. The filled points indicate the mean of each P(C1) bin. (b) Variability in Milo power is explained by the fraction of true positive cells close to the DA threshold for definition of ground truth. Example distributions of P(C1) for cells detected as true positives (TP) or false negatives (FN) by Milo. Examples for simulations on 2 populations (rows) and 3 simulated fold changes (columns) are shown. (c, d) True Positive Rate (TPR) of DA detection for simulated DA regions of increasing size centred at the same centroid (Erythroid2 (c) and Caudal neuroectoderm (d)). Results for 3 condition simulations per population and fold change are shown.

Extended Data Fig. 3 Comparison of Milo and MELD for abundance fold change estimation.

(a–d) Scatter-plots of the true fold change at the neighbourhood index against the fold change estimated by Milo (a, c) and MELD (b, d), without batch effect (a, b) and with batch effect (magnitude = 0.5) (c, d), where LFC = log(pc’/(1 - pc’)). The neighbourhoods overlapping true DA cells (pc’ greater than the 75% quantile of P(C1) in the mouse gastrulation dataset) are highlighted in red. (e, f) Mean Squared Error (MSE) comparison for MELD and Milo for true negative neighbourhoods (e) and true positive neighbourhoods (f), with increasing simulated log-Fold Change and magnitude of batch effect. Each boxplot summarises the results for n=27 simulations. Boxplots show the median with interquartile ranges (25–75%); whiskers extend to the most extreme value within 1.5x the interquartile range from the box, with outlier data points shown beyond this range.
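As a worked illustration of the two quantities in this caption, the sketch below computes LFC = log(p/(1 - p)) and a mean squared error between true and estimated values. It assumes the natural logarithm, which the caption does not specify; the function names and the example numbers are hypothetical and are not taken from the paper's analysis code.

```python
import math

def logit_lfc(p):
    """LFC = log(p / (1 - p)) for a condition probability p strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))  # natural log is an assumption

def mean_squared_error(true_vals, est_vals):
    """Average squared difference between paired true and estimated values."""
    if len(true_vals) != len(est_vals) or len(true_vals) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - e) ** 2 for t, e in zip(true_vals, est_vals)) / len(true_vals)

# A probability of 0.5 means no enrichment; 0.75 corresponds to 3:1 odds.
lfc_balanced = logit_lfc(0.5)
lfc_enriched = logit_lfc(0.75)
mse_example = mean_squared_error([0.0, 1.0], [0.0, 3.0])
print(lfc_balanced, lfc_enriched, mse_example)
```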

Extended Data Fig. 4 Controlling for batch effects in differential abundance analysis.

(a) In silico batch correction enhances the performance of DA methods in the presence of batch effects: comparison of performance of DA methods with no batch effect, with batch effects of increasing magnitude corrected with MNN, and uncorrected batch effects. Each boxplot summarises results from simulations on n=9 populations. (b) True Positive Rate (TPR, left) and False Discovery Rate (FDR, right) for recovery of cells in simulated DA regions for DA populations with increasing batch effect magnitude on the mouse gastrulation dataset. For each boxplot, results from 8 populations and 3 condition simulations per population are shown (n=24 simulations). Each panel represents a different DA method and a different simulated log-Fold Change. (c) Comparison of Milo performance with (~ batch + condition) or without (~ condition) accounting for the simulated batch in the NB-GLM. For each boxplot, results from 8 populations, simulated fold change > 1.5 and 3 condition simulations per population and fold change are shown (72 simulations per boxplot). In all panels, boxplots show the median with interquartile ranges (25–75%); whiskers extend to the most extreme value within 1.5x the interquartile range from the box, with outlier data points shown beyond this range.
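The performance measures named in these captions (TPR, FPR and FDR) follow the standard confusion-matrix definitions. The sketch below computes them from per-cell ground-truth and detection labels; the function name `detection_rates` and the example labels are hypothetical, and the benchmarking code for the paper may define the ground truth differently.

```python
def detection_rates(truth, called):
    """Return TPR, FPR and FDR from paired boolean truth/call labels."""
    if len(truth) != len(called):
        raise ValueError("truth and called must have the same length")
    tp = sum(1 for t, c in zip(truth, called) if t and c)
    fp = sum(1 for t, c in zip(truth, called) if not t and c)
    fn = sum(1 for t, c in zip(truth, called) if t and not c)
    tn = sum(1 for t, c in zip(truth, called) if not t and not c)
    nan = float("nan")  # returned when a rate's denominator is zero
    return {
        "TPR": tp / (tp + fn) if (tp + fn) else nan,  # share of true DA cells detected
        "FPR": fp / (fp + tn) if (fp + tn) else nan,  # share of non-DA cells called DA
        "FDR": fp / (tp + fp) if (tp + fp) else nan,  # share of DA calls that are false
    }

# Eight cells: three are truly DA, three are called DA, two calls are correct.
truth = [True, True, True, False, False, False, False, False]
called = [True, True, False, True, False, False, False, False]
rates = detection_rates(truth, called)
print(rates)
```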

Supplementary information

Supplementary Figs. 1–7, Supplementary Tables 1–3 and Supplementary Notes

Reporting Summary

About this article

Cite this article

Dann, E., Henderson, N.C., Teichmann, S.A. et al. Differential abundance testing on single-cell data using k-nearest neighbor graphs. Nat Biotechnol 40, 245–253 (2022). https://doi.org/10.1038/s41587-021-01033-z

Received: 20 November 2020
Accepted: 26 July 2021
Published: 30 September 2021
Issue Date: February 2022
DOI: https://doi.org/10.1038/s41587-021-01033-z
