Abstract
Measuring complete gene expression profiles for a large number of experiments is costly. We propose an approach in which a small subset of probes is selected based on a preliminary set of full expression profiles. In subsequent experiments, only the subset is measured, and the missing values are imputed. We developed several algorithms to simultaneously select probes and impute missing values, and we demonstrate that these 'probe selection for imputation' (PSI) algorithms can successfully reconstruct missing gene expression values in a wide variety of applications, as evaluated using multiple metrics of biological importance. We analyze the performance of PSI methods under varying conditions, provide guidelines for choosing the optimal method based on the experimental setting, and indicate how to estimate imputation accuracy. Finally, we apply our approach to a large-scale study of immune system variation.
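The select-then-impute workflow described in the abstract can be sketched as follows. This is a minimal illustration and not the authors' PSI algorithms: it assumes greedy forward selection with ridge regression as a stand-in for them. The function names (`fit_imputer`, `impute`, `greedy_select`), the `ridge` parameter and the synthetic low-rank data are all hypothetical and chosen only for this example.

```python
import numpy as np

def _design(X_sel):
    # Append a constant column so the linear model has an intercept.
    return np.hstack([X_sel, np.ones((X_sel.shape[0], 1))])

def fit_imputer(X_sel, X_all, ridge=1e-3):
    """Ridge regression from the selected probes to every probe.

    For simplicity the penalty is applied to all coefficients,
    including the intercept.
    """
    A = _design(X_sel)
    G = A.T @ A + ridge * np.eye(A.shape[1])
    return np.linalg.solve(G, A.T @ X_all)  # shape: (k + 1, n_probes)

def impute(W, X_sel):
    """Predict all probes from measurements of the selected probes."""
    return _design(X_sel) @ W

def greedy_select(X_train, k, ridge=1e-3):
    """Greedy forward selection (an assumed stand-in, not the PSI method):
    repeatedly add the probe that most reduces training reconstruction error."""
    selected = []
    for _ in range(k):
        best, best_err = None, np.inf
        for j in range(X_train.shape[1]):
            if j in selected:
                continue
            cols = selected + [j]
            W = fit_imputer(X_train[:, cols], X_train, ridge)
            err = np.mean((impute(W, X_train[:, cols]) - X_train) ** 2)
            if err < best_err:
                best, best_err = j, err
        selected.append(best)
    return selected

# Synthetic data (hypothetical): 30 probes driven by 3 latent factors.
rng = np.random.default_rng(0)
n_samples, n_probes, n_factors = 100, 30, 3
Z = rng.normal(size=(n_samples, n_factors))
loadings = rng.normal(size=(n_factors, n_probes))
X = Z @ loadings + 0.05 * rng.normal(size=(n_samples, n_probes))

# Preliminary experiments measure every probe; later ones only the subset.
X_train, X_test = X[:60], X[60:]
selected = greedy_select(X_train, k=3)
W = fit_imputer(X_train[:, selected], X_train)
X_hat = impute(W, X_test[:, selected])

# Evaluate only on the probes that were not measured in the later experiments.
rest = [j for j in range(n_probes) if j not in selected]
mse_imputed = np.mean((X_hat[:, rest] - X_test[:, rest]) ** 2)
mse_baseline = np.mean((X_train[:, rest].mean(axis=0) - X_test[:, rest]) ** 2)
print(selected, mse_imputed, mse_baseline)
```

The baseline here predicts each unmeasured probe by its mean in the preliminary data, which gives a simple reference point for judging whether the selected subset carries useful information about the remaining probes.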
Acknowledgements
This work was supported by grant RC2 GM093080 (funded through the American Recovery and Reinvestment Act) from the US National Institutes of Health–National Institute of General Medical Sciences to C.B. and D.K. We thank I. Amit and J. Ye for useful comments on this manuscript.
Contributions
Y.D. and D.K. designed the methods; Y.D. implemented the methods, wrote the code, performed the experiments and analyzed the data; T.F. and C.B. provided data and gave feedback on the results; Y.D. and D.K. wrote the manuscript; C.B. reviewed and commented on the manuscript.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–8, Supplementary Table 1, Supplementary Results and Supplementary Note (PDF 1946 kb)
Supplementary Data
The data sets used in the experiments described in the paper, in the file format used by the PSI software. (ZIP 11389 kb)
Supplementary Software
PSI software. (ZIP 16 kb)
About this article
Cite this article
Donner, Y., Feng, T., Benoist, C. et al. Imputing gene expression from selectively reduced probe sets. Nat Methods 9, 1120–1125 (2012). https://doi.org/10.1038/nmeth.2207
DOI: https://doi.org/10.1038/nmeth.2207
This article is cited by
- Tradict enables accurate prediction of eukaryotic transcriptional states from 100 marker genes. Nature Communications (2017)
- Profiling the transcriptome with RNA SPOTs. Nature Methods (2017)
- Plasmodium falciparum gene expression measured directly from tissue during human infection. Genome Medicine (2014)