Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks

Trapnell, Cole; Roberts, Adam; Goff, Loyal; Pertea, Geo; Kim, Daehwan; Kelley, David R; Pimentel, Harold; Salzberg, Steven L; Rinn, John L; Pachter, Lior

doi:10.1038/nprot.2012.016

Protocol
Published: 01 March 2012

Nature Protocols volume 7, pages 562–578 (2012)

A Corrigendum to this article was published on 25 September 2014

Abstract

Recent advances in high-throughput cDNA sequencing (RNA-seq) can reveal new genes and splice variants and quantify expression genome-wide in a single assay. The volume and complexity of data from RNA-seq experiments necessitate scalable, fast and mathematically principled analysis software. TopHat and Cufflinks are free, open-source software tools for gene discovery and comprehensive expression analysis of high-throughput mRNA sequencing (RNA-seq) data. Together, they allow biologists to identify new genes and new splice variants of known ones, as well as compare gene and transcript expression under two or more conditions. This protocol describes in detail how to use TopHat and Cufflinks to perform such analyses. It also covers several accessory tools and utilities that aid in managing data, including CummeRbund, a tool for visualizing RNA-seq analysis results. Although the procedure assumes basic informatics skills, these tools assume little to no background with RNA-seq analysis and are meant for novices and experts alike. The protocol begins with raw sequencing reads and produces a transcriptome assembly, lists of differentially expressed and regulated genes and transcripts, and publication-quality visualizations of analysis results. The protocol's execution time depends on the volume of transcriptome sequencing data and available computing resources but takes less than 1 d of computer time for typical experiments and ∼1 h of hands-on time.
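The order of the stages described above (align reads, assemble transcripts, merge assemblies, test for differential expression) can be sketched as a list of command lines. This is only an illustration, not the protocol's own script: the flags, the index and annotation names (`genome`, `genes.gtf`, `genome.fa`, `assemblies.txt`), the output directory names and the thread count are assumptions that do not appear in this excerpt, and the function `build_pipeline` is hypothetical. Nothing is executed, and the final CummeRbund visualization step, which runs in R, is not covered.

```python
from typing import List

# Two conditions with three replicates each, following the sample names
# mentioned in the change history (C1/C2, R1-R3).
CONDITIONS = ["C1", "C2"]
REPLICATES = ["R1", "R2", "R3"]


def build_pipeline(threads: int = 8) -> List[List[str]]:
    """Return the ordered command lines for the workflow without running them.

    All flags and file names are assumptions made for illustration.
    """
    samples = [f"{c}_{r}" for c in CONDITIONS for r in REPLICATES]
    commands: List[List[str]] = []

    # Stage 1: align each sample's paired-end reads with TopHat.
    for s in samples:
        commands.append(["tophat", "-p", str(threads), "-G", "genes.gtf",
                         "-o", f"{s}_thout", "genome",
                         f"{s}_1.fq", f"{s}_2.fq"])

    # Stage 2: assemble transcripts for each sample with Cufflinks.
    for s in samples:
        commands.append(["cufflinks", "-p", str(threads),
                         "-o", f"{s}_clout",
                         f"{s}_thout/accepted_hits.bam"])

    # Stage 3: merge the per-sample assemblies with the reference annotation.
    # assemblies.txt is assumed to list each sample's assembled GTF file.
    commands.append(["cuffmerge", "-g", "genes.gtf", "-s", "genome.fa",
                     "-p", str(threads), "assemblies.txt"])

    # Stage 4: compare conditions; replicates of one condition are joined
    # by commas, and conditions are passed as separate arguments.
    bams = [",".join(f"{c}_{r}_thout/accepted_hits.bam" for r in REPLICATES)
            for c in CONDITIONS]
    commands.append(["cuffdiff", "-o", "diff_out", "-b", "genome.fa",
                     "-p", str(threads), "-L", ",".join(CONDITIONS),
                     "-u", "merged_asm/merged.gtf", *bams])
    return commands


if __name__ == "__main__":
    for cmd in build_pipeline():
        print(" ".join(cmd))
```

Building the commands as argument lists rather than shell strings keeps the sketch easy to inspect and would let each entry be passed to `subprocess.run` unchanged if the tools are installed.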

**Figure 1: Software components used in this protocol.**

**Figure 2: An overview of the Tuxedo protocol.**

**Figure 3: Merging sample assemblies with a reference transcriptome annotation.**

**Figure 4: Analyzing groups of transcripts identifies differentially regulated genes.**

**Figure 5: CummeRbund helps users rapidly explore their expression data and create publication-ready plots of differentially expressed and regulated genes.**

**Figure 6: CummeRbund plots of the expression level distribution for all genes in simulated experimental conditions C1 and C2.**

**Figure 7: CummeRbund scatter plots highlight general similarities and specific outliers between conditions C1 and C2.**

**Figure 9: Differential analysis results for *regucalcin*.**

**Figure 10: Differential analysis results for *Rala*.**

Accession codes

Accessions

Gene Expression Omnibus

GSE32038

Change history

07 August 2014
In the version of this article initially published, the computer script in Box 1 sections B and C, and in Procedure Step 1, contained errors: the last section of the final three lines of the script had ‘C1’ where it should have been ‘C2’, as follows: C1_R1_2.fq should have been C2_R1_2.fq; C1_R2_2.fq should have been C2_R2_2.fq; and C1_R3_2.fq should have been C2_R3_2.fq. Users are also directed to an official release version of Cufflinks (version 1.3.0) that produces nearly identical results to those shown in the manuscript, which were produced by Cufflinks 1.2.1 (an unofficial and undocumented development build that was the latest build available when the manuscript was originally written). The script in Procedure Step 16 and the data in Table 5 have been updated to reflect the output of version 1.3.0. The errors have been corrected in the HTML and PDF versions of the article.
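The file-name error described in this correction (a C2 sample paired with a C1 mate file) is the kind of mistake a small consistency check can catch before any alignment is run. The sketch below is a hypothetical helper, not part of the published script: it assumes read files follow the pattern `<condition>_<replicate>_<mate>.fq`, with the first mate named `_1.fq` (only the `_2.fq` names appear in the correction itself).

```python
import re
from typing import Tuple

# Assumed naming pattern: condition (C1, C2), replicate (R1-R3), mate (1 or 2).
READ_FILE = re.compile(r"^(C\d+)_(R\d+)_([12])\.fq$")


def mate_files(condition: str, replicate: str) -> Tuple[str, str]:
    """Derive both mate file names from a single sample label."""
    return (f"{condition}_{replicate}_1.fq", f"{condition}_{replicate}_2.fq")


def is_consistent_pair(left: str, right: str) -> bool:
    """True only if both files belong to the same condition and replicate,
    with the first file being mate 1 and the second mate 2."""
    l, r = READ_FILE.match(left), READ_FILE.match(right)
    if l is None or r is None:
        return False
    same_sample = (l.group(1), l.group(2)) == (r.group(1), r.group(2))
    return same_sample and l.group(3) == "1" and r.group(3) == "2"


if __name__ == "__main__":
    # The corrected pairing passes the check.
    print(is_consistent_pair(*mate_files("C2", "R1")))
    # The pairing described in the erratum fails it.
    print(is_consistent_pair("C2_R1_1.fq", "C1_R1_2.fq"))
```

Deriving both file names from one sample label, as `mate_files` does, avoids typing each name by hand, which is where a mismatch of this kind tends to arise.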

Acknowledgements

We are grateful to D. Hendrickson, M. Cabili and B. Langmead for helpful technical discussions. The TopHat and Cufflinks projects are supported by US National Institutes of Health grants R01-HG006102 (to S.L.S.) and R01-HG006129-01 (to L.P.). C.T. is a Damon Runyon Cancer Foundation Fellow. L.G. is a National Science Foundation Postdoctoral Fellow. A.R. is a National Science Foundation Graduate Research Fellow. J.L.R. is a Damon Runyon-Rachleff, Searle, and Smith Family Scholar, and is supported by Director's New Innovator Awards (1DP2OD00667-01). This work was funded in part by the Center of Excellence in Genome Science from the US National Human Genome Research Institute (J.L.R.). J.L.R. is an investigator of the Merkin Foundation for Stem Cell Research at the Broad Institute.

Author information

Authors and Affiliations

Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA
Cole Trapnell, Loyal Goff, David R Kelley & John L Rinn
Department of Stem Cell and Regenerative Biology, Harvard University, Cambridge, Massachusetts, USA
Cole Trapnell, Loyal Goff, David R Kelley & John L Rinn
Department of Computer Science, University of California, Berkeley, California, USA
Adam Roberts, Harold Pimentel & Lior Pachter
Department of Electrical Engineering and Computer Science, Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
Loyal Goff
Department of Medicine, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
Geo Pertea, Daehwan Kim & Steven L Salzberg
Department of Biostatistics, Johns Hopkins University, Baltimore, Maryland, USA
Geo Pertea & Steven L Salzberg
Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, USA
Daehwan Kim
Department of Mathematics, University of California, Berkeley, California, USA
Lior Pachter
Department of Molecular and Cell Biology, University of California, Berkeley, California, USA
Lior Pachter

Contributions

C.T. is the lead developer for the TopHat and Cufflinks projects. L.G. designed and wrote CummeRbund. D.K., H.P. and G.P. are developers of TopHat. A.R. and G.P. are developers of Cufflinks and its accompanying utilities. C.T. developed the protocol, generated the example experiment and performed the analysis. L.P., S.L.S. and C.T. conceived the TopHat and Cufflinks software projects. C.T., D.R.K. and J.L.R. wrote the manuscript.

Corresponding author

Correspondence to Cole Trapnell.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

About this article

Cite this article

Trapnell, C., Roberts, A., Goff, L. et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc 7, 562–578 (2012). https://doi.org/10.1038/nprot.2012.016

Published: 01 March 2012
Issue Date: March 2012
DOI: https://doi.org/10.1038/nprot.2012.016
