Abstract
Modern RNA-sequencing (RNA-seq) protocols can produce multi-end data, where multiple reads originating from the same transcript are attached to the same barcode. The long-range information in the multi-end reads is beneficial in phasing complicated spliced isoforms, but assembly algorithms that leverage such information are lacking. Here we introduce Scallop2, a reference-based assembler optimized for multi-end RNA-seq data. The algorithmic core of Scallop2 consists of three steps: (1) using an algorithm to ‘bridge’ multi-end reads into single-end phasing paths in the context of a splice graph, (2) employing a method to refine erroneous splice graphs by utilizing multi-end reads that fail to bridge and (3) piping the refined splice graph and bridged phasing paths into an algorithm that integrates multiple phase-preserving decompositions. Tested on 561 cells in two Smart-seq3 datasets and on ten Illumina paired-end RNA-seq samples, Scallop2 substantially improves the assembly accuracy compared with two popular assemblers (StringTie2 and Scallop).
Data availability
Smart-seq3 is a single-cell protocol that generates multi-end RNA-seq data using barcoding technology. The Smart-seq3 data used in this study were published with the Smart-seq3 protocol (ref. 17) and are publicly available at https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-8735. We use two datasets from this resource: (1) HEK293T, containing 192 human cells, and (2) Mouse-Fibroblast, containing 369 mouse cells. For the Illumina platform, we use ten human paired-end RNA-seq samples downloaded from the ENCODE project (ref. 22); we refer to these samples as ENCODE10. Their accession IDs are SRR307903, SRR307911, SRR315323, SRR315334, SRR387661, SRR534291, SRR534307, SRR534319, SRR545695 and SRR545723. The alignments of these samples are available at the Penn State Data Commons repository (ref. 23). Source data are provided with this paper.
Code availability
Scallop2 is available at the Zenodo repository (ref. 24) and on GitHub (https://github.com/Shao-Group/scallop2). Scripts and documentation that reproduce the experimental results are available at the Zenodo repository (ref. 25) and on GitHub (https://github.com/Shao-Group/scallop2-test).
References
1. Trapnell, C. et al. Transcript assembly and quantification by RNA-seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515 (2010).
2. Guttman, M. et al. Ab initio reconstruction of cell type–specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nat. Biotechnol. 28, 503–510 (2010).
3. Tomescu, A. I., Kuosmanen, A., Rizzi, R. & Mäkinen, V. A novel min-cost flow method for estimating transcript expression with RNA-seq. BMC Bioinformatics 14, S15 (2013).
4. Song, L., Sabunciyan, S. & Florea, L. CLASS2: accurate and efficient splice variant annotation from RNA-seq reads. Nucleic Acids Res. 44, e98 (2016).
5. Liu, J., Yu, T., Jiang, T. & Li, G. TransComb: genome-guided transcriptome assembly via combing junctions in splicing graphs. Genome Biol. 17, 213 (2016).
6. Shao, M. & Kingsford, C. Accurate assembly of transcripts through phase-preserving graph decomposition. Nat. Biotechnol. 35, 1167–1169 (2017).
7. Pertea, M. et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol. 33, 290–295 (2015).
8. Kovaka, S. et al. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biol. 20, 278 (2019).
9. Mao, S., Pachter, L., Tse, D. & Kannan, S. RefShannon: a genome-guided transcriptome assembler using sparse flow decomposition. PLoS ONE 15, e0232946 (2020).
10. Tung, L. H., Shao, M. & Kingsford, C. Quantifying the benefit offered by transcript assembly with Scallop-LR on single-molecule long reads. Genome Biol. 20, 287 (2019).
11. Shao, M. & Kingsford, C. Theory and a heuristic for the minimum path flow decomposition problem. IEEE/ACM Trans. Comput. Biol. Bioinform. 16, 658–670 (2019).
12. Williams, L., Tomescu, A. & Mumey, B. M. Flow decomposition with subpath constraints. In 21st International Workshop on Algorithms in Bioinformatics (WABI 2021) Vol. 201 (eds Carbone, A. & El-Kebir, M.) 16.1–16.15 (2021).
13. Williams, L., Reynolds, G. & Mumey, B. RNA transcript assembly using inexact flows. In 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 1907–1914 (IEEE, 2019).
14. Amarasinghe, S. L. et al. Opportunities and challenges in long-read sequencing data analysis. Genome Biol. 21, 30 (2020).
15. Voshall, A. & Moriyama, E. N. in Bioinformatics in the Era of Post Genomics and Big Data (ed. Abdurakhmonov, I. Y.) 15–36 (IntechOpen, 2018).
16. Picelli, S. et al. Full-length RNA-seq from single cells using Smart-seq2. Nat. Protoc. 9, 171–181 (2014).
17. Hagemann-Jensen, M. et al. Single-cell RNA counting at allele and isoform resolution using Smart-seq3. Nat. Biotechnol. 38, 708–714 (2020).
18. Pertea, G. & Pertea, M. GFF utilities: GffRead and GffCompare. F1000Res 9, 304 (2020).
19. Kim, D. et al. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 14, R36 (2013).
20. Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
21. Kim, D., Langmead, B. & Salzberg, S. L. HISAT: a fast spliced aligner with low memory requirements. Nat. Methods 12, 357–360 (2015).
22. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
23. Shi, Q. & Shao, M. ENCODE10 dataset. Penn State Data Commons https://doi.org/10.26208/8c06-w247 (2020).
24. Zhang, Q., Shi, Q. & Shao, M. Code for Scallop2. Zenodo https://doi.org/10.5281/zenodo.6013717 (2022).
25. Zhang, Q., Shi, Q. & Shao, M. Code for Scallop2-test. Zenodo https://doi.org/10.5281/zenodo.6064927 (2022).
Acknowledgements
This work is partly supported by the US National Science Foundation (DBI-2019797 to M.S.), the US National Institutes of Health (R01HG011065 to M.S.) and the Charles K. Etner Career Development Professorship awarded to M.S. by The Pennsylvania State University. Initial algorithmic exploration of the Scallop2 advancements was conducted with C. Kingsford (Computational Biology Department, School of Computer Science, Carnegie Mellon University) and was supported by the Gordon and Betty Moore Foundation (GBMF 4554 to C. Kingsford) and the US National Institutes of Health (R01GM122935 to C. Kingsford).
Author information
Contributions
Q.Z., Q.S. and M.S. designed and implemented the algorithms. The experimental comparisons were primarily conducted by Q.Z. All the authors drafted and approved this manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Computational Science thanks Manuel Garber, Alexandru Tomescu and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Kaitlin McCardle, in collaboration with the Nature Computational Science team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Illustration of the construction of the splice graph \(G\) and the associated multi-end phasing paths \(\mathcal{C}\) from read alignments of a gene locus.
Inferred splice positions in the reference genome are marked with black bars. Exons and partial exons are labeled with numbers above the reference genome. Aligned reads shown in the same color form a group (that is, they carry the same barcode or are the two ends of a pair in paired-end RNA-seq data); gray marks an individual read that forms a group of its own. The read alignments are used as input to construct the splice graph and the associated multi-end phasing paths. Five (partial) exons (numbered 1–5) are identified from the given alignments. A source vertex \(s\) and a sink vertex \(t\) are added to the splice graph. Each arrow in the splice graph represents a directed edge, and the number on it is the edge's weight.
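The construction described in this caption can be sketched roughly as follows. This is a minimal Python illustration, not Scallop2's implementation: the function names, the input encoding (each read as a tuple of partial-exon indices, keyed by a barcode or pair identifier), and the weighting rule (an edge's weight is the number of reads supporting that adjacency; source and sink edges are weighted by the number of reads that start or end at the vertex) are all assumptions made for the example.

```python
from collections import defaultdict

def build_splice_graph(groups):
    """Build edge weights of a toy splice graph (hypothetical weighting rule)."""
    weights = defaultdict(int)
    starts, ends = defaultdict(int), defaultdict(int)
    has_in, has_out, covered = set(), set(), set()
    for reads in groups.values():
        for read in reads:
            covered.update(read)
            starts[read[0]] += 1
            ends[read[-1]] += 1
            # Each consecutive pair of exons in a read supports one edge.
            for u, v in zip(read, read[1:]):
                weights[(u, v)] += 1
                has_out.add(u)
                has_in.add(v)
    # Connect vertices lacking in-edges (out-edges) to the source s (sink t).
    for v in sorted(covered):
        if v not in has_in:
            weights[("s", v)] = starts[v]
        if v not in has_out:
            weights[(v, "t")] = ends[v]
    return dict(weights)

def phasing_paths(groups):
    """One multi-end phasing path per group: its reads in genomic order."""
    return [tuple(sorted(reads)) for reads in groups.values()]

# Toy input: two barcoded groups and one ungrouped read (all values made up).
groups = {"bc1": [(1, 2), (4,)], "bc2": [(2, 3, 4)], "solo": [(4, 5)]}
graph = build_splice_graph(groups)
paths = phasing_paths(groups)
```

Here `paths[0]` is the two-read phasing path `((1, 2), (4,))`, whose gap between vertices 2 and 4 is what a bridging step would later need to fill.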
Extended Data Fig. 2 Illustration of bridging a multi-end phasing path into a single-end phasing path.
Each arrow represents a directed edge, and the number on it is the edge's weight. Vertices in the multi-end phasing path \(C\) are circled in blue and its edges are drawn as blue arrows. The inferred bridging paths for \((l_1,l_2)\) and \((l_2,l_3)\) are marked in red. Given the splice graph and the multi-end phasing path \(C=(l_1=(1,2),l_2=(4),l_3=(6))\) as input, bridging produces the single-end phasing path \(h(C)=(1,2,3,4,6)\).
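To make the bridging step concrete, here is a small Python sketch that reproduces the example \(C=((1,2),(4),(6))\) and \(h(C)=(1,2,3,4,6)\) from this caption. Several things are assumptions rather than facts taken from the text: the edge weights are invented, vertices are assumed to be numbered in topological order, consecutive reads are assumed not to overlap, and each gap is filled with the path that maximizes the smallest edge weight along it (a bottleneck criterion chosen here for illustration).

```python
import math

def widest_path(weights, src, dst):
    """Path from src to dst maximizing the minimum edge weight (DAG, u < v)."""
    best = {src: math.inf}
    prev = {}
    for v in range(src + 1, dst + 1):
        for (u, x), w in weights.items():
            if x == v and u in best:
                cand = min(best[u], w)
                if cand > best.get(v, -math.inf):
                    best[v] = cand
                    prev[v] = u
    if dst not in best:
        return None  # no path: this gap cannot be bridged
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]

def bridge(weights, multi_end_path):
    """Join the reads of a multi-end phasing path into one vertex sequence."""
    result = list(multi_end_path[0])
    for nxt in multi_end_path[1:]:
        gap = widest_path(weights, result[-1], nxt[0])
        if gap is None:
            return None
        result.extend(gap[1:-1])  # interior vertices of the bridging path
        result.extend(nxt)
    return tuple(result)

# Hypothetical edge weights; only the topology mirrors the caption's example.
weights = {(1, 2): 7, (2, 3): 5, (3, 4): 4, (2, 4): 2,
           (4, 5): 1, (5, 6): 3, (4, 6): 6}
C = [(1, 2), (4,), (6,)]
h_C = bridge(weights, C)
```

With these weights the gap between 2 and 4 is filled through vertex 3 (bottleneck 4 versus 2 for the direct edge), and the gap between 4 and 6 uses the direct edge, so `h_C` equals `(1, 2, 3, 4, 6)`.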
Extended Data Fig. 3 Illustration of identifying false starting/ending vertices.
Inferred splice positions in the reference genome are marked with black bars. Exons and partial exons are labeled with numbers above the reference genome. Aligned reads shown in the same color form a group (that is, they carry the same barcode or are the two ends of a pair in paired-end RNA-seq data); gray marks an individual read that forms a group of its own. Four partial exons (numbered 1–4) are identified from the given alignments. A source vertex \(s\) and a sink vertex \(t\) are added to the splice graph. Each arrow represents a directed edge, and the number on it is the edge's weight. Circled vertex 2 (respectively, 3) is classified as an ending (respectively, starting) vertex because it has no departing (respectively, entering) junction. A pseudo-edge \((2,3)\) from circled vertex 2 to circled vertex 3 is added to the splice graph for bridging. Bridging the blue and red read groups (the second and third ones) crosses this pseudo-edge, giving a pseudo-score of 2 for both vertices.
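The pseudo-score counting in this caption can be illustrated with a short Python sketch. It assumes bridged paths are given as tuples of vertex indices and that a vertex's pseudo-score is simply the number of bridged paths crossing a pseudo-edge incident to it; the function name and the toy input are hypothetical, and how the score is then used to decide whether a vertex is a false start or end is not specified in this caption.

```python
from collections import Counter

def pseudo_scores(bridged_paths, pseudo_edges):
    """Count, per vertex, the bridged paths that cross an incident pseudo-edge."""
    scores = Counter()
    for path in bridged_paths:
        for edge in zip(path, path[1:]):
            if edge in pseudo_edges:
                u, v = edge
                scores[u] += 1  # u is the candidate false ending vertex
                scores[v] += 1  # v is the candidate false starting vertex
    return dict(scores)

# Toy input mirroring the caption: two groups bridge across pseudo-edge (2, 3),
# and one read stays entirely upstream of it.
bridged = [(1, 2, 3, 4), (1, 2, 3, 4), (1, 2)]
scores = pseudo_scores(bridged, {(2, 3)})
```

In this toy case both vertex 2 and vertex 3 receive a pseudo-score of 2, matching the count given in the caption.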
Supplementary information
Supplementary Information
Supplementary Figs. 1–15, Tables 1 and 2, Notes 1–8 and Algorithm 1.
Source data
Source Data Fig. 1
Statistical source data for Fig. 1.
About this article
Cite this article
Zhang, Q., Shi, Q. & Shao, M. Accurate assembly of multi-end RNA-seq data with Scallop2. Nat Comput Sci 2, 148–152 (2022). https://doi.org/10.1038/s43588-022-00216-1
DOI: https://doi.org/10.1038/s43588-022-00216-1