Accurate assembly of multi-end RNA-seq data with Scallop2

Zhang, Qimin; Shi, Qian; Shao, Mingfu

doi:10.1038/s43588-022-00216-1

Brief Communication
Published: 28 March 2022

Nature Computational Science volume 2, pages 148–152 (2022)

A preprint version of the article is available at bioRxiv.

Abstract

Modern RNA-sequencing (RNA-seq) protocols can produce multi-end data, where multiple reads originating from the same transcript are attached to the same barcode. The long-range information in the multi-end reads is beneficial in phasing complicated spliced isoforms, but assembly algorithms that leverage such information are lacking. Here we introduce Scallop2, a reference-based assembler optimized for multi-end RNA-seq data. The algorithmic core of Scallop2 consists of three steps: (1) using an algorithm to ‘bridge’ multi-end reads into single-end phasing paths in the context of a splice graph, (2) employing a method to refine erroneous splice graphs by utilizing multi-end reads that fail to bridge and (3) piping the refined splice graph and bridged phasing paths into an algorithm that integrates multiple phase-preserving decompositions. Tested on 561 cells in two Smart-seq3 datasets and on ten Illumina paired-end RNA-seq samples, Scallop2 substantially improves the assembly accuracy compared with two popular assemblers (StringTie2 and Scallop).

**Fig. 1: Comparison of assembly accuracy of Scallop2, StringTie2, Scallop and CLASS2.**

Data availability

Smart-seq3 is a single-cell protocol that generates multi-end RNA-seq data using barcoding technology. The Smart-seq3 data used in this study were published together with the Smart-seq3 protocol¹⁷ and are publicly available at https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-8735. We use two datasets from that release: (1) HEK293T, containing 192 human cells; and (2) Mouse-Fibroblast, containing 369 mouse cells. For the Illumina platform, we use ten human paired-end RNA-seq samples downloaded from the ENCODE project²², and we refer to these samples as ENCODE10. Their accession IDs are SRR307903, SRR307911, SRR315323, SRR315334, SRR387661, SRR534291, SRR534307, SRR534319, SRR545695 and SRR545723. The alignments of these samples are available at the Penn State Data Commons repository²³. Source data are provided with this paper.

Code availability

Scallop2 is available at the Zenodo repository²⁴ and GitHub (https://github.com/Shao-Group/scallop2). Scripts and documentation that reproduce the experimental results are available at the Zenodo repository²⁵ and GitHub (https://github.com/Shao-Group/scallop2-test).

References

1. Trapnell, C. et al. Transcript assembly and quantification by RNA-seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515 (2010).
2. Guttman, M. et al. Ab initio reconstruction of cell type–specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nat. Biotechnol. 28, 503–510 (2010).
3. Tomescu, A. I., Kuosmanen, A., Rizzi, R. & Mäkinen, V. A novel min-cost flow method for estimating transcript expression with RNA-seq. BMC Bioinformatics 14, S15 (2013).
4. Song, L., Sabunciyan, S. & Florea, L. CLASS2: accurate and efficient splice variant annotation from RNA-seq reads. Nucleic Acids Res. 44, e98 (2016).
5. Liu, J., Yu, T., Jiang, T. & Li, G. TransComb: genome-guided transcriptome assembly via combing junctions in splicing graphs. Genome Biol. 17, 213 (2016).
6. Shao, M. & Kingsford, C. Accurate assembly of transcripts through phase-preserving graph decomposition. Nat. Biotechnol. 35, 1167–1169 (2017).
7. Pertea, M. et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol. 33, 290–295 (2015).
8. Kovaka, S. et al. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biol. 20, 278 (2019).
9. Mao, S., Pachter, L., Tse, D. & Kannan, S. RefShannon: a genome-guided transcriptome assembler using sparse flow decomposition. PLoS ONE 15, e0232946 (2020).
10. Tung, L. H., Shao, M. & Kingsford, C. Quantifying the benefit offered by transcript assembly with Scallop-LR on single-molecule long reads. Genome Biol. 20, 287 (2019).
11. Shao, M. & Kingsford, C. Theory and a heuristic for the minimum path flow decomposition problem. IEEE/ACM Trans. Comput. Biol. Bioinform. 16, 658–670 (2019).
12. Williams, L., Tomescu, A. & Mumey, B. M. Flow decomposition with subpath constraints. In 21st International Workshop on Algorithms in Bioinformatics (WABI 2021) Vol. 201 (eds Carbone, A. & El-Kebir, M.) 16.1–16.15 (2021).
13. Williams, L., Reynolds, G. & Mumey, B. RNA transcript assembly using inexact flows. In 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 1907–1914 (IEEE, 2019).
14. Amarasinghe, S. L. et al. Opportunities and challenges in long-read sequencing data analysis. Genome Biol. 21, 30 (2020).
15. Voshall, A. & Moriyama, E. N. in Bioinformatics in the Era of Post Genomics and Big Data (ed. Abdurakhmonov, I. Y.) 15–36 (IntechOpen, 2018).
16. Picelli, S. et al. Full-length RNA-seq from single cells using Smart-seq2. Nat. Protoc. 9, 171–181 (2014).
17. Hagemann-Jensen, M. et al. Single-cell RNA counting at allele and isoform resolution using Smart-seq3. Nat. Biotechnol. 38, 708–714 (2020).
18. Pertea, G. & Pertea, M. GFF utilities: GffRead and GffCompare. F1000Res 9, 304 (2020).
19. Kim, D. et al. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 14, R36 (2013).
20. Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
21. Kim, D., Langmead, B. & Salzberg, S. L. HISAT: a fast spliced aligner with low memory requirements. Nat. Methods 12, 357–360 (2015).
22. ENCODE Project Consortium An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
23. Shi, Q. & Shao, M. ENCODE10 dataset. Penn State Data Commons https://doi.org/10.26208/8c06-w247 (2020).
24. Zhang, Q., Shi, Q. & Shao, M. Code for Scallop2. Zenodo https://doi.org/10.5281/zenodo.6013717 (2022).
25. Zhang, Q., Shi, Q. & Shao, M. Code for Scallop2-test. Zenodo https://doi.org/10.5281/zenodo.6064927 (2022).

Acknowledgements

This work is partly supported by the US National Science Foundation (DBI-2019797 to M.S.), the US National Institutes of Health (R01HG011065 to M.S.) and the Charles K. Etner Career Development Professorship awarded to M.S. by The Pennsylvania State University. Initial algorithmic exploration of the Scallop2 advancements was conducted with C. Kingsford (Computational Biology Department, School of Computer Science, Carnegie Mellon University) and was supported by the Gordon and Betty Moore Foundation (GMBF 4554 to C. Kingsford) and the US National Institutes of Health (R01GM122935 to C. Kingsford).

Author information

Authors and Affiliations

Department of Computer Science and Engineering, School of Electrical Engineering and Computer Science, The Pennsylvania State University, Pennsylvania, PA, USA
Qimin Zhang, Qian Shi & Mingfu Shao
Huck Institutes of the Life Sciences, The Pennsylvania State University, Pennsylvania, PA, USA
Mingfu Shao

Contributions

Q.Z., Q.S. and M.S. designed and implemented the algorithms. The experimental comparisons were primarily conducted by Q.Z. All the authors drafted and approved this manuscript.

Corresponding author

Correspondence to Mingfu Shao.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Computational Science thanks Manuel Garber, Alexandru Tomescu and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Kaitlin McCardle, in collaboration with the Nature Computational Science team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Illustrating the construction of the splice graph \(G\) and the associated multi-end phasing paths \(\mathcal{C}\) from read alignments of a gene locus.

Inferred splice positions in the reference genome are marked with black bars. Exons and partial exons are labeled with numbers above the reference genome. Aligned reads drawn in the same color form a group (that is, they carry the same barcode or are the two ends of a pair in paired-end RNA-seq data); gray marks an individual read that forms a group of its own. The read alignments are the input from which the splice graph and the associated multi-end phasing paths are constructed. From the given alignments, five (partial) exons (numbered 1–5) are identified. A source vertex s and a sink vertex t are added to the splice graph. Each arrow in the splice graph represents a directed edge, and the number on it is the edge weight.
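The construction described in this caption can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the Scallop2 implementation: reads are assumed to be already mapped to (partial) exon indices, edge weights are plain read counts, and the function name `build_splice_graph`, the input layout and the toy reads are hypothetical.

```python
from collections import Counter, defaultdict

def build_splice_graph(reads):
    """Build a toy splice graph and multi-end phasing paths.

    reads: list of (group_id, exon_indices); group_id is a barcode or pair ID,
    or None for a read that forms a group of its own (hypothetical layout).
    Returns (edge_weight, phasing_paths).
    """
    edge_weight = Counter()      # (u, v) -> number of reads supporting the edge
    coverage = Counter()         # vertex -> number of reads covering it
    groups = defaultdict(list)   # group key -> read fragments in that group

    for i, (group_id, exons) in enumerate(reads):
        coverage.update(exons)
        edge_weight.update(zip(exons, exons[1:]))
        key = group_id if group_id is not None else ("single", i)
        groups[key].append(tuple(exons))

    # Assumption: connect source "s" to vertices without an incoming edge and
    # vertices without an outgoing edge to sink "t", weighted by coverage.
    has_in = {v for _, v in edge_weight}
    has_out = {u for u, _ in edge_weight}
    for v in coverage:
        if v not in has_in:
            edge_weight[("s", v)] = coverage[v]
        if v not in has_out:
            edge_weight[(v, "t")] = coverage[v]

    # One multi-end phasing path per group, fragments ordered along the genome.
    phasing_paths = [sorted(frags) for frags in groups.values()]
    return edge_weight, phasing_paths

# Hypothetical reads over five (partial) exons; not the data shown in the figure.
toy_reads = [
    ("bc1", [1, 2]), ("bc1", [4]), ("bc1", [5]),
    ("bc2", [1, 3]), ("bc2", [4, 5]),
    (None, [2, 4]),
]
edges, paths = build_splice_graph(toy_reads)
```

Here `paths` holds one list of fragments per group, for example `[(1, 2), (4,), (5,)]` for barcode `bc1`; the gaps between fragments are what the bridging step has to fill.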

Extended Data Fig. 2 Illustration of bridging a multi-end phasing path into a single-end phasing path.

Each arrow represents a directed edge, and the number on it is the edge weight. Vertices in the multi-end phasing path \(C\) are circled in blue, and its edges are drawn as blue arrows. The inferred bridging paths for \((l_1, l_2)\) and \((l_2, l_3)\) are marked in red. Taking the splice graph and the multi-end phasing path \(C=(l_1=(1,2),l_2=(4),l_3=(6))\) as input, bridging yields the single-end phasing path \(h(C)=(1,2,3,4,6)\).
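To make the bridging step concrete, the following Python sketch fills each gap between consecutive fragments with a path through the splice graph. The selection criterion used here, choosing the path whose smallest edge weight (bottleneck) is largest, is an assumption made for illustration; the caption does not state the criterion. The function names, the requirement that vertices are integers in topological order, and the edge weights of `toy_graph` are likewise hypothetical and are not the weights shown in the figure.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Graph = Dict[int, Dict[int, float]]  # graph[u][v] = weight of edge u -> v

def widest_path(graph: Graph, src: int, dst: int) -> Optional[List[int]]:
    """Return the src -> dst path with the largest bottleneck weight, or None.

    Assumes vertices are integers in topological order (u < v for every edge).
    """
    best = {src: float("inf")}   # best bottleneck found so far per vertex
    prev: Dict[int, int] = {}    # predecessor on the best path
    for u in range(src, dst):
        if u not in best:
            continue
        for v, w in graph.get(u, {}).items():
            if v > dst:
                continue
            cand = min(best[u], w)
            if cand > best.get(v, float("-inf")):
                best[v] = cand
                prev[v] = u
    if dst not in best:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]

def bridge(graph: Graph, phasing_path: Sequence[Tuple[int, ...]]) -> Optional[List[int]]:
    """Join the fragments of a multi-end phasing path into one vertex path."""
    result = list(phasing_path[0])
    for nxt in phasing_path[1:]:
        if result[-1] == nxt[0]:          # fragments share an end vertex
            result.extend(nxt[1:])
            continue
        gap = widest_path(graph, result[-1], nxt[0])
        if gap is None:                   # fragments cannot be bridged
            return None
        result.extend(gap[1:-1])          # interior vertices of the bridge
        result.extend(nxt)
    return result

# Hypothetical weights chosen so the result matches h(C) in the caption.
toy_graph: Graph = {1: {2: 5}, 2: {3: 4, 4: 1}, 3: {4: 3}, 4: {5: 2, 6: 3}, 5: {6: 1}}
bridged = bridge(toy_graph, [(1, 2), (4,), (6,)])
```

With these toy weights, the gap between vertices 2 and 4 is bridged through vertex 3 (bottleneck 3 versus 1 for the direct edge), and the gap between 4 and 6 uses the direct edge, so `bridged` is `[1, 2, 3, 4, 6]`. A return value of `None` corresponds to a multi-end read that fails to bridge.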

Extended Data Fig. 3 Illustration of identifying false starting/ending vertices.

Inferred splice positions in the reference genome are marked with black bars. Exons and partial exons are labeled with numbers above the reference genome. Aligned reads drawn in the same color form a group (that is, they carry the same barcode or are the two ends of a pair in paired-end RNA-seq data); gray marks an individual read that forms a group of its own. From the given alignments, four partial exons (numbered 1–4) are identified. A source vertex s and a sink vertex t are added to the splice graph. Each arrow represents a directed edge, and the number on it is the edge weight. The circled vertex 2 (respectively 3) is classified as an ending (respectively starting) vertex because it has no departing (respectively entering) junction. A pseudo-edge (2,3) from the circled vertex 2 to the circled vertex 3 is added to the splice graph for bridging. Bridging the blue reads and the red reads (the second and the third ones) crosses this pseudo-edge, giving both vertices a pseudo-score of 2.
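The bookkeeping in this caption can be sketched as follows. This is a minimal illustration under assumptions, not the Scallop2 code: the graph is reduced to a set of junction edges over integer vertices in genomic order, a pseudo-edge is proposed from every ending vertex to every later starting vertex, and the pseudo-score of a vertex is the number of bridged paths that cross a pseudo-edge incident to it. The function names and toy inputs are hypothetical, and how the scores are turned into a final decision is not covered here.

```python
from collections import Counter

def candidate_pseudo_edges(vertices, junctions):
    """Pair each ending vertex with each later starting vertex.

    vertices: integer vertex IDs in genomic order.
    junctions: set of (u, v) edges observed in the splice graph.
    """
    has_out = {u for u, _ in junctions}
    has_in = {v for _, v in junctions}
    ending = [v for v in vertices if v not in has_out]     # no departing junction
    starting = [v for v in vertices if v not in has_in]    # no entering junction
    return [(e, s) for e in ending for s in starting if e < s]

def pseudo_scores(bridged_paths, pseudo_edges):
    """Count, per vertex, the bridged paths that cross an incident pseudo-edge."""
    pseudo = set(pseudo_edges)
    score = Counter()
    for path in bridged_paths:
        for edge in zip(path, path[1:]):
            if edge in pseudo:
                score[edge[0]] += 1
                score[edge[1]] += 1
    return score

# Hypothetical inputs loosely mirroring the caption: four partial exons,
# with no junction leaving vertex 2 and none entering vertex 3.
toy_vertices = [1, 2, 3, 4]
toy_junctions = {(1, 2), (3, 4)}
pseudo = candidate_pseudo_edges(toy_vertices, toy_junctions)

# Two bridged groups cross the pseudo-edge (2, 3); the third does not.
toy_bridged = [(1, 2, 3, 4), (2, 3, 4), (1, 2)]
scores = pseudo_scores(toy_bridged, pseudo)
```

In this toy case `pseudo` is `[(2, 3)]` and `scores` assigns 2 to both vertex 2 and vertex 3, matching the pseudo-score described in the caption.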

Supplementary information

Supplementary Information

Supplementary Figs. 1–15, Tables 1 and 2, Notes 1–8 and Algorithm 1.

Source data

Source Data Fig. 1

Statistical source data for Fig. 1.

About this article

Cite this article

Zhang, Q., Shi, Q. & Shao, M. Accurate assembly of multi-end RNA-seq data with Scallop2. Nat Comput Sci 2, 148–152 (2022). https://doi.org/10.1038/s43588-022-00216-1

Received: 27 September 2021
Accepted: 16 February 2022
Published: 28 March 2022
Issue Date: March 2022
DOI: https://doi.org/10.1038/s43588-022-00216-1
