  • Brief Communication
Accurate assembly of multi-end RNA-seq data with Scallop2

A preprint version of the article is available at bioRxiv.

Abstract

Modern RNA-sequencing (RNA-seq) protocols can produce multi-end data, where multiple reads originating from the same transcript are attached to the same barcode. The long-range information in the multi-end reads is beneficial in phasing complicated spliced isoforms, but assembly algorithms that leverage such information are lacking. Here we introduce Scallop2, a reference-based assembler optimized for multi-end RNA-seq data. The algorithmic core of Scallop2 consists of three steps: (1) using an algorithm to ‘bridge’ multi-end reads into single-end phasing paths in the context of a splice graph, (2) employing a method to refine erroneous splice graphs by utilizing multi-end reads that fail to bridge and (3) piping the refined splice graph and bridged phasing paths into an algorithm that integrates multiple phase-preserving decompositions. Tested on 561 cells in two Smart-seq3 datasets and on ten Illumina paired-end RNA-seq samples, Scallop2 substantially improves the assembly accuracy compared with two popular assemblers (StringTie2 and Scallop).

Fig. 1: Comparison of assembly accuracy of Scallop2, StringTie2, Scallop and CLASS2.

Data availability

Smart-seq3 is a single-cell protocol that generates multi-end RNA-seq data using barcoding technology. The Smart-seq3 data used in this study were published together with the Smart-seq3 protocol (ref. 17) and are publicly available at https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-8735. We use two datasets from this release: (1) HEK293T, containing 192 human cells, and (2) Mouse-Fibroblast, containing 369 mouse cells. For the Illumina platform, we use ten human paired-end RNA-seq samples downloaded from the ENCODE project (ref. 22), which we refer to as ENCODE10. Their accession IDs are SRR307903, SRR307911, SRR315323, SRR315334, SRR387661, SRR534291, SRR534307, SRR534319, SRR545695 and SRR545723. The alignments of these samples are available at the Penn State Data Commons repository (ref. 23). Source data are provided with this paper.

Code availability

Scallop2 is available at the Zenodo repository (ref. 24) and on GitHub (https://github.com/Shao-Group/scallop2). Scripts and documentation that reproduce the experimental results are available at the Zenodo repository (ref. 25) and on GitHub (https://github.com/Shao-Group/scallop2-test).

References

  1. Trapnell, C. et al. Transcript assembly and quantification by RNA-seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515 (2010).

  2. Guttman, M. et al. Ab initio reconstruction of cell type–specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nat. Biotechnol. 28, 503–510 (2010).

  3. Tomescu, A. I., Kuosmanen, A., Rizzi, R. & Mäkinen, V. A novel min-cost flow method for estimating transcript expression with RNA-seq. BMC Bioinformatics 14, S15 (2013).

  4. Song, L., Sabunciyan, S. & Florea, L. CLASS2: accurate and efficient splice variant annotation from RNA-seq reads. Nucleic Acids Res. 44, e98 (2016).

  5. Liu, J., Yu, T., Jiang, T. & Li, G. TransComb: genome-guided transcriptome assembly via combing junctions in splicing graphs. Genome Biol. 17, 213 (2016).

  6. Shao, M. & Kingsford, C. Accurate assembly of transcripts through phase-preserving graph decomposition. Nat. Biotechnol. 35, 1167–1169 (2017).

  7. Pertea, M. et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol. 33, 290–295 (2015).

  8. Kovaka, S. et al. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biol. 20, 278 (2019).

  9. Mao, S., Pachter, L., Tse, D. & Kannan, S. RefShannon: a genome-guided transcriptome assembler using sparse flow decomposition. PLoS ONE 15, e0232946 (2020).

  10. Tung, L. H., Shao, M. & Kingsford, C. Quantifying the benefit offered by transcript assembly with Scallop-LR on single-molecule long reads. Genome Biol. 20, 287 (2019).

  11. Shao, M. & Kingsford, C. Theory and a heuristic for the minimum path flow decomposition problem. IEEE/ACM Trans. Comput. Biol. Bioinform. 16, 658–670 (2019).

  12. Williams, L., Tomescu, A. & Mumey, B. M. Flow decomposition with subpath constraints. In 21st International Workshop on Algorithms in Bioinformatics (WABI 2021) Vol. 201 (eds Carbone, A. & El-Kebir, M.) 16.1–16.15 (2021).

  13. Williams, L., Reynolds, G. & Mumey, B. RNA transcript assembly using inexact flows. In 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 1907–1914 (IEEE, 2019).

  14. Amarasinghe, S. L. et al. Opportunities and challenges in long-read sequencing data analysis. Genome Biol. 21, 30 (2020).

  15. Voshall, A. & Moriyama, E. N. in Bioinformatics in the Era of Post Genomics and Big Data (ed. Abdurakhmonov, I. Y.) 15–36 (IntechOpen, 2018).

  16. Picelli, S. et al. Full-length RNA-seq from single cells using Smart-seq2. Nat. Protoc. 9, 171–181 (2014).

  17. Hagemann-Jensen, M. et al. Single-cell RNA counting at allele and isoform resolution using Smart-seq3. Nat. Biotechnol. 38, 708–714 (2020).

  18. Pertea, G. & Pertea, M. GFF utilities: GffRead and GffCompare. F1000Res 9, 304 (2020).

  19. Kim, D. et al. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 14, R36 (2013).

  20. Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).

  21. Kim, D., Langmead, B. & Salzberg, S. L. HISAT: a fast spliced aligner with low memory requirements. Nat. Methods 12, 357–360 (2015).

  22. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).

  23. Shi, Q. & Shao, M. ENCODE10 dataset. Penn State Data Commons https://doi.org/10.26208/8c06-w247 (2020).

  24. Zhang, Q., Shi, Q. & Shao, M. Code for Scallop2. Zenodo https://doi.org/10.5281/zenodo.6013717 (2022).

  25. Zhang, Q., Shi, Q. & Shao, M. Code for Scallop2-test. Zenodo https://doi.org/10.5281/zenodo.6064927 (2022).

Acknowledgements

This work is partly supported by the US National Science Foundation (DBI-2019797 to M.S.), the US National Institutes of Health (R01HG011065 to M.S.) and the Charles K. Etner Career Development Professorship awarded to M.S. by The Pennsylvania State University. Initial algorithmic exploration of Scallop2 advancements was conducted with C. Kingsford (Computational Biology Department, School of Computer Science, Carnegie Mellon University) and was supported by the Gordon and Betty Moore Foundation (GMBF 4554 to C. Kingsford) and the US National Institutes of Health (R01GM122935 to C. Kingsford).

Author information

Contributions

Q.Z., Q.S. and M.S. designed and implemented the algorithms. The experimental comparisons were primarily conducted by Q.Z. All the authors drafted and approved this manuscript.

Corresponding author

Correspondence to Mingfu Shao.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Computational Science thanks Manuel Garber, Alexandru Tomescu and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Kaitlin McCardle, in collaboration with the Nature Computational Science team.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Illustrating the construction of the splice graph \(G\) and the associated multi-end phasing paths \(\mathcal{C}\) from read alignments of a gene locus.

Inferred splice positions in the reference genome are marked with black bars. Exons and partial exons are labeled with numbers above the reference genome. Aligned reads drawn in the same color form a group (that is, they are attached to the same barcode or are the two ends of a pair in paired-end RNA-seq data); gray marks an individual read that forms a group of its own. The read alignments are the input from which the splice graph and the associated multi-end phasing paths are constructed. From the given alignments, five (partial) exons (numbered 1–5) are identified. A source vertex s and a sink vertex t are added to the splice graph. Each numbered arrow in the splice graph represents a directed edge and its weight.
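The construction described in this caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Scallop2 implementation: each read is assumed to be already reduced to the tuple of (partial) exon indices it covers, an edge weight is taken to be the number of reads supporting that edge, and the weights on the source and sink edges are set to vertex coverage purely for illustration. The function name `build_splice_graph` and the toy reads are hypothetical.

```python
from collections import Counter

def build_splice_graph(read_groups):
    """Build a toy splice graph and its multi-end phasing paths.

    read_groups: list of groups; each group is a list of reads, and each read
    is a tuple of (partial) exon indices in genomic order.
    Returns (edge_weight, phasing_paths).
    """
    edge_weight = Counter()   # (u, v) -> number of reads supporting the edge
    coverage = Counter()      # vertex -> number of reads covering it
    for group in read_groups:
        for read in group:
            coverage.update(read)
            edge_weight.update(zip(read, read[1:]))

    has_in = {v for _, v in edge_weight}
    has_out = {u for u, _ in edge_weight}
    for v in sorted(coverage):
        # Assumption: connect s to vertices with no entering edge and
        # vertices with no departing edge to t, weighted by coverage.
        if v not in has_in:
            edge_weight[("s", v)] = coverage[v]
        if v not in has_out:
            edge_weight[(v, "t")] = coverage[v]

    # One multi-end phasing path per group: the ordered tuple of its reads.
    phasing_paths = [tuple(group) for group in read_groups]
    return dict(edge_weight), phasing_paths

# Hypothetical locus with five (partial) exons and three read groups.
toy_groups = [
    [(1, 2), (4, 5)],   # two reads sharing a barcode
    [(1, 3), (3, 4)],   # another multi-end group
    [(2, 4)],           # an individual read forming its own group
]
graph, paths = build_splice_graph(toy_groups)
```

Keeping the groups as tuples of reads, rather than merging them immediately, preserves the gaps between reads that the bridging step has to fill.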

Extended Data Fig. 2 Illustration of bridging a multi-end phasing path into a single-end phasing path.

Each numbered arrow represents a directed edge and its weight. Vertices and edges in the multi-end phasing path \(C\) are circled and arrowed in blue, respectively. Inferred bridging paths for \((l_1,l_2)\) and \((l_2,l_3)\) are marked in red. Taking the splice graph and the multi-end phasing path \(C=(l_1=(1,2),l_2=(4),l_3=(6))\) as input, bridging yields the single-end phasing path \(h(C)=(1,2,3,4,6)\).
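The bridging example in this caption can be reproduced with a short Python sketch. Two things here are assumptions rather than facts from the text: the selection criterion (the bridging path whose smallest edge weight is largest, a widest-path rule) and the edge weights, which are invented so that the toy graph returns the path stated in the caption. The function names `widest_path` and `bridge` are hypothetical, and vertices are assumed to be numbered in topological order.

```python
def widest_path(edges, src, dst):
    """Return the src-to-dst path maximizing the minimum edge weight.

    edges: dict {(u, v): weight} with u < v (topologically numbered vertices).
    Returns a tuple of vertices, or None if dst is unreachable from src.
    """
    best = {src: (float("inf"), (src,))}   # vertex -> (bottleneck, path)
    for v in range(src + 1, dst + 1):
        candidates = []
        for (a, b), w in edges.items():
            if b == v and a in best:
                bottleneck, path = best[a]
                candidates.append((min(bottleneck, w), path + (v,)))
        if candidates:
            best[v] = max(candidates, key=lambda c: c[0])
    return best.get(dst, (0, None))[1]

def bridge(edges, fragments):
    """Join the fragments of a multi-end phasing path into a single path."""
    path = list(fragments[0])
    for nxt in fragments[1:]:
        link = widest_path(edges, path[-1], nxt[0])
        if link is None:
            return None            # this gap cannot be bridged
        path.extend(link[1:])      # inferred vertices, ending at nxt[0]
        path.extend(nxt[1:])       # rest of the next fragment
    return tuple(path)

# Hypothetical weights; the weights in the actual figure are not given here.
toy_edges = {(1, 2): 5, (2, 3): 4, (2, 4): 1, (3, 4): 4,
             (4, 5): 2, (4, 6): 3, (5, 6): 2}
C = ((1, 2), (4,), (6,))
h_C = bridge(toy_edges, C)
```

With these weights the gap between vertices 2 and 4 is filled through vertex 3 (bottleneck 4 versus 1 for the direct edge), and the gap between 4 and 6 uses the direct edge (bottleneck 3 versus 2), so `h_C` equals `(1, 2, 3, 4, 6)`.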

Extended Data Fig. 3 Illustration of identifying false starting/ending vertices.

Inferred splice positions in the reference genome are marked with black bars. Exons and partial exons are labeled with numbers above the reference genome. Aligned reads drawn in the same color form a group (that is, they are attached to the same barcode or are the two ends of a pair in paired-end RNA-seq data); gray marks an individual read that forms a group of its own. From the given alignments, four partial exons (numbered 1–4) are identified. A source vertex s and a sink vertex t are added to the splice graph. Each numbered arrow represents a directed edge and its weight. The circled vertex 2 (respectively 3) is classified as an ending (respectively starting) vertex because it has no departing (respectively entering) junction. A pseudo-edge (2,3) from circled vertex 2 to circled vertex 3 is added to the splice graph for bridging. Bridging the blue reads and the red reads (the second and third ones) crosses this pseudo-edge, giving both vertices a pseudo-score of 2.
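The classification and scoring described in this caption can be sketched as follows. This is an illustrative simplification, not the authors' procedure: a read group is counted toward the pseudo-score when one of its gaps cannot be bridged in the original graph but can be once the pseudo-edge is added. The toy edges, the read groups and the function names are hypothetical, and the source and sink edges are left out for brevity.

```python
def ending_and_starting(vertices, edges):
    """Vertices with no departing edge (ending) or no entering edge (starting)."""
    has_out = {u for u, _ in edges}
    has_in = {v for _, v in edges}
    ending = [v for v in vertices if v not in has_out]
    starting = [v for v in vertices if v not in has_in]
    return ending, starting

def reachable(edges, src, dst):
    """Depth-first search over directed edges."""
    stack, seen = [src], {src}
    while stack:
        x = stack.pop()
        if x == dst:
            return True
        for a, b in edges:
            if a == x and b not in seen:
                seen.add(b)
                stack.append(b)
    return False

def pseudo_score(edges, groups, pseudo_edge):
    """Count read groups whose bridging needs the pseudo-edge."""
    with_pseudo = set(edges) | {pseudo_edge}
    score = 0
    for fragments in groups:
        gaps = zip(fragments, fragments[1:])
        if any(not reachable(edges, a[-1], b[0])
               and reachable(with_pseudo, a[-1], b[0]) for a, b in gaps):
            score += 1
    return score

# Hypothetical locus: four partial exons, edges 1->2 and 3->4 only.
toy_edges = {(1, 2), (3, 4)}
ending, starting = ending_and_starting([1, 2, 3, 4], toy_edges)
# Candidate pseudo-edges run from an ending vertex to a later starting vertex.
candidates = [(u, v) for u in ending for v in starting if u < v]
toy_groups = [
    [(1, 2)],            # individual read
    [(1, 2), (3, 4)],    # group whose gap spans vertices 2 and 3
    [(2,), (3, 4)],      # second group spanning the same gap
    [(3, 4)],            # individual read
]
score = pseudo_score(toy_edges, toy_groups, candidates[0])
```

In this toy input the only candidate pseudo-edge is (2,3), and two groups need it for bridging, which matches the pseudo-score of 2 in the caption. How the score is then used to decide whether a vertex is a false start or end is not covered by this sketch.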

Supplementary information

Supplementary Information

Supplementary Figs. 1–15, Tables 1 and 2, Notes 1–8 and Algorithm 1.

Source data

Source Data Fig. 1

Statistical source data for Fig. 1.

About this article

Cite this article

Zhang, Q., Shi, Q. & Shao, M. Accurate assembly of multi-end RNA-seq data with Scallop2. Nat Comput Sci 2, 148–152 (2022). https://doi.org/10.1038/s43588-022-00216-1

  • DOI: https://doi.org/10.1038/s43588-022-00216-1
