Abstract
The recently introduced TruSeq synthetic long read (TSLR) technology generates long and accurate virtual reads from an assembly of barcoded pools of short reads. The TSLR method provides an attractive alternative to existing sequencing platforms that generate long but inaccurate reads. We describe the truSPAdes algorithm (http://bioinf.spbau.ru/spades) for TSLR assembly and show that it results in a dramatic improvement in the quality of metagenomics assemblies.
References
Chin, C.S. et al. Nat. Methods 10, 563–569 (2013).
Lam, K.K., Khalak, A. & Tse, D. BMC Bioinformatics 15, S4 (2014).
Koren, S. et al. Genome Biol. 14, R101 (2013).
Huddleston, J. et al. Genome Res. 24, 688–696 (2014).
Salmela, L. & Rivals, E. Bioinformatics 30, 3506–3514 (2014).
Ummat, A. & Bashir, A. Bioinformatics 30, 3491–3498 (2014).
Lam, K.-K., LaButti, K., Khalak, A. & Tse, D. Bioinformatics 31, 3207–3209 (2015).
Berlin, K. et al. Nat. Biotechnol. 33, 623–630 (2015).
McCoy, R.C. et al. PLoS ONE 9, e106689 (2014).
Tilgner, H. et al. Nat. Biotechnol. 33, 736–742 (2015).
Li, R. et al. Sci. Rep. 5, 10814 (2015).
Sharon, I. et al. Genome Res. 25, 534–543 (2015).
Kuleshov, V. et al. Nat. Biotechnol. 34, 64–69 (2015).
Chitsaz, H. et al. Nat. Biotechnol. 29, 915–921 (2011).
Bankevich, A. et al. J. Comput. Biol. 19, 455–477 (2012).
Peng, Y., Leung, H.C.M., Yiu, S.M. & Chin, F.Y.L. Bioinformatics 28, 1420–1428 (2012).
Compeau, P.E., Pevzner, P.A. & Tesler, G. Nat. Biotechnol. 29, 987–991 (2011).
Kuleshov, V. et al. Nat. Biotechnol. 32, 261–266 (2014).
Simpson, J.T. & Durbin, R. Genome Res. 22, 549–556 (2012).
Gurevich, A., Saveliev, V., Vyahhi, N. & Tesler, G. Bioinformatics 29, 1072–1075 (2013).
Peng, Y., Leung, H.C.M., Yiu, S.M. & Chin, F.Y.L. Bioinformatics 27, i94–i101 (2011).
Boisvert, S., Raymond, F., Godzaridis, E., Laviolette, F. & Corbeil, J. Genome Biol. 13, R122 (2012).
Haider, B. et al. Bioinformatics 30, 2717–2722 (2014).
Howe, A.C. et al. Proc. Natl. Acad. Sci. USA 111, 4904–4909 (2014).
Marcy, Y. et al. Proc. Natl. Acad. Sci. USA 104, 11889–11894 (2007).
McLean, J.S. et al. Genome Res. 23, 867–877 (2013).
Nurk, S. et al. J. Comput. Biol. 20, 714–737 (2013).
Myers, E.W. et al. Science 287, 2196–2204 (2000).
Treangen, T.J. et al. Genome Biol. 14, R2 (2013).
Peters, B.A., Liu, J. & Drmanac, R. Front. Genet. 5, 466 (2015).
Dean, F.B., Nelson, J.R., Giesler, T.L. & Lasken, R.S. Genome Res. 11, 1095–1099 (2001).
Lasken, R.S. & Stockwell, T.B. BMC Biotechnol. 7, 19 (2007).
Zerbino, D.R. & Birney, E. Genome Res. 18, 821–829 (2008).
Simpson, J.T. et al. Genome Res. 19, 1117–1123 (2009).
Prjibelski, A. et al. Bioinformatics 30, i293–i301 (2014).
Zimin, A.V., Smith, D.R., Sutton, G. & Yorke, J.A. Bioinformatics 24, 42–45 (2008).
Vasilinetc, I., Prjibelski, A.D., Gurevich, A., Korobeynikov, A. & Pevzner, P.A. Bioinformatics 31, 3262–3268 (2015).
Antipov, D., Korobeynikov, A., McLean, J.S. & Pevzner, P.A. Bioinformatics doi:10.1093/bioinformatics/btv688 (2015).
Ashton, P.M. et al. Nat. Biotechnol. 33, 296–300 (2015).
Acknowledgements
We are indebted to V. Montel, J. Stuzka and O. Schulz-Trieglaff at Illumina for many helpful discussions, sample preparation and TSLR data. We thank J. Banfield and I. Sharon for providing their metagenomics TSLR data. This study was supported by the Russian Science Foundation (grant 14-50-00069 to A.B. and P.A.P.).
Author information
Authors and Affiliations
Contributions
A.B. developed and implemented the truSPAdes algorithm and performed benchmarking. A.B. and P.A.P. conceived the study, designed the computational experiments and wrote the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Integrated supplementary information
Supplementary Figure 1 k-mer coverage histograms.
Histograms of k-mer coverage (k = 55) for the E. coli standard isolate dataset from Bankevich et al. (ref. 16) (a), the E. coli MDA-amplified single-cell dataset from Bankevich et al. (ref. 16) (b), one of the barcodes of TSLR data (c) and a single 10 Kb long fragment of a barcode (d). Conventional assemblers select a coverage threshold to separate correct from erroneous k-mers. The histogram for data from the standard isolate features a smaller peak on the left (formed by largely erroneous k-mers with low coverage) and a larger peak on the right (formed by largely correct k-mers with high coverage). Thus, one can choose a proper threshold that separates correct from false k-mers (ref. 38). However, for both MDA and TSLR, there is no threshold separating correct and false k-mers.
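The threshold selection described in this caption can be sketched in a few lines. The following is a minimal illustration, not code from truSPAdes or any other assembler: the function names, the toy reads and the simple "first valley" rule are all assumptions made for the example, and k = 5 is used instead of k = 55 to keep the data small.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count how many times each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def coverage_histogram(counts):
    """Map each coverage value to the number of distinct k-mers having it."""
    return Counter(counts.values())


def valley_threshold(hist):
    """Walk right from coverage 1 while the histogram does not increase.

    Returns the coverage value where the walk stops if the histogram rises
    again after it, or None if it never rises (no second peak, so this
    simple rule finds no separating threshold).
    """
    if not hist:
        return None
    freq = [hist.get(c, 0) for c in range(1, max(hist) + 1)]
    i = 0
    while i + 1 < len(freq) and freq[i + 1] <= freq[i]:
        i += 1
    if i + 1 >= len(freq):
        return None
    return i + 1  # freq[i] holds the count for coverage i + 1


# Hypothetical toy data: ten error-free copies of a short "genome"
# plus one read carrying a single substitution.
genome = "ACGTACCGTTAGGCTA"
reads = [genome] * 10 + ["ACGTACCGATAGGCTA"]
k = 5

counts = kmer_counts(reads, k)
hist = coverage_histogram(counts)
threshold = valley_threshold(hist)
# k-mers with coverage above the threshold are treated as correct.
solid = {kmer for kmer, c in counts.items() if c > threshold}
print(sorted(hist.items()), threshold, len(solid))
```

With uniform coverage, as in this toy isolate-like dataset, the valley separates the erroneous k-mers from the genomic ones. When coverage is highly non-uniform, as the caption notes for MDA and TSLR data, a single global cutoff of this kind either keeps errors from high-coverage regions or discards correct k-mers from low-coverage regions.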
Supplementary Figure 2 Barcode span.
Construction of the barcode span: red regions have rather uniform read coverage and length close to 10 Kb. Black reads, which do not belong to the selected barcode spans, represent read mapping artifacts and are ignored.
Supplementary Figure 3 Typical misassemblies.
Two common types of misassemblies: false (a,b,c) and chimeric (d,e,f) connections. (a) Two unrelated instances of the blue repeat are located in red (left) and yellow (right) genome fragments. These instances are flanked by short dotted segments (b). These short dotted segments correspond to short dotted edges (tips) in the de Bruijn graph. (c) Tip trimming results in a single (misassembled) edge in the de Bruijn graph representing a false connection. (d) A region of the genome formed by consecutive yellow and green segments. (e) Since the yellow fragment has been erroneously amplified from the opposite strand, the reverse complementary copy is added to the end of this region, resulting in a chimeric fragment (f). In the de Bruijn graph, the corresponding yellow solid edge has two outgoing edges: one for each connection between the yellow and green parts of the genome fragment. One of these connections represents an erroneous chimeric connection (transition from solid yellow to dashed green). We note that our explanation for the experimental cause of the chimeric connection is just a hypothesis that accurately reflects the computational artifacts we observe.
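The false connection in panels a–c can be reproduced on a toy example. The sketch below is an illustration under stated assumptions and is not the truSPAdes implementation: the two fragments, the repeat, the choice k = 4, the length cutoff `max_tip_edges` and all function names are hypothetical, and the tip-trimming rule (remove any dead-end chain of at most `max_tip_edges` edges that hangs off a junction) is a deliberately simplified version of what assemblers do.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Edges of the de Bruijn graph: one (prefix, suffix) pair per distinct k-mer."""
    edges = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges.add((kmer[:-1], kmer[1:]))
    return edges


def adjacency(edges):
    """Successor and predecessor sets for every node."""
    succ, pred = defaultdict(set), defaultdict(set)
    for u, v in edges:
        succ[u].add(v)
        pred[v].add(u)
    return succ, pred


def find_tip(edges, max_tip_edges):
    """Return the edges of one short dead-end chain attached to a junction, or None."""
    succ, pred = adjacency(edges)
    for node in sorted(set(succ) | set(pred)):
        # First try node as a source (walk forward), then as a sink (walk backward).
        for walk, back, forward in ((succ, pred, True), (pred, succ, False)):
            if back.get(node):
                continue  # node is not a dead end in this direction
            path, cur = [], node
            while len(walk.get(cur, ())) == 1 and len(path) < max_tip_edges:
                (nxt,) = walk[cur]
                path.append((cur, nxt) if forward else (nxt, cur))
                if len(back[nxt]) > 1:
                    return path  # reached a junction: the chain is a tip
                cur = nxt
    return None


def trim_tips(edges, max_tip_edges):
    """Remove tips one at a time until none is left."""
    edges = set(edges)
    tip = find_tip(edges, max_tip_edges)
    while tip is not None:
        edges -= set(tip)
        tip = find_tip(edges, max_tip_edges)
    return edges


def spell_single_path(edges):
    """Spell the sequence if the graph is one non-branching path, else return None."""
    succ, pred = adjacency(edges)
    sources = [n for n in succ if not pred.get(n)]
    if len(sources) != 1:
        return None
    seq, cur, used = sources[0], sources[0], 0
    while succ.get(cur):
        if len(succ[cur]) != 1 or used >= len(edges):
            return None
        (cur,) = succ[cur]
        seq += cur[-1]
        used += 1
    return seq if used == len(edges) else None


# Hypothetical fragments sharing a repeat, each with one short flank.
repeat = "GATTACA"
frag_a = "CCGT" + repeat + "GG"     # long left flank, short right flank
frag_b = "TC" + repeat + "CTGCA"    # short left flank, long right flank

graph = de_bruijn_edges([frag_a, frag_b], k=4)
trimmed = trim_tips(graph, max_tip_edges=2)
joined = spell_single_path(trimmed)
print(joined)
```

Before trimming, the graph branches on both sides of the repeat and cannot be spelled as one path. After the two short flanks are removed as tips, the remaining path joins the left flank of the first fragment to the right flank of the second, a sequence present in neither input fragment.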
Supplementary Figure 4 Iterative assembly.
A fragment of a genome along with four reads (1st panel) and de Bruijn graphs of these reads constructed for k = 3 (2nd panel), k = 4 (3rd panel), and k = 5 (4th panel). The parameter k = 4 represents the “sweet spot” in the iterative assembly since the de Bruijn graph for k = 3 is over-tangled while the de Bruijn graph for k = 5 is over-fragmented.
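The trade-off between over-tangled and over-fragmented graphs can be illustrated with a small sketch. The reads below are hypothetical and are not the four reads shown in the figure; the function names and the two summary statistics (number of branching nodes, number of connected components) are likewise assumptions chosen for the example.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Edges of the de Bruijn graph: one (prefix, suffix) pair per distinct k-mer."""
    edges = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges.add((kmer[:-1], kmer[1:]))
    return edges


def graph_stats(edges):
    """Return (branching nodes, weakly connected components) of the graph."""
    succ, pred, undirected = defaultdict(set), defaultdict(set), defaultdict(set)
    for u, v in edges:
        succ[u].add(v)
        pred[v].add(u)
        undirected[u].add(v)
        undirected[v].add(u)
    nodes = set(undirected)

    # A node is branching if it has more than one successor or predecessor.
    branching = sum(
        1 for n in nodes
        if len(succ.get(n, ())) > 1 or len(pred.get(n, ())) > 1
    )

    # Count components with a depth-first search that ignores edge direction.
    seen, components = set(), 0
    for start in nodes:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbour in undirected[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
    return branching, components


# Hypothetical reads from the fragment ACGTCGA: the 2-mer CG occurs twice,
# and the two reads overlap by only three nucleotides (GTC).
reads = ["ACGTC", "GTCGA"]

stats = {k: graph_stats(de_bruijn_edges(reads, k)) for k in (3, 4, 5)}
for k, (branching, components) in stats.items():
    print(f"k={k}: branching nodes={branching}, components={components}")
```

In this toy case the graph for k = 3 contains a branching node caused by the repeated 2-mer, the graph for k = 5 splits into two components because the reads share no 4-mer, and the graph for k = 4 is a single non-branching path.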
Supplementary Figure 5 TruSPAdes pseudocode.
Outline of the truSPAdes pipeline. TruSPAdes-specific modifications are highlighted in blue.
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–5 (PDF 598 kb)
About this article
Cite this article
Bankevich, A., Pevzner, P. TruSPAdes: barcode assembly of TruSeq synthetic long reads. Nat Methods 13, 248–250 (2016). https://doi.org/10.1038/nmeth.3737
DOI: https://doi.org/10.1038/nmeth.3737
This article is cited by
- SLR-superscaffolder: a de novo scaffolding tool for synthetic long reads using a top-to-bottom scheme. BMC Bioinformatics (2021)
- Optimizing sequencing protocols for leaderboard metagenomics by combining long and short reads. Genome Biology (2019)
- High-quality genome sequences of uncultured microbes by assembly of read clouds. Nature Biotechnology (2018)
- Comparison of carnivore, omnivore, and herbivore mammalian genomes with a new leopard assembly. Genome Biology (2016)