Abstract
As high-throughput sequencing continues to increase in speed and throughput, routine clinical and industrial application draws closer. These 'production' settings will require enhanced quality monitoring and quality control to optimize output and reduce costs. We developed SeqControl, a framework for predicting sequencing quality and coverage using a set of 15 metrics describing overall coverage, coverage distribution, basewise coverage and basewise quality. Using whole-genome sequences of 27 prostate cancers and 26 normal references, we derived multivariate models that predict sequencing quality and depth. SeqControl robustly predicted how much sequencing was required to reach a given coverage depth (area under the curve (AUC) = 0.993), accurately classified clinically relevant formalin-fixed, paraffin-embedded samples, and made predictions from as little as one-eighth of a sequencing lane (AUC = 0.967). These techniques can be immediately incorporated into existing sequencing pipelines to monitor data quality in real time. SeqControl is available at http://labs.oicr.on.ca/Boutros-lab/software/SeqControl/.
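To make the quantities in the abstract concrete, the following is a minimal sketch, not the SeqControl implementation. It assumes per-base read depths are already available as a plain list and shows two generic ideas the abstract relies on: summarizing coverage with a few statistics, and scoring a binary predictor by AUC. The function names, the choice of three summary statistics, and the threshold of 10 are illustrative assumptions; the 15 metrics and the multivariate models used by the authors are not reproduced here.

```python
from statistics import mean, median


def coverage_metrics(depths, threshold=10):
    """Summarize per-base read depths (hypothetical input: one integer per base).

    These three statistics are illustrative only; they are not the
    15 metrics defined by SeqControl.
    """
    n = len(depths)
    return {
        "mean_coverage": mean(depths),
        "median_coverage": median(depths),
        # Fraction of bases covered by at least `threshold` reads.
        "fraction_at_threshold": sum(d >= threshold for d in depths) / n,
    }


def auc(scores, labels):
    """Area under the ROC curve via pairwise ranking.

    Equals the probability that a randomly chosen positive example
    (label 1) receives a higher score than a randomly chosen negative
    example (label 0); ties count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

In this sketch, a label of 1 could stand for "the sample reached its target coverage" and the score for a model's predicted probability of that outcome; an AUC near 1 then means the model ranks successful samples above unsuccessful ones almost every time.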
Acknowledgements
This study was conducted with the support of Movember funds through Prostate Cancer Canada and with the additional support of the Ontario Institute for Cancer Research, funded by the Government of Ontario; with the support of Genome Canada through a Large-Scale Applied Project contract to P.C.B., S. Shah and R. Morin; and with the support of Prostate Cancer Canada, funded by the Movember Foundation, grant #RS2014-01. P.C.B. was supported by a Terry Fox Research Institute New Investigator Award and a Canadian Institutes of Health Research New Investigator Award. The authors thank all members of the Boutros lab for their support and insightful discussions, J. Livingstone and L. Heisler for assistance in data handling, and L. Stein and J. Simpson for critical comments on the manuscript.
Author information
Contributions
P.C.B. and L.C.C. initiated the project. M.F., T.v.d.K., R.G.B., J.D.M. and P.C.B. generated sequencing data. L.C.C., M.A.A., C.C., M.C.-S.-Y., R.d.B., R.E.D. and T.A.B. performed analysis on the sequencing data. L.C.C., M.A.A., C.C., N.J.H. and P.C.B. were responsible for statistical modeling. Research was supervised by R.G.B., J.D.M. and P.C.B. L.C.C. wrote the manuscript, which was edited and approved by all authors.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–21, Supplementary Tables 2, 3, 11 and 15–18, Supplementary Note and Supplementary Results (PDF 9909 kb)
Supplementary Table 1
Information about the CPC-GENE sample data. (XLS 29 kb)
Supplementary Table 4
Metric values calculated for all lane groupings of the oldest ten samples (10 tumour, 9 matched normal). (XLSX 190 kb)
Supplementary Table 5
Metric values calculated for all lane groupings of the newest seventeen samples (tumour and matched normal). (XLSX 54 kb)
Supplementary Table 6
Metric data, true outcome and random forest-predicted outcome for all lane-level BAMs (tumour only). (XLSX 48 kb)
Supplementary Table 7
Metric data, true outcome and random forest-predicted outcome for all lane-level BAMs (normal only). (XLSX 34 kb)
Supplementary Table 8
Metric data, true outcome and random forest-predicted outcome for all half-lane BAMs (tumour only). (XLSX 87 kb)
Supplementary Table 9
Metric data, true outcome and random forest-predicted outcome for all quarter-lane BAMs (tumour only). (XLSX 160 kb)
Supplementary Table 10
Metric data, true outcome and random forest-predicted outcome for all eighth-lane BAMs (tumour only). (XLSX 308 kb)
Supplementary Table 12
Summary of preseq v0.0.1 prediction accuracy. (XLSX 25 kb)
Supplementary Table 13
Summary of preseq v0.1.0 prediction accuracy. (XLSX 25 kb)
Supplementary Table 14
Summary of preseq v1.0.0 prediction accuracy. (XLSX 25 kb)
Cite this article
Chong, L., Albuquerque, M., Harding, N. et al. SeqControl: process control for DNA sequencing. Nat Methods 11, 1071–1075 (2014). https://doi.org/10.1038/nmeth.3094
DOI: https://doi.org/10.1038/nmeth.3094
This article is cited by
- Valection: design optimization for validation and verification studies. BMC Bioinformatics (2018)
- The parameter sensitivity of random forests. BMC Bioinformatics (2016)
- Combining tumor genome simulation with crowdsourcing to benchmark somatic single-nucleotide-variant detection. Nature Methods (2015)