An open resource for accurately benchmarking small variant and reference calls

Zook, Justin M.; McDaniel, Jennifer; Olson, Nathan D.; Wagner, Justin; Parikh, Hemang; Heaton, Haynes; Irvine, Sean A.; Trigg, Len; Truty, Rebecca; McLean, Cory Y.; De La Vega, Francisco M.; Xiao, Chunlin; Sherry, Stephen; Salit, Marc

doi:10.1038/s41587-019-0074-6

Resource
Published: 01 April 2019

Justin M. Zook¹ (ORCID: orcid.org/0000-0003-2309-8402),
Jennifer McDaniel¹,
Nathan D. Olson¹,
Justin Wagner¹,
Hemang Parikh¹,
Haynes Heaton²,³,
Sean A. Irvine⁴,
Len Trigg⁴,
Rebecca Truty⁵,
Cory Y. McLean⁶,⁷,
Francisco M. De La Vega⁸ (ORCID: orcid.org/0000-0002-9228-2097),
Chunlin Xiao⁹,
Stephen Sherry⁹ &
Marc Salit¹,¹⁰,¹¹ (ORCID: orcid.org/0000-0003-1624-5195)

Nature Biotechnology volume 37, pages 561–566 (2019)

Abstract

Benchmark small variant calls are required for developing, optimizing and assessing the performance of sequencing and bioinformatics methods. Here, as part of the Genome in a Bottle (GIAB) Consortium, we apply a reproducible, cloud-based pipeline to integrate multiple short- and linked-read sequencing datasets and provide benchmark calls for human genomes. We generate benchmark calls for one previously analyzed GIAB sample, as well as six genomes from the Personal Genome Project. These new genomes have broad, open consent, making this a ‘first of its kind’ resource that is available to the community for multiple downstream applications. We produce 17% more benchmark single nucleotide variations, 176% more indels and 12% larger benchmark regions than previously published GIAB benchmarks. We demonstrate that this benchmark reliably identifies errors in existing callsets and highlight challenges in interpreting performance metrics when using benchmarks that are not perfect or comprehensive. Finally, we identify strengths and weaknesses of callsets by stratifying performance according to variant type and genome context.

**Fig. 1: Arbitration process used to form our benchmark set from multiple technologies and callsets.**

**Fig. 2: Complex variant discordant between GIAB and Illumina PG.**

Data availability

Raw sequence data were previously published in Scientific Data (https://doi.org/10.1038/sdata.2016.25) and were deposited in the NCBI SRA with the accession codes SRX1049768–SRX1049855, SRX847862–SRX848317, SRX1388368–SRX1388459, SRX1388732–SRX1388743, SRX852932–SRX852936, SRX847094, SRX848742–SRX848744, SRX326642, SRX1497273 and SRX1497276. The 10x Genomics Chromium BAM files used are available at ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/10XGenomics_ChromiumGenome_LongRanger2.0_06202016/. The benchmark VCF and BED files resulting from work in this manuscript are available in the NISTv.3.3.2 directory under each genome on the GIAB FTP release folder ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/ and, in the future, updated calls will be in the ‘recent’ directory under each genome. The data used in this manuscript and other datasets for these genomes are available at ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/, as well as in NCBI BioProject No. PRJNA200694.

Code availability

All code for analyzing genome sequencing data to generate benchmark variants and regions developed for this manuscript is available in a GitHub repository at https://github.com/jzook/genome-data-integration. Publicly available software used to generate input callsets includes novoalign v.3.02.07, samtools v.0.1.18, GATK v.3.5, Freebayes v.0.9.20, Complete Genomics tools v.2.5.0.33, Torrent Variant Caller v.4.4, LifeScope v.2.5.1, LongRanger v.2.0, GenomeWarp, rtg-tools v.3.7.1 and Sentieon v.201611.rc1.

Acknowledgements

We thank the many contributors to GIAB Consortium discussions. We especially thank R. Saldana and the Sentieon team for advice on running the Sentieon pipeline; A. Carroll and the DNAnexus team for advice on implementing the pipeline in DNAnexus; F. Hyland, S. Ghosh, K. Zhao and J. Bodeau at ThermoFisher for advice on integrating Ion exome and SOLiD genome data; D. Church and V. Schneider for helpful discussions about GRCh38; and many individuals for providing feedback on the current version and previous versions of our calls. Selected commercial equipment, instruments or materials are identified to specify the adequacy of experimental conditions or reported results. Such identification does not imply recommendation or endorsement by the National Institute of Standards and Technology, nor does it imply that the equipment, instruments or materials identified are necessarily the best available for the purpose. C.X. and S.S. were supported by the Intramural Research Program of the National Library of Medicine, National Institutes of Health. J.Z., M.S., N.O. and J.W. were supported by the National Institute of Standards and Technology and an interagency agreement with the Food and Drug Administration.

Author information

Authors and Affiliations

Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
Justin M. Zook, Jennifer McDaniel, Nathan D. Olson, Justin Wagner, Hemang Parikh & Marc Salit
10x Genomics, Pleasanton, CA, USA
Haynes Heaton
Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK
Haynes Heaton
Real Time Genomics, Hamilton, New Zealand
Sean A. Irvine & Len Trigg
Invitae Corporation, San Francisco, CA, USA
Rebecca Truty
Verily Life Sciences, South San Francisco, CA, USA
Cory Y. McLean
Google Inc., Mountain View, CA, USA
Cory Y. McLean
Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, USA
Francisco M. De La Vega
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
Chunlin Xiao & Stephen Sherry
Joint Initiative for Metrology in Biology, Stanford, CA, USA
Marc Salit
Department of Bioengineering, Stanford University, Stanford, CA, USA
Marc Salit

Contributions

J.M.Z., L.T., N.D.O., J.W. and M.S. wrote the manuscript. J.M.Z., J.M., F.M.D., N.D.O., J.W., M.S. and H.P. designed and implemented the integration process. H.H., J.M. and J.M.Z. analyzed and integrated the 10x Genomics data. R.T., J.M. and J.M.Z. analyzed and integrated the Complete Genomics data. S.A.I., L.T., F.M.D., J.M. and J.M.Z. designed and implemented the phasing and robust trio analysis. C.Y.M., J.M. and J.M.Z. designed and implemented the robust GRCh38 liftover analysis. C.X. and S.S. managed and analyzed data. All authors contributed to GIAB discussions planning this work.

Corresponding author

Correspondence to Justin M. Zook.

Ethics declarations

Competing interests

R.T. is an employee of, and holds stock in, Invitae. H.H. was an employee of 10x Genomics. S.A.I. and L.T. are employees of Real Time Genomics. C.Y.M. is an employee of Verily Life Sciences and Google.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Fraction of each chromosome covered by benchmark regions for each genome.

Fraction of the assembled (i.e., non-N) bases in GRCh37 that are covered by the benchmark regions for each genome (HG001 to HG007), separated by chromosome.
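The per-chromosome fraction described here can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the figure: it assumes the benchmark regions are given as 0-based, half-open BED-style intervals, that the count of non-N bases per chromosome is already known, and that the regions lie entirely within non-N sequence. The function names and the toy numbers are hypothetical.

```python
from collections import defaultdict


def merge_intervals(intervals):
    """Merge overlapping or adjacent half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def covered_fraction(bed_records, non_n_bases):
    """Fraction of non-N bases per chromosome covered by the BED records.

    bed_records: iterable of (chrom, start, end) tuples.
    non_n_bases: dict mapping chrom -> number of assembled (non-N) bases.
    Assumes every interval falls inside non-N sequence.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in bed_records:
        by_chrom[chrom].append((start, end))
    fractions = {}
    for chrom, total in non_n_bases.items():
        merged = merge_intervals(by_chrom.get(chrom, []))
        covered = sum(end - start for start, end in merged)
        fractions[chrom] = covered / total
    return fractions


# Toy data (hypothetical): two overlapping regions on "1", one region on "2".
toy_bed = [("1", 0, 50), ("1", 40, 80), ("2", 10, 20)]
toy_non_n = {"1": 100, "2": 40}
fractions = covered_fraction(toy_bed, toy_non_n)
print(fractions)
```

Merging before summing matters because overlapping intervals would otherwise be counted twice and inflate the fraction.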

Supplementary Figure 2 Overall flow of execution of the code used to integrate VCF and BED files from each method and form benchmark VCF and BED files.

Diagram of the input files (light orange boxes) and output files (dark orange boxes) of each script (blue boxes) used to integrate callsets from each method and form the benchmark set.

Supplementary Figure 3 Preprocessing and merging of VCF and BED files from each input callset.

The Callset Table gives metadata about each input callset, including which difficult regions to exclude from each callset’s callable bed file. This table is used to generate callable bed files for each callset and form a merged vcf that includes the genotype from each callset and annotations that indicate whether it falls in each callset’s callable bed file.
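As a rough sketch of the idea in this caption, the snippet below derives a callable region set for each callset by subtracting the difficult regions named in a callset table, then annotates one position with whether it is callable in each callset. The table layout, the region names and the coordinates are hypothetical stand-ins, and intervals are assumed to be 0-based and half-open on a single chromosome.

```python
def subtract_intervals(regions, excluded):
    """Return the parts of `regions` that are not covered by `excluded`.

    Both arguments are lists of half-open (start, end) intervals.
    """
    result = []
    excluded = sorted(excluded)
    for start, end in sorted(regions):
        cursor = start
        for ex_start, ex_end in excluded:
            if ex_end <= cursor:
                continue  # excluded interval lies before the current position
            if ex_start >= end:
                break  # excluded interval lies after this region
            if ex_start > cursor:
                result.append((cursor, ex_start))
            cursor = max(cursor, ex_end)
            if cursor >= end:
                break
        if cursor < end:
            result.append((cursor, end))
    return result


def in_regions(pos, regions):
    """True if the 0-based position falls inside any half-open interval."""
    return any(start <= pos < end for start, end in regions)


# Hypothetical difficult-region sets and callset table.
difficult = {"homopolymers": [(20, 30)], "segdups": [(90, 210)]}
callset_table = {
    "callset_A": {"covered": [(0, 100), (200, 300)],
                  "exclude": ["homopolymers", "segdups"]},
    "callset_B": {"covered": [(0, 300)], "exclude": ["segdups"]},
}

callable_by_callset = {}
for name, entry in callset_table.items():
    excluded = [iv for key in entry["exclude"] for iv in difficult[key]]
    callable_by_callset[name] = subtract_intervals(entry["covered"], excluded)

# Annotate a variant at position 25 with per-callset callability.
annotation = {name: in_regions(25, regions)
              for name, regions in callable_by_callset.items()}
print(callable_by_callset)
print(annotation)
```

In this toy table, position 25 sits in a homopolymer that only callset_A excludes, so the same site is callable for callset_B but not for callset_A.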

Supplementary Figure 4 Processing union VCF to arbitrate between callsets and form benchmark VCF.

Process used to determine if a consensus genotype call can be made from all trusted input callsets for each line in the union VCF. In the first iteration, each callset’s callable regions are used to determine if a callset can be trusted, and calls where all trusted callsets agree and at least two different platforms support the call are used to train the one-class filtering model in Supplementary Figure 5. In the second iteration, each callset’s callable regions are again used to determine if a callset can be trusted, but filtered calls are also excluded. To be included in the benchmark set, all trusted callsets must have the same genotype, and support from only one platform is needed.
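The two-pass rule in this caption can be illustrated with a short Python sketch. It is a simplification under stated assumptions, not the pipeline's code: each callset contributes one genotype string per site, callability and filter status are precomputed booleans, and the class, function, platform labels and genotypes below are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class CallsetCall:
    platform: str           # sequencing platform behind the callset
    genotype: str           # genotype string, for example "0/1"
    in_callable: bool       # site falls in this callset's callable regions
    filtered: bool = False  # flagged as an outlier by the one-class model


def consensus_genotype(calls, min_platforms, use_filters):
    """Return the agreed genotype, or None if no consensus can be made."""
    trusted = [c for c in calls
               if c.in_callable and not (use_filters and c.filtered)]
    genotypes = {c.genotype for c in trusted}
    platforms = {c.platform for c in trusted}
    if len(genotypes) == 1 and len(platforms) >= min_platforms:
        return next(iter(genotypes))
    return None


# Hypothetical site: two callable callsets agree, a third is not callable.
site = [
    CallsetCall("Illumina", "0/1", in_callable=True),
    CallsetCall("10x", "0/1", in_callable=True, filtered=True),
    CallsetCall("Ion", "1/1", in_callable=False),
]

# First iteration: filters ignored, two platforms required (training sites).
first_pass = consensus_genotype(site, min_platforms=2, use_filters=False)
# Second iteration: filtered calls excluded, one platform is enough.
second_pass = consensus_genotype(site, min_platforms=1, use_filters=True)

# Two trusted callsets that disagree yield no benchmark call.
discordant = [
    CallsetCall("Illumina", "0/1", in_callable=True),
    CallsetCall("Complete Genomics", "1/1", in_callable=True),
]
no_call = consensus_genotype(discordant, min_platforms=1, use_filters=True)
print(first_pass, second_pass, no_call)
```

The stricter two-platform requirement applies only when selecting training sites; the final pass relaxes it to one platform but tightens trust by dropping filtered calls.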

Supplementary Figure 5 One-class model used to filter calls from each input callset that have outlier annotations.

To determine whether a call from each input callset can be trusted, we use a simple one-class model that finds calls from each callset that have outlier values for any of the user-specified annotations. For the training set, we use the sites from each input callset that agree with the consensus calls supported by at least two technologies (found in the first iteration of the process in Supplementary Figure 4). The filtered bed files from each callset are used to annotate the union VCF used in the second and final iteration of Supplementary Figure 4.
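One simple way to picture such a one-class filter is to learn a lower and upper bound for each annotation from the training sites and flag any call that falls outside a bound for any annotation. The sketch below does this with percentile cut-offs. The 5th and 95th percentile thresholds, the annotation names `DP` and `MQ`, and the toy values are assumptions made for illustration; the caption does not specify how outliers are defined.

```python
import statistics


def fit_bounds(training_calls, annotations, n=20):
    """Learn (low, high) bounds per annotation from trusted training calls.

    With n=20 the first and last cut points are the 5th and 95th
    percentiles (an assumed choice of threshold).
    """
    bounds = {}
    for name in annotations:
        values = [call[name] for call in training_calls]
        cuts = statistics.quantiles(values, n=n, method="inclusive")
        bounds[name] = (cuts[0], cuts[-1])
    return bounds


def is_outlier(call, bounds):
    """True if any annotation lies outside its learned bounds."""
    return any(not (low <= call[name] <= high)
               for name, (low, high) in bounds.items())


# Hypothetical training sites: calls agreeing with two-technology consensus.
training = [{"DP": 20 + i, "MQ": 60 - (i % 3)} for i in range(21)]
bounds = fit_bounds(training, ["DP", "MQ"])

typical_call = {"DP": 30, "MQ": 60}
high_depth_call = {"DP": 120, "MQ": 60}       # depth far above training range
low_mapq_call = {"DP": 30, "MQ": 20}          # mapping quality far below range

flags = [is_outlier(c, bounds)
         for c in (typical_call, high_depth_call, low_mapq_call)]
print(bounds)
print(flags)
```

Because only trusted sites are used for training, the model needs no examples of errors; it only describes what a typical trusted call looks like for each callset.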

Supplementary information

Supplementary Information

Supplementary Figures 1–5, Supplementary Tables 1–5 and Supplementary Notes 1–10

Reporting Summary

Supplementary Data 1

Detailed manual curation results for discordant sites

About this article

Cite this article

Zook, J.M., McDaniel, J., Olson, N.D. et al. An open resource for accurately benchmarking small variant and reference calls. Nat Biotechnol 37, 561–566 (2019). https://doi.org/10.1038/s41587-019-0074-6

Received: 25 May 2018
Accepted: 19 February 2019
Published: 01 April 2019
Issue Date: May 2019
DOI: https://doi.org/10.1038/s41587-019-0074-6
