Abstract
Tandem repeats (TRs) are highly polymorphic in the human genome, have thousands of associated molecular traits and are linked to over 60 disease phenotypes. However, they are often excluded from at-scale studies because of challenges with variant calling and representation, as well as a lack of a genome-wide standard. Here, to promote the development of TR methods, we created a catalog of TR regions and explored TR properties across 86 haplotype-resolved long-read human assemblies. We curated variants from the Genome in a Bottle (GIAB) HG002 individual to create a TR dataset to benchmark existing and future TR analysis methods. We also present an improved variant comparison method that handles variants greater than 4 bp in length and varying allelic representation. The 8.1% of the genome covered by the TR catalog holds ~24.9% of variants per individual, including 124,728 small and 17,988 large variants for the GIAB HG002 ‘truth-set’ TR benchmark. We demonstrate the utility of this pipeline across short-read and long-read technologies.
Data availability
The TR catalog (version 1.2) can be found at https://zenodo.org/records/8387564 (ref. 74). Supplementary Table 4 holds the paths to the input assemblies used to create the pVCF. The pVCF can be found at https://zenodo.org/records/6975244 (ref. 76). The TandemRepeat benchmark is hosted at https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/TandemRepeats_v1.0 (ref. 83). Comparison VCFs from TR callers HipSTR, GangSTR, Medaka and TRGT and whole-genome VCFs from DeepVariant, BioGraph and Sniffles are available at https://zenodo.org/records/10724503 (ref. 84).
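As a rough illustration of how a coverage figure such as the 8.1% of the genome reported for the TR catalog could be checked, the sketch below merges overlapping regions per chromosome and divides the covered bases by the total genome length. It assumes the catalog is available as tab-separated, BED-style lines (chromosome, 0-based start, end); the function names, the toy regions and the toy chromosome sizes are hypothetical and are not taken from the published catalog or from the adotto code.

```python
from collections import defaultdict


def merged_length(intervals):
    """Return the number of bases covered by half-open (start, end) intervals,
    counting overlapping or adjacent intervals only once."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Close the previous merged block (if any) and open a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current merged block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def covered_fraction(bed_lines, chrom_sizes):
    """Fraction of the genome (sum of chrom_sizes) covered by BED-style regions.
    Regions on chromosomes missing from chrom_sizes are ignored."""
    by_chrom = defaultdict(list)
    for line in bed_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, start, end = line.rstrip("\n").split("\t")[:3]
        by_chrom[chrom].append((int(start), int(end)))
    covered = sum(
        merged_length(regions)
        for chrom, regions in by_chrom.items()
        if chrom in chrom_sizes
    )
    return covered / sum(chrom_sizes.values())


# Toy input (hypothetical values, not real catalog entries).
toy_bed = [
    "chr1\t100\t200\n",
    "chr1\t150\t250\n",  # overlaps the previous region
    "chr2\t0\t50\n",
]
toy_sizes = {"chr1": 1000, "chr2": 1000}
toy_fraction = covered_fraction(toy_bed, toy_sizes)
print(f"covered fraction: {toy_fraction:.3f}")
```

Merging before summing matters because catalog regions built from several annotation sources can overlap, and counting the overlap twice would inflate the covered fraction.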
Code availability
All code created for this project is available under an open-source license. Analysis scripts for this project are hosted at https://github.com/ACEnglish/adotto/ (ref. 85). Truvari can be found at https://github.com/ACEnglish/truvari/ (ref. 86). Laytr can be found at https://github.com/ACEnglish/laytr/ (ref. 87). A lightweight version of the TR catalog creation process is available as a Snakemake pipeline at https://github.com/nate-d-olson/adotto-smk (ref. 88). The overlap permutation tool regioners can be downloaded from https://github.com/ACEnglish/regioners (ref. 89).
References
Levinson, G. & Gutman, G. A. Slipped-strand mispairing: a major mechanism for DNA sequence evolution. Mol. Biol. Evol. 4, 203–221 (1987).
Fan, H. & Chu, J.-Y. A brief review of short tandem repeat mutation. Genom. Proteom. Bioinform. 5, 7–14 (2007).
Shriver, M. D., Jin, L., Chakraborty, R. & Boerwinkle, E. VNTR allele frequency distributions under the stepwise mutation model: a computer simulation approach. Genetics 134, 983–993 (1993).
Wright, J. M. Mutation at VNTRs: are minisatellites the evolutionary progeny of microsatellites? Genome 37, 345–347 (1994).
Willems, T. et al. The landscape of human STR variation. Genome Res. 24, 1894–1904 (2014).
Ren, J., Gu, B. & Chaisson, M. J. P. vamos: variable-number tandem repeats annotation using efficient motif sets. Genome Biol. 24, 175 (2023).
Noyes, M. D. et al. Familial long-read sequencing increases yield of de novo mutations. Am. J. Hum. Genet. 109, 631–646 (2022).
DeJesus-Hernandez, M. et al. Expanded GGGGCC hexanucleotide repeat in noncoding region of C9ORF72 causes chromosome 9p-linked FTD and ALS. Neuron 72, 245–256 (2011).
Depienne, C. & Mandel, J.-L. 30 years of repeat expansion disorders: what have we learned and what are the remaining challenges? Am. J. Hum. Genet. 108, 764–785 (2021).
Mirceta, M., Shum, N., Schmidt, M. H. M. & Pearson, C. E. Fragile sites, chromosomal lesions, tandem repeats, and disease. Front. Genet. 13, 985975 (2022).
Hannan, A. J. Repeat DNA expands our understanding of autism spectrum disorder. Nature 589, 200–202 (2021).
Hannan, A. J. Tandem repeats mediating genetic plasticity in health and disease. Nat. Rev. Genet. 19, 286–298 (2018).
Stanley, U. et al. Forensic DNA profiling: autosomal short tandem repeat as a prominent marker in crime investigation. Malays. J. Med. Sci. 27, 22–35 (2020).
Hall, C. L. et al. Accurate profiling of forensic autosomal STRs using the Oxford Nanopore Technologies MinION device. Forensic Sci. Int. Genet. 56, 102629 (2022).
Warner, J. P. et al. A general method for the detection of large CAG repeat expansions by fluorescent PCR. J. Med. Genet. 33, 1022–1026 (1996).
Jeffreys, A. J., Wilson, V. & Thein, S. L. Hypervariable ‘minisatellite’ regions in human DNA. Nature 314, 67–73 (1985).
Dolzhenko, E. et al. ExpansionHunter: a sequence-graph based tool to analyze variation in short tandem repeat regions. Bioinformatics 35, 4754–4756 (2019).
Willems, T. et al. Genome-wide profiling of heritable and de novo STR variations. Nat. Methods 14, 590–592 (2017).
Mousavi, N., Shleizer-Burko, S., Yanicky, R. & Gymrek, M. Profiling the genome-wide landscape of tandem repeat expansions. Nucleic Acids Res. 47, e90 (2019).
Dolzhenko, E. et al. Characterization and visualization of tandem repeats at genome scale. Nat. Biotechnol. https://doi.org/10.1038/s41587-023-02057-3 (2024).
Chiu, R., Rajan-Babu, I.-S., Friedman, J. M. & Birol, I. Straglr: discovering and genotyping tandem repeat expansions using whole genome long-read sequences. Genome Biol. 22, 224 (2021).
Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022).
Aganezov, S. et al. A complete reference genome improves analysis of human genetic variation. Science 376, eabl3533 (2022).
Rhie, A. et al. The complete sequence of a human Y chromosome. Nature 621, 344–354 (2023).
Olson, N. D. et al. Variant calling and benchmarking in an era of complete human genome sequences. Nat. Rev. Genet. 24, 464–483 (2023).
Majidian, S., Agustinho, D. P., Chin, C.-S., Sedlazeck, F. J. & Mahmoud, M. Genomic variant benchmark: if you cannot measure it, you cannot improve it. Genome Biol. 24, 221 (2023).
Wagner, J. et al. Benchmarking challenging small variants with linked and long reads. Cell Genom. 2, 100128 (2022).
Zook, J. M. et al. A robust benchmark for detection of germline large deletions and insertions. Nat. Biotechnol. 38, 1347–1355 (2020).
Wagner, J. et al. Curated variation benchmarks for challenging medically relevant autosomal genes. Nat. Biotechnol. 40, 672–680 (2022).
English, A. C., Menon, V. K., Gibbs, R. A., Metcalf, G. A. & Sedlazeck, F. J. Truvari: refined structural variant comparison preserves allelic diversity. Genome Biol. 23, 271 (2022).
Yang, J. & Chaisson, M. J. P. TT-Mars: structural variants assessment based on haplotype-resolved assemblies. Genome Biol. 23, 110 (2022).
Audano, P. A. & Beck, C. R. Small polymorphisms are a source of ancestral bias in structural variant breakpoint placement. Genome Res. 34, 7–19 (2024).
Fu, Y., Mahmoud, M., Muraliraman, V. V., Sedlazeck, F. J. & Treangen, T. J. Vulcan: improved long-read mapping and structural variant calling via dual-mode alignment. GigaScience 10, giab063 (2021).
Gelfand, Y., Rodriguez, A. & Benson, G. TRDB—the Tandem Repeats Database. Nucleic Acids Res. 35, D80–D87 (2007).
Halman, A., Dolzhenko, E. & Oshlack, A. STRipy: a graphical application for enhanced genotyping of pathogenic short tandem repeats in sequencing data. Hum. Mutat. 43, 859–868 (2022).
Kent, W. J. et al. The human genome browser at UCSC. Genome Res. 12, 996–1006 (2002).
Saini, S., Mitra, I., Mousavi, N., Fotsing, S. F. & Gymrek, M. A reference haplotype panel for genome-wide imputation of short tandem repeats. Nat. Commun. 9, 4397 (2018).
Benson, G. Tandem Repeats Finder: a program to analyze DNA sequences. Nucleic Acids Res. 27, 573–580 (1999).
Smit, A., Hubley, R. & Green, P. RepeatMasker. http://www.repeatmasker.org (2013).
Wlodzimierz, P., Hong, M. & Henderson, I. R. TRASH: tandem repeat annotation and structural hierarchy. Bioinformatics 39, btad308 (2023).
Novák, P., Neumann, P. & Macas, J. Global analysis of repetitive DNA from unassembled sequence reads using RepeatExplorer2. Nat. Protoc. 15, 3745–3776 (2020).
Delucchi, M., Näf, P., Bliven, S. & Anisimova, M. TRAL 2.0: tandem repeat detection with circular profile hidden Markov models and evolutionary aligner. Front. Bioinform. 1, 691865 (2021).
El-Sawy, M. & Deininger, P. Tandem insertions of Alu elements. Cytogenet. Genome Res. 108, 58–62 (2004).
Moretti, T. R. et al. Population data on the expanded CODIS core STR loci for eleven populations of significance for forensic DNA analyses in the United States. Forensic Sci. Int. Genet. 25, 175–181 (2016).
Collins, R. L. et al. A structural variation reference for medical and population genetics. Nature 581, 444–451 (2020).
Stevanovski, I. et al. Comprehensive genetic diagnosis of tandem repeat expansion disorders with programmable targeted nanopore sequencing. Sci. Adv. 8, eabm5386 (2022).
Pellerin, D. et al. Deep intronic FGF14 GAA repeat expansion in late-onset cerebellar ataxia. N. Engl. J. Med. 388, 128–141 (2022).
Tan, D. et al. CAG repeat expansion in THAP11 is associated with a novel spinocerebellar ataxia. Mov. Disord. 38, 1282–1293 (2023).
Mukamel, R. E. et al. Protein-coding repeat polymorphisms strongly shape diverse human phenotypes. Science 373, 1499–1505 (2021).
Liu, Z. et al. Inconsistent genotyping call at DYS389 locus and implications for interpretation. Int. J. Legal Med. 132, 1043–1048 (2018).
White, P. S., Tatum, O. L., Deaven, L. L. & Longmire, J. L. New, male-specific microsatellite markers from the human Y chromosome. Genomics 57, 433–437 (1999).
Vinces, M. D., Legendre, M., Caldara, M., Hagihara, M. & Verstrepen, K. J. Unstable tandem repeats in promoters confer transcriptional evolvability. Science 324, 1213–1216 (2009).
Sulovari, A. et al. Human-specific tandem repeat expansion and differential gene expression during primate evolution. Proc. Natl Acad. Sci. USA 116, 23243–23253 (2019).
Annear, D. J. et al. Abundancy of polymorphic CGG repeats in the human genome suggest a broad involvement in neurological disease. Sci. Rep. 11, 2515 (2021).
Liao, W.-W. et al. A draft human pangenome reference. Nature 617, 312–324 (2023).
Ebert, P. et al. Haplotype-resolved diverse human genomes and integrated analysis of structural variation. Science 372, eabf7117 (2021).
Garg, S. et al. Chromosome-scale, haplotype-resolved assembly of human genomes. Nat. Biotechnol. 39, 309–312 (2021).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).
Jarvis, E. D. et al. Semi-automated assembly of high-quality diploid human reference genomes. Nature 611, 519–531 (2022).
Dunn, T. & Narayanasamy, S. vcfdist: accurately benchmarking phased small variant calls in human genomes. Nat. Commun. 14, 8149 (2023).
Cleary, J. G. et al. Comparing variant call files for performance benchmarking of next-generation sequencing variant calling pipelines. Preprint at bioRxiv https://doi.org/10.1101/023754 (2015).
Tan, A., Abecasis, G. R. & Kang, H. M. Unified representation of genetic variants. Bioinformatics 31, 2202–2204 (2015).
Marco-Sola, S., Moure, J. C., Moreto, M. & Espinosa, A. Fast gap-affine pairwise alignment using the wavefront algorithm. Bioinformatics 37, btaa777 (2020).
Sedlazeck, F. J. et al. Accurate detection of complex structural variations using single-molecule sequencing. Nat. Methods 15, 461–468 (2018).
Park, J., Kaufman, E., Valdmanis, P. N. & Bafna, V. TRviz: a Python library for decomposing and visualizing tandem repeat sequences. Bioinform. Adv. 3, vbad058 (2023).
Krause, A. et al. Junctophilin 3 (JPH3) expansion mutations causing Huntington disease like 2 (HDL2) are common in South African patients with African ancestry and a Huntington disease phenotype. Am. J. Med. Genet. B 168, 573–585 (2015).
Wieben, E. D. et al. A common trinucleotide repeat expansion within the transcription factor 4 (TCF4, E2-2) gene predicts Fuchs corneal dystrophy. PLoS ONE 7, e49083 (2012).
Jam, H. Z. et al. A deep population reference panel of tandem repeat variation. Nat. Commun. 14, 6711 (2023).
Bakhtiari, M., Shleizer-Burko, S., Gymrek, M., Bansal, V. & Bafna, V. Targeted genotyping of variable number tandem repeats with adVNTR. Genome Res. 28, 1709–1719 (2018).
Sonay, T. B. et al. Tandem repeat variation in human and great ape populations and its impact on gene expression divergence. Genome Res. 25, 1591–1599 (2015).
Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
Howe, K. L. et al. Ensembl 2021. Nucleic Acids Res. 49, D884–D891 (2020).
English, A. Project Adotto tandem-repeat regions and annotations. Zenodo https://doi.org/10.5281/zenodo.8387564 (2022).
Danecek, P. et al. Twelve years of SAMtools and BCFtools. GigaScience 10, giab008 (2021).
English, A. Project Adotto whole-genome variants. Zenodo https://doi.org/10.5281/zenodo.6975244 (2022).
Li, H. et al. A synthetic-diploid benchmark for accurate variant-calling evaluation. Nat. Methods 15, 595–597 (2018).
Chin, C.-S. et al. A diploid assembly-based benchmark for variants in the major histocompatibility complex. Nat. Commun. 11, 4794 (2020).
Wootton, J. C. & Federhen, S. Statistics of local complexity in amino acid sequences and sequence databases. Comput. Chem. 17, 149–163 (1993).
Šošić, M. & Šikić, M. Edlib: a C/C++ library for fast, exact sequence alignment using edit distance. Bioinformatics 33, btw753 (2016).
Bonfield, J. K. et al. HTSlib: C library for reading/writing high-throughput sequencing data. GigaScience 10, giab007 (2021).
Katoh, K. & Standley, D. M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30, 772–780 (2013).
English, A. et al. GIAB TandemRepeats benchmark v1.0. https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/TandemRepeats_v1.0 (2023).
English, A. et al. GIAB TR comparison VCFs. Zenodo https://doi.org/10.5281/zenodo.10724503 (2024).
English, A. et al. Working space for the GIAB TR benchmarking project. GitHub https://github.com/ACEnglish/adotto (2023).
English, A. Structural variant toolkit for VCFs. GitHub https://github.com/ACEnglish/truvari (2023).
English, A. et al. Library for variant benchmarking stratification. GitHub https://github.com/ACEnglish/laytr (2023).
Olson, N. A snakemake based pipeline to build Adotto TR databases. GitHub https://github.com/nate-d-olson/adotto-smk (2023).
English, A. A rust implementation of regioneR for interval overlap permutation testing. GitHub https://github.com/ACEnglish/regioners (2023).
Acknowledgements
We would like to thank the GIAB community for constant support. We thank J. McDaniel for very helpful comments on the paper, M. Wykes and S. Nurk for assistance in processing Medaka results and V. Bafna for contributions to the TR catalog. A.C.E. and F.J.S. were supported by HHSN268201800002I, U01AG058589, 1U01HG011758-01 and 1UG3NS132105-01. H.Z.J. was supported by NIH/NHGRI R01HG010149. M.J.P.C. and B.G. were supported by R01HG011649 and 5U24HG007497, respectively. J.P. was supported in part by HG010149. Certain commercial equipment, instruments or materials are identified to adequately specify the experimental conditions or reported results. Such identification does not imply recommendation or endorsement by the National Institute of Standards and Technology, nor does it imply that the equipment, instruments or materials identified are necessarily the best available for the purpose.
Author information
Contributions
A.C.E. performed data analysis and software development. E.D., H.Z.J., N.D.O., S.K.M., J.P., B.G., J.W., M.G. and M.J.P.C. contributed to testing and data processing. A.C.E., J.M.Z. and F.J.S. designed the study. A.C.E., E.D., H.Z.J., N.D.O., S.K.M., J.P., W.D.C., M.A.E., B.G., J.W., M.G., M.J.P.C., J.M.Z. and F.J.S. reviewed and edited the paper.
Ethics declarations
Competing interests
F.J.S. receives research support from Illumina, Genentech, PacBio and ONT. E.D. and M.A.E. are employees and shareholders of PacBio. S.K.M. is an employee and shareholder of ONT. W.D.C. has received free consumables from ONT. The other authors declare no competing interests.
Peer review
Peer review information
Nature Biotechnology thanks the anonymous reviewer(s) for their contribution to the peer review of this work.
Supplementary information
Supplementary Information
Supplementary Figs. 1–7, Methods and Tables 1–3, 6 and 8–13.
Supplementary Tables
Supplementary Tables 4 (assembly sources), 5 (assembly statistics), 7 (replicate tiers) and 14 (pathogenic and phenotypic TRs).
Supplementary Material 1
Laytr HTML report for TRGT.
Supplementary Material 2
Laytr HTML report for Sniffles.
About this article
Cite this article
English, A.C., Dolzhenko, E., Ziaei Jam, H. et al. Analysis and benchmarking of small and large genomic variants across tandem repeats. Nat Biotechnol (2024). https://doi.org/10.1038/s41587-024-02225-z
DOI: https://doi.org/10.1038/s41587-024-02225-z