  • Brief Communication

Accelerating minimap2 for long-read sequencing applications on modern CPUs

A preprint version of the article is available at bioRxiv.

Abstract

Long-read sequencing is now routinely used at scale for genomics and transcriptomics applications. Mapping long reads or a draft genome assembly to a reference sequence is often one of the most time-consuming steps in these applications. Here we present techniques to accelerate minimap2, a widely used software tool for this task. We describe multiple optimizations that use single-instruction multiple-data parallelization, efficient cache utilization and a learned index data structure to accelerate the three main computational modules of minimap2: seeding, chaining and pairwise sequence alignment. These optimizations reduce the end-to-end mapping time of minimap2 by up to 1.8-fold while maintaining identical output.

Fig. 1: Work distribution for three modules.
Fig. 2: Performance comparison of minimap2 and mm2-fast on a single socket Cascade Lake CPU (28 cores) for full datasets.

Data availability

Datasets used for benchmarking are publicly available (Supplementary Table 2). The human reference genome is available at https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/references/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta.gz. All ONT and PacBio HiFi datasets (HG002, HG003, HG004) used are available at https://precision.fda.gov/challenges/10/view. Datasets for PacBio CLR (HG002, HG003, HG004) are available at https://github.com/genome-in-a-bottle/giab_data_indexes. Genome assemblies are available as follows: CHM13 at NCBI (GCA_009914755.3); HG002 (hap1) and HG002 (hap2) at ref. 33. The speedup shown in the paper can also be realized with a smaller subset of the above datasets. Source Data are provided with this paper.

Code availability

The mm2-fast source code is available under the open source MIT license at https://github.com/bwa-mem2/mm2-fast. The particular version of mm2-fast used in this manuscript is publicly available at ref. 34. The scripts used for the experiments in the manuscript are available at ref. 35.

Optimization Notice: Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance. Intel, Xeon, and Intel Xeon Phi are trademarks of Intel Corporation in the US and/or other countries.

References

  1. Chaisson, M. J. et al. Multi-platform discovery of haplotype-resolved structural variation in human genomes. Nat. Commun. 10, 1–16 (2019).

  2. Conesa, A. et al. A survey of best practices for RNA-seq data analysis. Genome Biol. 17, 1–19 (2016).

  3. Beyter, D. et al. Long-read sequencing of 3,622 Icelanders provides insight into the role of structural variants in human diseases and other traits. Nat. Genet. 53, 779–786 (2021).

  4. Rhie, A. et al. Towards complete and error-free genome assemblies of all vertebrate species. Nature 592, 737–746 (2021).

  5. De Coster, W., Weissensteiner, M. H. & Sedlazeck, F. J. Towards population-scale long-read sequencing. Nat. Rev. Genet. 22, 572–587 (2021).

  6. PromethION Brochure (Oxford Nanopore Technologies, 2021); https://nanoporetech.com/sites/default/files/s3/literature/PromethION-brochure.pdf

  7. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).

  8. Guo, L., Lau, J., Ruan, Z., Wei, P. & Cong, J. Hardware acceleration of long read pairwise overlapping in genome sequencing: a race between FPGA and GPU. In 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines 127–135 (IEEE, 2019).

  9. Zeni, A. et al. LOGAN: high-performance GPU-based X-drop long-read alignment. In 2020 IEEE International Parallel and Distributed Processing Symposium 462–471 (IEEE, 2020).

  10. Feng, Z., Qiu, S., Wang, L. & Luo, Q. Accelerating long read alignment on three processors. In Proc. 48th International Conference on Parallel Processing 1–10 (ACM, 2019).

  11. Roberts, M., Hayes, W., Hunt, B. R., Mount, S. M. & Yorke, J. A. Reducing storage requirements for biological sequence comparison. Bioinformatics 20, 3363–3369 (2004).

  12. Abouelhoda, M. I. & Ohlebusch, E. Chaining algorithms for multiple genome comparison. J. Discrete Algorithms 3, 321–341 (2005).

  13. Jain, C., Gibney, D. & Thankachan, S. V. Co-linear chaining with overlaps and gap costs. Preprint at https://www.biorxiv.org/content/10.1101/2021.02.03.429492v2 (2021).

  14. Ho, D. et al. LISA: learned indexes for DNA sequence analysis. Preprint at https://arxiv.org/abs/1910.04728 (2020).

  15. Schneider, V. A. et al. Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Res. 27, 849–864 (2017).

  16. Nurk, S., Koren, S., Rhie, A., Rautiainen, M. et al. The complete sequence of a human genome. Preprint at https://doi.org/10.1101/2021.05.26.445798 (2021).

  17. Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).

  18. Payne, A. et al. Readfish enables targeted nanopore sequencing of gigabase-sized genomes. Nat. Biotechnol. 39, 442–450 (2021).

  19. Kovaka, S., Fan, Y., Ni, B., Timp, W. & Schatz, M. C. Targeted nanopore sequencing by real-time mapping of raw electrical signal with uncalled. Nat. Biotechnol. 39, 431–441 (2021).

  20. Zhang, H. et al. Real-time mapping of nanopore raw signals. Bioinformatics https://doi.org/10.1093/bioinformatics/btab264 (2021).

  21. Jain, C., Rhie, A., Hansen, N., Koren, S. & Phillippy, A. M. A long read mapping method for highly repetitive reference sequences. Preprint at https://www.biorxiv.org/content/10.1101/2020.11.01.363887v1.full (2020).

  22. Sedlazeck, F. J. et al. Accurate detection of complex structural variations using single-molecule sequencing. Nat. Methods 15, 461–468 (2018).

  23. Ren, J. & Chaisson, M. lra: a long read aligner for sequences and contigs. PLoS Comput. Biol. 17, e1009078 (2021); https://doi.org/10.1371/journal.pcbi.1009078

  24. Kraska, T., Beutel, A., Chi, E. H., Dean, J. & Polyzotis, N. The case for learned index structures. In ACM International Conference on Management of Data 489–504 (ACM, 2018).

  25. Galakatos, A., Markovitch, M., Binnig, C., Fonseca, R. & Kraska, T. FITing-Tree: a data-aware index structure. In SIGMOD ’19: Proceedings of the 2019 International Conference on Management of Data 1189–1206 (ACM, 2019); https://doi.org/10.1145/3299869.3319860

  26. Ferragina, P. & Vinciguerra, G. The PGM-index: a fully-dynamic compressed learned index with provable worst-case bounds. PVLDB 13, 1162–1175 (2020).

  27. Ding, J. et al. ALEX: an updatable adaptive learned index. In SIGMOD ’20: Proceedings of the 2020 International Conference on Management of Data 969–984 (ACM, 2020); https://doi.org/10.1145/3318464.3389711

  28. Wu, Y., Yu, J., Tian, Y., Sidle, R. & Barber, R. Designing succinct secondary indexing mechanism by exploiting column correlations. In SIGMOD ’19: Proceedings of the 2019 International Conference on Management of Data 1223–1240 (ACM, 2019); https://doi.org/10.1145/3299869.3319861

  29. Kirsche, M., Das, A. & Schatz, M. C. Sapling: accelerating suffix array queries with learned data models. Bioinformatics 37, 744–749 (2021).

  30. Marcus, R. et al. Benchmarking learned indexes. In PVLDB Vol. 14, 1–13 (2021).

  31. Marcus, R., Zhang, E. & Kraska, T. CDFShop: exploring and optimizing learned index structures. In SIGMOD ’20: Proc. 2020 ACM SIGMOD International Conference on Management of Data 2789–2792 (ACM, 2020); https://doi.org/10.1145/3318464.3384706

  32. Suzuki, H. & Kasahara, M. Introducing difference recurrence relations for faster semi-global alignment of long sequences. BMC Bioinformatics 19, 33–47 (2018).

  33. Cheng, H., Concepcion, G., Feng, X., Zhang, H. & Li, H. Human Assemblies Evaluated in the Hifiasm Paper (Zenodo, 2020); https://doi.org/10.5281/zenodo.4393631

  34. Kalikar, S., Jain, C., Md, V. & Misra, S. mm2-fast Source Code Used in the Manuscript—Accelerating Minimap2 for Long-Read Sequencing Applications on Modern CPUs (Zenodo, 2022); https://doi.org/10.5281/zenodo.5888171

  35. Kalikar, S., Jain, C., Md, V. & Misra, S. Scripts Used for the Experiments in the Manuscript—Accelerating Minimap2 for Long-Read Sequencing Applications on Modern CPUs (Zenodo, 2022); https://doi.org/10.5281/zenodo.5884451

Acknowledgements

This work is supported in part by the National Supercomputing Mission (NSM) India under DST/NSM/R&D_HPC_Applications to C.J. The authors are grateful to H. Li for guidance and technical discussions on minimap2 and for working with us to integrate our improvements into a branch of the minimap2 GitHub repository.

Author information

Contributions

S.K. led the software implementation of mm2-fast. All authors contributed to algorithm design, experiments and manuscript preparation, and read and approved the final manuscript.

Corresponding authors

Correspondence to Saurabh Kalikar, Chirag Jain, Md Vasimuddin or Sanchit Misra.

Ethics declarations

Competing interests

S.K., V.M. and S.M. are employees of Intel Corporation.

Peer review

Peer review information

Nature Computational Science thanks Aydin Buluc, Zemin Ning and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Handling editor: Fernando Chirigati, in collaboration with the Nature Computational Science team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Minimap2 workflow depicting its three key modules – (i) seeding, (ii) chaining, and (iii) alignment – and mm2-fast optimizations.

The seeding stage identifies short fixed-length exact matches between a read and a reference sequence. The chaining stage selects an ordered subset of these exact matches (anchors) to form a chain. The final alignment stage computes base-level alignments to fill the gaps between adjacent anchors in these chains. Our optimizations to each of the modules are shown in the blue dotted rectangle.
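
The seeding step described above can be illustrated with a minimal sketch. It assumes that every k-mer of the reference is indexed, whereas minimap2 samples minimizers (ref. 11) rather than all k-mers; the function name kmer_anchors and the (reference position, read position, length) anchor layout are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def kmer_anchors(reference, read, k):
    """Return exact k-mer matches as (ref_pos, read_pos, k) tuples.

    Simplification (assumption): all k-mers are indexed, not only minimizers.
    """
    # Index every k-mer of the reference by its start position.
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)

    # Each k-mer of the read that also occurs in the reference is an anchor.
    anchors = []
    for j in range(len(read) - k + 1):
        for i in index.get(read[j:j + k], ()):
            anchors.append((i, j, k))
    return sorted(anchors)


anchors = kmer_anchors("ACGTACGGA", "TACGG", 4)
```

Anchors produced this way are the input to the chaining stage, which keeps only an ordered, co-linear subset.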

Extended Data Fig. 2 Cross-platform performance of our optimizations for Rome, Skylake, Cascade Lake and Ice Lake architectures using single socket.

The x axis shows the query datasets and the y axis indicates the speedup achieved by mm2-fast over minimap2, with both running on the same CPU.

Extended Data Fig. 3 Data structures used for hash table.

Minimizers extracted from the reference sequence are stored in a sorted list as key-value pairs. A separate position list holds the positions of the minimizers on the reference sequence.
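
One possible way to picture this layout is sketched below. It assumes that each value is an (offset, count) pair pointing into the position list, which is an illustrative choice and not necessarily the encoding used by minimap2 or mm2-fast; the names build_index and lookup are hypothetical.

```python
from bisect import bisect_left


def build_index(minimizer_hits):
    """Build a sorted key list, a parallel value list and a flat position list.

    minimizer_hits: iterable of (minimizer_key, reference_position) pairs.
    values[i] is an (offset, count) pair into `positions` for keys[i]
    (this value encoding is an assumption made for the sketch).
    """
    keys, values, positions = [], [], []
    for key, pos in sorted(minimizer_hits):
        if keys and keys[-1] == key:
            # Same minimizer seen again: extend its run in the position list.
            offset, count = values[-1]
            values[-1] = (offset, count + 1)
        else:
            keys.append(key)
            values.append((len(positions), 1))
        positions.append(pos)
    return keys, values, positions


def lookup(keys, values, positions, key):
    """Return all reference positions of `key`, or an empty list."""
    i = bisect_left(keys, key)  # binary search over the sorted keys
    if i < len(keys) and keys[i] == key:
        offset, count = values[i]
        return positions[offset:offset + count]
    return []


keys, values, positions = build_index([(42, 100), (7, 5), (42, 300), (13, 60)])
hits = lookup(keys, values, positions, 42)
```

Keeping the keys sorted is what makes the list searchable by position prediction as well as by binary search.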

Extended Data Fig. 4 Two-layer RMI.

An example minimizer lookup is illustrated: get_mm_hits(mm5) looks up the minimizer mm5. The RMI root predicts the leaf-layer model, which in turn predicts the location of mm4 in the sorted list. Finally, the last-mile search walks from mm4 to the location of mm5 and returns its value to the caller.
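
The lookup path in this figure can be sketched as follows. This is a minimal illustration that assumes least-squares linear models at both layers and an unbounded linear last-mile walk; a production learned index would typically bound the search using recorded model errors. The class TwoLayerRMI and its methods are hypothetical names, not the mm2-fast API.

```python
def fit_line(xs, ys):
    """Least-squares fit y ~ a*x + b; constant model for degenerate input."""
    n = len(xs)
    if n == 0:
        return 0.0, 0.0
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var = sum((x - mean_x) ** 2 for x in xs)
    if var == 0:
        return 0.0, mean_y
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var
    return a, mean_y - a * mean_x


class TwoLayerRMI:
    def __init__(self, keys, num_leaves=4):
        self.keys = keys  # sorted, distinct integer keys
        self.num_leaves = num_leaves
        n = len(keys)
        # Root model: maps a key to a leaf-model id.
        self.root = fit_line(keys, [i * num_leaves / n for i in range(n)])
        # Leaf models: each maps the keys routed to it to list positions.
        buckets = [([], []) for _ in range(num_leaves)]
        for i, k in enumerate(keys):
            xs, ys = buckets[self._leaf_id(k)]
            xs.append(k)
            ys.append(i)
        self.leaves = [fit_line(xs, ys) for xs, ys in buckets]

    def _leaf_id(self, key):
        a, b = self.root
        return min(self.num_leaves - 1, max(0, int(a * key + b)))

    def predict(self, key):
        """Predicted position of `key`, clamped to the valid index range."""
        a, b = self.leaves[self._leaf_id(key)]
        return min(len(self.keys) - 1, max(0, int(round(a * key + b))))

    def find(self, key):
        """Return the index of `key` in the sorted list, or -1 if absent."""
        if not self.keys:
            return -1
        i = self.predict(key)
        # Last-mile search: walk from the predicted slot toward the key.
        while i > 0 and self.keys[i] > key:
            i -= 1
        while i < len(self.keys) - 1 and self.keys[i] < key:
            i += 1
        return i if self.keys[i] == key else -1


rmi = TwoLayerRMI([3, 8, 15, 16, 23, 42, 57, 91, 120, 200])
index_of_42 = rmi.find(42)
```

The closer the prediction lands to the true position, the shorter the last-mile walk, which is where the speedup over a plain search comes from.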

Extended Data Fig. 5 Chaining of two co-linear anchors A and B.

Here the two anchors overlap on the query sequence. The gap cost function in minimap2 is calculated from the reference gap, the query gap and the average length of all anchors, avg_qlen.
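
To make the role of these three quantities concrete, the sketch below follows the chaining score as formulated in the minimap2 paper (ref. 7): a gap of length l between the reference and query offsets costs 0.01 * avg_qlen * |l| + 0.5 * log2|l|, and is zero when l = 0. It is a simplified quadratic-time illustration and an assumption about what the figure depicts; the actual minimap2 and mm2-fast code uses integer approximations, a bounded look-back and vectorization. The names gap_cost and best_chain are hypothetical.

```python
import math


def gap_cost(ref_gap, query_gap, avg_qlen):
    """Penalty for the difference between reference and query gaps."""
    l = abs(ref_gap - query_gap)
    if l == 0:
        return 0.0
    return 0.01 * avg_qlen * l + 0.5 * math.log2(l)


def best_chain(anchors):
    """Return the highest-scoring co-linear chain.

    anchors: list of (ref_pos, query_pos, length) tuples.
    """
    anchors = sorted(anchors)
    if not anchors:
        return []
    avg_qlen = sum(w for _, _, w in anchors) / len(anchors)
    score = [w for _, _, w in anchors]  # a chain of one anchor scores its length
    prev = [-1] * len(anchors)
    for i, (ri, qi, wi) in enumerate(anchors):
        for j in range(i):
            rj, qj, _ = anchors[j]
            dr, dq = ri - rj, qi - qj
            if dr <= 0 or dq <= 0:
                continue  # anchor j does not precede anchor i on both sequences
            # Count only newly covered bases, so overlapping anchors
            # are not scored twice.
            matched = min(dr, dq, wi)
            cand = score[j] + matched - gap_cost(dr, dq, avg_qlen)
            if cand > score[i]:
                score[i], prev[i] = cand, j
    # Backtrack from the best-scoring anchor.
    i = max(range(len(anchors)), key=score.__getitem__)
    chain = []
    while i != -1:
        chain.append(anchors[i])
        i = prev[i]
    return chain[::-1]


chain = best_chain([(10, 10, 5), (20, 20, 5), (30, 200, 5), (30, 30, 5)])
```

In this example the anchor at query position 200 is left out of the chain because the large difference between its reference and query gaps outweighs the bases it would add.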

Supplementary information

Supplementary Information

Supplementary Tables 1–4, Figs. 1 and 2, Algorithms 1 and 2, and Sections 1 and 2.

Peer Review Information

Supplementary Data 1

Source data showing the single-threaded and multithreaded runtime of mm2-fast.

Supplementary Data 2

Source data showing the time spent by mm2-fast and minimap2 in the chaining module.

Source data

Source Data Fig. 1

Source Data showing the time spent by mm2-fast and minimap2 in various modules.

Source Data Fig. 2

Source Data showing the end-to-end mapping time of mm2-fast and minimap2 on the full datasets.

Source Data Extended Data Fig. 2

Source data showing the speedups of mm2-fast on various architectures.

About this article

Cite this article

Kalikar, S., Jain, C., Vasimuddin, M. et al. Accelerating minimap2 for long-read sequencing applications on modern CPUs. Nat Comput Sci 2, 78–83 (2022). https://doi.org/10.1038/s43588-022-00201-8

  • DOI: https://doi.org/10.1038/s43588-022-00201-8
