Research Briefing
Published: 08 March 2024

Creating diploid assemblies from Nanopore and Illumina reads with hypo-assembler

Nature Methods volume 21, pages 560–561 (2024)

Diploid assembly is a difficult task that requires several types of genomic sequencing data, including, but not limited to, HiFi reads and parental sequences. Hypo-assembler, an assembly algorithm, uses highly accurate solid k-mers extracted from Illumina data alongside Nanopore reads to produce a high-quality diploid assembly from Nanopore and Illumina data alone.

The problem

Diploid genome assemblies currently require a substantial amount of sequencing data for reliable results, including particularly accurate HiFi reads¹ or parental data (normally generated from Illumina reads²). Existing methods rely on these data to resolve repetitive regions, find variants and phase them. However, in a practical setting, HiFi reads require a greater amount of genomic data than other sequencing technologies do, and parental data are not always available.

Because of the above limitations, we chose to explore assembly methods that use Oxford Nanopore Technologies reads and Illumina paired-end short reads only. This combination requires fewer genomic data and is more cost-effective, which makes the overall pipeline more practical.

Current assemblies based on Nanopore reads tend to have lower accuracy and produce haploid genomes, mainly because Nanopore reads are less accurate than HiFi or Illumina reads. Although Illumina reads are very short (100× shorter than long reads on average), they are very accurate and can remedy the issues with Nanopore assemblies. We therefore investigated whether there was a viable approach for creating accurate haplotype-resolved assemblies by combining these two types of data.

The solution

We developed a diploid-genome assembly pipeline that we call hypo-assembler. Hypo-assembler first creates a draft assembly using Nanopore reads; then, it uses Illumina reads to help polish the draft assembly into a diploid assembly, increasing its quality while finding variants and phasing them. This polishing step has been packaged separately into the hypo-short (Illumina reads only) and hypo-hybrid (Illumina and Nanopore reads) polishers. Because target genomes contain many repetitive regions, one major challenge for polishing the draft assembly lies in correctly aligning the noisy Nanopore reads and the short Illumina reads to the draft assembly. To address this challenge, we use solid k-mers from Illumina reads³, which are substrings of length k that are expected to occur exactly once within the target genome. An analysis of the reference genome CHM13 (ref. ⁴) showed that it is possible to obtain highly accurate solid k-mers from Illumina short reads.
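
The definition of a solid k-mer can be made concrete with a small sketch. The Python below is not hypo-assembler's implementation; it only illustrates the uniqueness criterion on a known sequence. The function names (`canonical`, `unique_kmers`), the toy sequence and the choice to merge each k-mer with its reverse complement are assumptions made for illustration.

```python
from collections import Counter

# Translation table for complementing DNA bases (assumes an ACGT alphabet).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def unique_kmers(genome: str, k: int) -> set[str]:
    """Return canonical k-mers that occur exactly once in `genome` on either strand."""
    counts = Counter(
        canonical(genome[i:i + k]) for i in range(len(genome) - k + 1)
    )
    return {kmer for kmer, n in counts.items() if n == 1}


# Hypothetical toy sequence; a real analysis would scan whole chromosomes.
toy_genome = "AAACAAAG"
solid = unique_kmers(toy_genome, k=3)
print(sorted(solid))
```

On the toy sequence, AAA occurs twice and is discarded, whereas the four remaining 3-mers are retained. In practice the genome is unknown, so solid k-mers have to be inferred from Illumina read counts rather than from a finished sequence.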

Figure 1 shows the hypo-assembler pipeline, which polishes a Nanopore draft assembly into a diploid assembly. First, a rough assembly is obtained from the Nanopore long reads and the solid k-mers from the Illumina short reads. Using the uniqueness of solid k-mers, the nonrepetitive regions of the draft assembly are identified and polished. Additionally, the solid k-mers are used as anchors to achieve a more accurate alignment of both long and short reads, resulting in an improved consensus. This alignment process can be further enhanced by restricting the solid k-mers used according to their frequency: keeping only k-mers observed at more than half the expected coverage focuses the analysis on k-mers that are less likely to be haplotype-specific, which facilitates segmenting the genome and, finally, producing two separate haplotypes. We validated this solution using the incomplete diploid assembly reference of HG002 (ref. ⁵) alongside reference-free validations such as YAK and showed that it produces high-quality diploid assemblies.
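
The frequency filter and the anchoring idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the function names, the toy counts and the exact-match anchor lookup are hypothetical, and k-mers are not merged with their reverse complements, for brevity.

```python
from typing import Dict, List, Set, Tuple


def filter_by_coverage(counts: Dict[str, int], expected_coverage: float) -> Set[str]:
    """Keep k-mers seen at more than half the expected coverage."""
    threshold = expected_coverage / 2
    return {kmer for kmer, n in counts.items() if n > threshold}


def anchor_positions(sequence: str, kmers: Set[str], k: int) -> List[Tuple[int, str]]:
    """List (offset, k-mer) pairs where a retained k-mer matches `sequence` exactly."""
    return [
        (i, sequence[i:i + k])
        for i in range(len(sequence) - k + 1)
        if sequence[i:i + k] in kmers
    ]


# Toy Illumina counts for three solid 5-mers (illustrative values only).
solid_counts = {"ACGTA": 30, "CGTAC": 14, "GTACC": 16}
kept = filter_by_coverage(solid_counts, expected_coverage=30)

# Offsets of the retained k-mers on a hypothetical draft contig.
draft = "TTACGTACCTT"
anchors = anchor_positions(draft, kept, k=5)
print(sorted(kept))
print(anchors)
```

With an expected coverage of 30 the threshold is 15, so CGTAC (14 occurrences, and therefore more likely to belong to a single haplotype) is dropped, and the other two k-mers anchor the draft at offsets 2 and 4. Applying the same lookup to a read would give matching anchor pairs that could constrain its alignment.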

**Fig. 1: Overview of the hypo-assembler algorithm.**

Future directions

Our simple frequency analysis with solid k-mers resulted in major improvements to the assembly, suggesting that the potential of k-mer frequency analysis is vast and still underexplored. Using k-mers of different lengths and different occurrence frequencies might be an interesting direction to further improve genome assembly.
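
One way to begin exploring k-mers of different lengths and frequencies is to compare k-mer spectra, that is, how many distinct k-mers occur once, twice and so on, for several values of k. The sketch below is only an illustration with hypothetical names and toy reads; it is not part of hypo-assembler.

```python
from collections import Counter
from typing import Iterable


def kmer_spectrum(reads: Iterable[str], k: int) -> Counter:
    """Map each occurrence count to the number of distinct k-mers with that count."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return Counter(counts.values())


# Toy reads (hypothetical); real spectra are built from full Illumina data sets.
toy_reads = ["ACGTACGT", "CGTACG"]
for k in (4, 6):
    spectrum = kmer_spectrum(toy_reads, k)
    print(k, dict(sorted(spectrum.items())))
```

For the toy reads, every 4-mer is seen twice, whereas at k = 6 two of the three distinct k-mers are seen only once, which shows how the choice of k changes the frequency profile that any filter would operate on.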

Improvements in sequencing technology itself will markedly affect the genome assembly process. Oxford Nanopore Technologies has been releasing improved platforms with higher accuracy and longer reads, as well as better base callers that further increase accuracy. With sufficiently accurate Nanopore data, it might be possible to forgo the Illumina reads and create a haplotype-resolved assembly from Nanopore data alone.

There are some limitations to our method. As an algorithm, hypo-assembler is currently restricted to diploid genomes, although with sufficient sequencing data it might be possible to extend it to polyploid genomes. Hypo-assembler is also not able to assemble very long repetitive regions such as rDNA arrays. However, even with its current limitations, hypo-assembler is a practical tool that will help to improve the availability of diploid assemblies.

Joshua Casey Darian¹ & Wing-Kin Sung²

¹National University of Singapore, Singapore, Singapore; ²Chinese University of Hong Kong, Hong Kong, China.

Expert opinion

“Remaining base-level errors in nanopore assemblies and the inability to generate diploid representation are key disadvantages of the nanopore approach. This work demonstrates a substantial improvement in both aspects and has the potential to become a useful resource for future studies using nanopore sequencing.” An anonymous reviewer.

Behind the paper

The work behind our algorithm, hypo-assembler, started as a simple genome polisher: starting from a draft genome, we simply intended to improve its base-level quality. However, the more we tried to improve base-level quality, the more we came to appreciate that it is linked to assembly contiguity. Many base-level problems cannot be solved without resolving assembly problems: breaking misassemblies, resolving duplicate contigs and joining distant contigs. In the end, we arrived at an improvement pipeline that reassembles the genome. On another note, this process overlapped considerably with improvements in the CHM13 assembly, and we really appreciate the availability of such high-quality reference genomes, which have enabled us to improve our algorithm. J.C.D.

From the editor

“Nanopore-based diploid assemblies often suffer from low quality as Nanopore reads are error prone. This work presents two polishers called hypo-short and hypo-hybrid to polish Nanopore-based draft genome by reducing base errors and improving phasing. They also provide a diploid genome assembly pipeline, hypo-assembler, that shows base and phasing accuracy comparable to HiFi-based assemblies.” Editorial Team, Nature Methods.

References

1. Cheng, H. et al. Haplotype-resolved assembly of diploid genomes without parental data. Nat. Biotechnol. 40, 1332–1335 (2022). This paper presents HiFiAsm, a HiFi-based haplotype-resolved assembler.
2. Koren, S. et al. De novo assembly of haplotype-resolved genomes with trio binning. Nat. Biotechnol. 36, 1174–1182 (2018). This paper presents TrioCanu, a trio-based haplotype-resolved assembler.
3. Ariyaratne, P. N. et al. PE-Assembler: de novo assembler using short paired-end reads. Bioinformatics 27, 167–174 (2011). This paper is a previous use of the idea of solid k-mers.
4. Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022). The paper presents CHM13, a complete human reference genome.
5. Jarvis, E. D. et al. Semi-automated assembly of high-quality diploid human reference genomes. Nature 611, 519–531 (2022). This paper presents the incomplete HG002 reference genome used in the evaluation.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This is a summary of: Darian, J. C. et al. Constructing telomere-to-telomere diploid genome by polishing haploid nanopore-based assembly. Nat. Methods https://doi.org/10.1038/s41592-023-02141-1 (2024).

Cite this article

Creating diploid assemblies from Nanopore and Illumina reads with hypo-assembler. Nat Methods 21, 560–561 (2024). https://doi.org/10.1038/s41592-023-02142-0

Issue Date: April 2024
DOI: https://doi.org/10.1038/s41592-023-02142-0
