The problem

Diploid genome assembly currently requires a substantial amount of sequencing data for reliable results, including highly accurate HiFi reads (ref. 1) or parental data (normally generated from Illumina reads; ref. 2). Existing methods rely on these data to resolve repetitive regions, find variants and phase them. In practice, however, HiFi reads require a greater amount of genomic data than other sequencing technologies do, and parental data are not always available.

Because of these limitations, we chose to explore assembly methods that use only Oxford Nanopore Technologies reads and Illumina paired-end short reads. This combination requires less genomic data and is more cost-effective, which makes the overall pipeline more practical.

Current assemblies based on Nanopore reads tend to have lower accuracy and to produce haploid genomes, mainly because Nanopore reads are less accurate than HiFi or Illumina reads. Although Illumina reads are very short (on average, about 100× shorter than long reads), they are very accurate and can remedy the issues with Nanopore assemblies. We therefore investigated whether accurate haplotype-resolved assemblies could be created by combining these two types of data.

The solution

We developed a diploid-genome assembly pipeline called hypo-assembler. Hypo-assembler first creates a draft assembly from Nanopore reads; it then uses Illumina reads to polish the draft into a diploid assembly, increasing its quality while finding variants and phasing them. This polishing step has been packaged separately into the hypo-short (Illumina reads only) and hypo-hybrid (Illumina and Nanopore reads) polishers. Because the targeted genomes contain many repetitive regions, a major challenge in polishing the draft assembly lies in correctly aligning the noisy Nanopore reads and the short Illumina reads to it. To address this challenge, we use solid k-mers from Illumina reads (ref. 3): substrings of length k that are expected to occur exactly once within the target genome. An analysis of the reference genome CHM13 (ref. 4) showed that it is possible to obtain highly accurate solid k-mers from Illumina short reads.
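To make the idea of a solid k-mer concrete, the following is a minimal Python sketch that counts k-mers in a set of short reads and keeps those whose counts fall inside a frequency window. It is an illustration under stated assumptions, not the hypo-assembler implementation: the function names, the use of canonical (strand-independent) k-mers and the `low`/`high` bounds are all hypothetical choices made for this example.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k):
    """Count canonical k-mers over all reads, skipping k-mers with non-ACGT bases."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def select_solid_kmers(counts, low, high):
    """Keep k-mers whose count lies in [low, high].

    Assumption for this sketch: a k-mer that occurs once in the genome is
    seen roughly 'coverage' times in the reads, so a count window around
    the expected coverage is used as a proxy for uniqueness.
    """
    return {kmer for kmer, count in counts.items() if low <= count <= high}
```

For example, `select_solid_kmers(count_kmers(reads, 21), low, high)` would return candidate solid 21-mers for a read set; real data would call for a dedicated k-mer counter rather than an in-memory `Counter`.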

Figure 1 shows the hypo-assembler pipeline, which polishes a Nanopore draft assembly into a diploid assembly. First, a rough assembly is obtained from the Nanopore long reads, and solid k-mers are obtained from the Illumina short reads. Using the uniqueness of solid k-mers, the nonrepetitive regions of the draft assembly are identified and polished. The solid k-mers are also used as anchors to align both long and short reads more accurately, resulting in an improved consensus. This alignment process can be further enhanced by selectively limiting the usage of solid k-mers based on their frequency: a simple approach of using only k-mers with more than half the expected coverage focuses on k-mers that are less likely to be haplotype-specific, which facilitates segmenting the genome to finally produce two separate haplotypes. We validated this solution using the incomplete diploid assembly reference of HG002 (ref. 5) alongside reference-free validations such as YAK, and showed that it produces high-quality diploid assemblies.
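The half-coverage rule described above can be sketched as a simple partition of k-mer counts. This is a hedged illustration only: the function name `split_by_coverage`, the input dictionary of counts and the `max_factor` upper bound (used here to discard probable repeats) are assumptions introduced for the example and are not taken from the hypo-assembler code.

```python
def split_by_coverage(counts, expected_coverage, max_factor=1.5):
    """Partition k-mers by count relative to the expected coverage.

    Returns (shared, low):
      shared - count above half the expected coverage, so less likely
               to be haplotype-specific
      low    - count at or below half the expected coverage, which may be
               haplotype-specific k-mers or sequencing errors
    K-mers above expected_coverage * max_factor are dropped as probable
    repeats (an assumption of this sketch).
    """
    half = expected_coverage / 2
    upper = expected_coverage * max_factor
    shared, low = set(), set()
    for kmer, count in counts.items():
        if count > upper:
            continue  # probably repetitive, not usable as an anchor here
        if count > half:
            shared.add(kmer)
        else:
            low.add(kmer)
    return shared, low
```

With an expected coverage of 30, a k-mer seen 16 times would be kept as shared, one seen 14 times would fall into the low-count set, and one seen 90 times would be discarded.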

Fig. 1: Overview of the hypo-assembler algorithm.

The analysis pipeline starts from a rough initial assembly of the target genome, generated from Nanopore reads. (1) Solid k-mers are selected from the Illumina reads based on their frequency; (2) the regions represented by them in the assembly are then identified as solid regions; (3) short gaps between solid regions are fixed by filling them with the consensus sequences of each haplotype; (4) Nanopore reads are used to correct long insertions and deletions (indels) in long gaps; (5) the consensus of each haplotype is used to fill in longer gaps between solid regions; and (6) reusing the Nanopore reads, haplotypes are phased (that is, categorized as paternal or maternal). SNP, single-nucleotide polymorphism; SV, structural variation. © 2024, Darian, J. C. et al.
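Steps (2) and (3) of the caption can be illustrated with a short Python sketch that marks the stretches of an assembly covered by solid k-mers and then lists the gaps between them. It is a simplified illustration under stated assumptions, not the published algorithm: matching is exact and forward-strand only, and the function names and the `short_gap_max` threshold separating short from long gaps are hypothetical.

```python
def solid_regions(assembly, solid_kmers, k):
    """Return merged half-open intervals [start, end) covered by solid k-mers."""
    regions = []
    for i in range(len(assembly) - k + 1):
        if assembly[i:i + k] in solid_kmers:
            start, end = i, i + k
            if regions and start <= regions[-1][1]:
                # Overlaps or touches the previous region: extend it.
                regions[-1] = (regions[-1][0], max(regions[-1][1], end))
            else:
                regions.append((start, end))
    return regions


def gaps_between(regions, short_gap_max):
    """Split the gaps between consecutive solid regions into short and long."""
    short, long_ = [], []
    for (_, prev_end), (next_start, _) in zip(regions, regions[1:]):
        gap = (prev_end, next_start)
        if next_start - prev_end <= short_gap_max:
            short.append(gap)
        else:
            long_.append(gap)
    return short, long_
```

In this sketch, the short gaps would be the candidates for filling from a short-read consensus, and the long gaps would be left for correction with Nanopore reads.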

Future directions

Our simple frequency analysis with solid k-mers resulted in major improvements to the assembly, showing that the potential of k-mer frequency analysis is vast and still underexplored. Using k-mers of different lengths and different occurrence frequencies might be an interesting direction for further improving genome assembly.

Improvements in sequencing technology will markedly affect the genome assembly process. Nanopore platforms have been improving in accuracy and read length, and better base callers have further increased accuracy. With sufficiently accurate Nanopore data, it might be possible to forgo the Illumina reads and create a haplotype-resolved assembly from Nanopore data alone.

Our method has some limitations. Hypo-assembler is currently restricted to diploid genomes, although with sufficient sequencing data it might be possible to extend it to polyploid genomes. It also cannot assemble very long repetitive regions such as rDNA arrays. Even with these limitations, however, hypo-assembler is a practical tool that will help to improve the availability of diploid assemblies.

Joshua Casey Darian1 & Wing-Kin Sung2

1National University of Singapore, Singapore, Singapore; 2Chinese University of Hong Kong, Hong Kong, China.

Expert opinion

“Remaining base-level errors in nanopore assemblies and the inability to generate diploid representation are key disadvantages of the nanopore approach. This work demonstrates a substantial improvement in both aspects and has the potential to become a useful resource for future studies using nanopore sequencing.” An anonymous reviewer.

Behind the paper

The work behind our algorithm, hypo-assembler, started as a simple genome polisher: starting from a draft genome, we intended only to improve its base-level quality. However, the more we tried to improve base-level quality, the more we came to appreciate that it is linked to assembly contiguity. Many base-level problems cannot be solved without resolving assembly problems: breaking up misassemblies, resolving duplicate contigs and joining distant contigs. In the end, we arrived at an improvement pipeline that reassembles the genome. On another note, this work overlapped considerably with improvements in the CHM13 assembly, and we really appreciate the availability of such high-quality reference genomes, which have enabled us to improve our algorithm. J.C.D.

From the editor

“Nanopore-based diploid assemblies often suffer from low quality as Nanopore reads are error prone. This work presents two polishers called hypo-short and hypo-hybrid to polish Nanopore-based draft genome by reducing base errors and improving phasing. They also provide a diploid genome assembly pipeline, hypo-assembler, that shows base and phasing accuracy comparable to HiFi-based assemblies.” Editorial Team, Nature Methods.