Background & Summary

Date palm (Phoenix dactylifera) is considered one of the most economically important fruit crops grown in the arid lands of the Arabian Peninsula, the Middle East and North Africa. This evergreen tree is well adapted to harsh desert conditions of long hot summers, little rainfall and low humidity. The United Arab Emirates (UAE) has the largest number of date palms in the world and is considered among the top global exporters of dates1. However, date palm orchards in the UAE have recently been suffering from serious diseases caused by fungal pathogens2,3, including sudden death syndrome (SDS; also known as date palm wilt disease)4.

Although researchers have reported several Fusarium species associated with SDS symptoms worldwide3,5,6,7, Fusarium oxysporum f.sp. cumini (Foc) DSM106834, F. proliferatum (Fp) DSM106835 and F. solani (Fs) DSM106836 are the causal agents of SDS on date palm in the UAE4. In North Africa, Bayoud, which is linked with F. oxysporum f.sp. albedinis (Foa), is the most destructive fungal disease of date palm8,9. Fs was, however, found associated with declining date palm trees in Pakistan10. Fp was also identified as the main Fusarium sp. causing SDS in Saudi Arabia, Iraq, Jordan and Tunisia11,12,13,14.

The soil-borne filamentous fungus Fp is a plant pathogen that belongs to the family Nectriaceae in the division Ascomycota. Fp is part of the F. fujikuroi species complex (FFSC), which comprises around 60 phylogenetic species of phytopathological and clinical relevance15,16. Like other Fusarium spp., Fp can produce the mycotoxin fumonisin17,18. Fumonisins are carcinogenic, estrogenic and immunosuppressive in mammals and may cause birth defects of the brain and spinal cord18,19. Other mycotoxins, such as beauvericin, enniatins and moniliformin, can also be produced by Fp and act as virulence factors and specific effectors to elicit resistance to SDS in date palm11,13,14.

Although SDS has been reported to negatively affect date palm plantations in the UAE and elsewhere, genetic information on the causal agent remains scarce. Therefore, we performed whole-genome sequencing of Fp DSM106835 using PacBio® high-throughput sequencing with highly accurate long HiFi reads. Here, we present a highly contiguous and complete de novo genome assembly for Fp DSM106835, the main causal agent of SDS on date palm in the UAE, generated from PacBio HiFi long reads and Omni-C data. The final genome is about 58.5 Mbp across 418 scaffolds, with a scaffold N50 of 4.4 Mbp and a Benchmarking Universal Single-Copy Orthologs (BUSCO)20 score of 99.2%. This genome is a valuable resource for studying evolutionary relationships and elucidating the molecular mechanisms of host specificity, and thus for improving our understanding of the Fp DSM106835-date palm interaction.

Methods

Growth and culture maintenance of F. proliferatum DSM106835

The pathogen, Fp DSM106835, was previously isolated from date palm trees showing SDS symptoms in the Al Wagan area of Al Ain, Abu Dhabi, UAE, and was grown and maintained at 25°C on potato dextrose agar plates (PDA; Lab M Limited, Lancashire, UK) supplemented with 25 mg/L penicillin-streptomycin (Sigma-Aldrich Chemie GmbH, Taufkirchen, Germany)4. Cultures were subcultured onto fresh PDA plates every 14 days until pure Fp DSM106835 colonies were obtained. A flow scheme of the isolation and culturing of Fp DSM106835 can be found in Fig. 1.

Fig. 1

Flow diagram of the isolation, genome sequencing and assembly of Fusarium proliferatum DSM106835. Date palm trees showing symptoms of SDS were used to establish a pure culture of F. proliferatum DSM106835. Spores produced by the fungal pathogen were observed under light microscopy and further used for HMW DNA extraction. Omni-C and HiFi SMRTbell libraries were prepared for Illumina HiSeq-X (short-read sequencing) and PacBio® Sequel II (long-read sequencing), respectively. HiFi and Omni-C reads were merged to develop a long-read-only assembly where all chromosomes were present as single contigs without the introduction of artificial gaps (Courtesy of Illumina, Inc., Pacific Biosciences of California, Inc.). SDS, sudden death syndrome; HMW, high molecular weight.

DNA extraction and PacBio HiFi sequencing

High molecular weight (HMW) DNA was extracted by first scraping all visible fungal material from the Petri dish and transferring it to a 50-ml tube containing 2 ml of autoclaved ddH2O. The material was flash frozen to create a pellet of ~500 mg and ground to a powder. To the ground sample, 10 ml of cetyltrimethylammonium bromide (CTAB) and 100 µl of β-mercaptoethanol (BME) were added, and the mixture was incubated at 68°C for 15 minutes. After incubation, 10 µl of protease and 1 µl of RNase were added, and the sample was incubated at 60°C for 30 minutes. DNA was extracted from the cell lysate with phenol/chloroform/isoamyl alcohol and pelleted by centrifugation. The pellet was resuspended in 200 µl of Tris-EDTA (TE) buffer. DNA samples were first sequenced on the PacBio Sequel II sequencer at Dovetail Genomics (Scotts Valley, California, USA). For this step, PacBio SMRTbell libraries (20 kbp) were prepared using the SMRTbell Express Template Prep Kit 2.0 (PacBio, Menlo Park, CA) according to the manufacturer’s protocol.

Omni-C sequencing

Omni-C sequencing is a chromatin conformation capture technology that allows the investigation of the genome’s three-dimensional (3D) organisation. The Omni-C library was prepared using the Dovetail® Omni-C® Kit according to the manufacturer’s protocol. Briefly, the chromatin was fixed in the nucleus with disuccinimidyl glutarate (DSG) and formaldehyde. The crosslinked chromatin was then digested in situ with DNase I.

After digestion, chromatin fragments attached to Chromatin Capture Beads were released by lysing the cells with sodium dodecyl sulfate buffer. The chromatin ends were repaired and then ligated to a biotinylated bridge adapter. After proximity ligation, crosslinks were reversed and the DNA was purified. Sequencing libraries were generated using Illumina-compatible adaptors. Biotin-containing fragments were isolated using streptavidin beads before PCR amplification. The library was sequenced on an Illumina HiSeq-X platform. A flow scheme of HMW DNA extraction, library preparations and genome assembly of Fp DSM106835 can be found in Fig. 1.

De novo genome assembly

The genome assembly was carried out by first using 26.4 Gbp of PacBio Circular Consensus Sequencing (CCS) reads as input to the hifiasm assembler21 with default parameters to create the initial de novo assembly. Omni-C sequencing resulted in a paired-end set of raw reads, each 11,489,515 bp in length with a GC content of 49% (Table 1). These reads, along with the de novo assembly, were used as input data for HiRise22, a software pipeline explicitly designed for using proximity ligation data to scaffold genome assemblies (Fig. 2a). Dovetail Omni-C library sequences were aligned to the draft input assembly using BWA23, pairtools24 was used to remove PCR duplicates from the aligned read pairs, and SAMtools25 was used to generate the final BAM file. Quality control using the script get_qc.py, part of the HiRise package, found that 88,132,543 read pairs (76.71%) were mapped and 12,232,575 (10.65%) were unmapped. The HiRise pipeline was used to identify misassemblies, and to break and sort scaffolds (only those above the threshold) in accordance with the likelihood model used by HiRise. Omni-C contact maps were created from the output of HiRise using Juicer26, and the contact map was configured to identify topologically associating domains and A/B genome compartments. The configured contact map was visualised using Juicebox27 (Fig. 2b). The final de novo assembly was 58,468,907 bp in length, with an N50 of 4,383,091 bp. This assembly was used as a query in a BLASTN28 search against the National Center for Biotechnology Information (NCBI) nucleotide database29, and the results were used as input for blobtools230 to visualise the assembly and its taxonomic partitioning (Fig. 2c). The genome was also assembled with the HiCanu31 assembler to compare against and validate the hifiasm assembly. The completeness of the final assembly was assessed using BUSCO with the fungi_odb10 lineage-specific profile32.
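For readers unfamiliar with the N50 statistic reported above, the following minimal Python sketch shows how it is conventionally computed: scaffold lengths are sorted in descending order and accumulated until at least half of the total assembly length is reached, and the length of the scaffold that crosses this point is the N50. The function name n50 and the toy lengths are illustrative assumptions; they are not part of the HiRise pipeline or of this dataset.

```python
def n50(lengths):
    """Return the N50 of an iterable of sequence lengths (in bp)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first scaffold that brings the running sum to at least
        # half of the total assembly length defines the N50.
        if running >= half_total:
            return length
    raise ValueError("no sequence lengths supplied")


# Toy example (hypothetical lengths, not the Fp DSM106835 scaffolds).
toy_lengths = [40, 30, 20, 5, 5]
print(n50(toy_lengths))  # 30
```

Applied to the scaffold lengths of a real assembly, the same function would be expected to return the value reported by standard assembly statistics tools, provided both use the same definition.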

Table 1 Information on the assembly of Fusarium proliferatum DSM106835.
Fig. 2

Taxonomic partitioning, average read length of the raw data and Omni-C contact map of Fusarium proliferatum DSM106835. (a) Cumulative length of scaffolds for the assembly; (b) Omni-C contact map showing the intensity of the physical interaction between genome regions; and (c) taxonomic partitioning of F. proliferatum DSM106835 raw reads generated using blobtools2. In (b), the primary 10 chromosome-length scaffolds are highlighted in blue. In (c), blue represents Ascomycota while grey represents the reads with no hits.

Transposable element analysis, gene prediction and annotation

The assembly of Fp DSM106835 was subjected to transposable element (TE) analysis using a customised repeat annotation pipeline. This pipeline incorporated multiple de novo TE discovery tools, including RepeatModeler33, HelitronScanner34, MITE Tracker35 and SINEScan36, together with RepeatMasker. In brief, RepeatModeler integrates RECON37, RepeatScout38, and LTRharvest/LTR_retriever39. Together, these tools yielded a comprehensive representation of TEs and a relatively complete TE library. Subsequently, RepeatMasker was run with this library to identify genome-wide TEs and to mask all repeats and tandem sequences. The resulting masked genome sequences were then subjected to de novo gene prediction and annotation using BRAKER 240. In the BRAKER 2 pipeline, Augustus41 was trained with protein sequences of orthologous genes from fungal genomes to aid gene prediction. The genome was then subjected to functional annotation and Gene Ontology (GO) analysis using Blast2GO42, and secondary metabolites were predicted using fungal-antiSMASH43.

Assessment of completeness and continuity of the genome assembly

For the assembly continuity comparison, the genome sequences of seven Fp strains with gene annotations, ET1 (FJOF00000000)44, FFSC RH7 (JAJALB000000000)45, Fp_A8 (MRDB00000000)46, ITEM2341 (PKMI00000000)47, MPVP328 (PKMJ00000000)48, NRRL62905 (FCQG00000000)49, and R16 (PKMG00000000)50, were downloaded from the NCBI database. These strains were compared against Fp DSM106835 in terms of the sequence length and average scaffold length of each assembly. Completeness was assessed by comparing the BUSCO results of each genome against the fungi_odb10 lineage-specific profile.
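As an illustration of how such a continuity comparison might be set up, the Python sketch below reads sequence lengths from a FASTA-formatted stream and returns the cumulative length of the k longest scaffolds or contigs, so that curves for different assemblies can be compared. The function names fasta_lengths and top_k_cumulative and the toy sequences are assumptions made for this example; the script is not the one used in this study.

```python
import io
from itertools import accumulate


def fasta_lengths(handle):
    """Return the length of every record in a FASTA text stream."""
    lengths = []
    current = None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A header line closes the previous record, if any.
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def top_k_cumulative(lengths, k):
    """Cumulative lengths of the k longest sequences, longest first."""
    return list(accumulate(sorted(lengths, reverse=True)[:k]))


# Toy FASTA with three short records (hypothetical data).
toy_fasta = io.StringIO(">s1\nACGTACGT\nACGT\n>s2\nACG\n>s3\nACGTA\n")
toy_lengths = fasta_lengths(toy_fasta)
print(toy_lengths)                       # [12, 3, 5]
print(top_k_cumulative(toy_lengths, 2))  # [12, 17]
```

For a real comparison, each downloaded assembly would be opened as a text file and passed to fasta_lengths, and the last value returned by top_k_cumulative for a given k would be compared across strains.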

Data Records

All sequence data, including raw HiFi long reads and Omni-C short reads, were deposited in the NCBI database under BioProject PRJEB64160, with accessions ERR1173347951 and ERR1173347852, respectively. The genome assembly is available through NCBI GenBank with the accession CAUHTQ00000000053. The genome annotation information was deposited in the Figshare database54.

Technical Validation

Evaluating the quality of the genome assembly

The PacBio sequencing produced 1,754,151 raw HiFi long reads with an average read length of 15,045.5 bp, amounting to 26.4 Gbp and approximately 560x coverage; most reads were between 5,000 and 25,000 bp in length (Supplementary Fig. S1). Using the hifiasm and HiRise software, the assembly of HiFi reads with Omni-C reads generated 418 scaffolds totalling 58.47 Mbp, with an N50 of 4.38 Mbp. The largest 11 scaffolds had a combined size of 45.18 Mbp, which accounted for 77.3% of the entire genome (Table 1). The HiCanu assembly gave results similar to those of the hifiasm assembly (Supplementary Fig. S2). The assembly achieved a BUSCO completeness of 99.2% for the 758 orthologous genes in fungi_odb10, similar to the genome assembly of Fp strain Fp_A8 (99.3%; Table 1).
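The arithmetic behind the yield and scaffold-share figures above can be sketched in a few lines of Python. The read count, mean read length, assembly length and combined size of the 11 largest scaffolds are taken from the text; the helper name sequencing_depth is hypothetical. Depth is simply total bases divided by a genome size, so the value obtained depends on the genome size chosen as the denominator. Using the final assembly length here is an assumption of this sketch and need not reproduce the coverage quoted above, which may rest on a different genome size estimate.

```python
def sequencing_depth(total_bases, genome_size_bp):
    """Average depth = total sequenced bases / genome size (both in bp)."""
    if genome_size_bp <= 0:
        raise ValueError("genome size must be positive")
    return total_bases / genome_size_bp


n_reads = 1_754_151          # raw HiFi reads (from the text)
mean_read_len = 15_045.5     # average read length in bp (from the text)
assembly_bp = 58_468_907     # final assembly length in bp (from the text)
top11_bp = 45.18e6           # combined size of the 11 largest scaffolds

yield_bp = n_reads * mean_read_len
top11_share = 100 * top11_bp / assembly_bp

print(f"HiFi yield: {yield_bp / 1e9:.1f} Gbp")             # 26.4 Gbp
print(f"Top-11 scaffold share: {top11_share:.1f}%")        # 77.3%
# Depth relative to the assembly length (an assumed denominator).
print(f"Depth vs assembly: {sequencing_depth(yield_bp, assembly_bp):.0f}x")
```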

Genome annotation

A total of 3.96 Mbp of transposable repeat sequences were detected in the genome of Fp DSM106835, including retroelements (0.48 Mbp), DNA transposons (0.39 Mbp), rolling-circle elements (Helitrons; 1.52 Mbp), and some unclassified repeats (1.56 Mbp), collectively constituting 6.76% of the total genome (Table 2; Fig. 3). Notably, the genome of Fp DSM106835 also included long terminal repeat (LTR) retroelements that belong to the Gypsy superfamily. Helitron rolling-circle elements and unclassified elements accounted for a large share of the repeat sequences. Gene prediction using BRAKER240 resulted in 15,580 putative genes, of which 267 were TE genes and 15,313 were non-TE genes. We also detected 16,321 transcripts, and the average gene length was about 1,580 bp. After functional annotation of the predicted sequences, the distribution of GO terms for cellular components, molecular functions and biological processes was determined (Fig. 4a), with the highest number of annotations belonging to GO levels 3–7. The evidence code distribution was also calculated; most annotations carried the Inferred from Electronic Annotation (IEA) or Inferred from Biological aspect of Ancestor (IBA) evidence codes (Fig. 4b). Similarly, enzyme code (EC) classification showed that most of the sequences were either transferases or oxidoreductases (Fig. 4c).
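The repeat and gene tallies above can be cross-checked with simple arithmetic, sketched here in Python. The per-class repeat sizes, the assembly size of 58.47 Mbp and the gene counts are taken from the text; the variable names are illustrative. Because the per-class values are rounded to two decimals, their sum (3.95 Mbp) can differ slightly from the reported total of 3.96 Mbp.

```python
# Repeat classes and their sizes in Mbp, as quoted in the text.
repeat_mbp = {
    "retroelements": 0.48,
    "DNA transposons": 0.39,
    "Helitrons": 1.52,
    "unclassified": 1.56,
}
assembly_mbp = 58.47  # total assembly size in Mbp (from the text)

total_repeat_mbp = sum(repeat_mbp.values())
repeat_percent = 100 * total_repeat_mbp / assembly_mbp

print(f"Sum of repeat classes: {total_repeat_mbp:.2f} Mbp")
print(f"Share of assembly: {repeat_percent:.2f}%")
for name, size in repeat_mbp.items():
    # Share of each class within the repeat fraction itself.
    print(f"  {name}: {100 * size / total_repeat_mbp:.1f}% of repeats")

# Gene counts: TE and non-TE genes should add up to the total.
te_genes, non_te_genes, total_genes = 267, 15_313, 15_580
print(te_genes + non_te_genes == total_genes)
```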

Table 2 Repeat sequence analysis of the genome of Fusarium proliferatum DSM106835.
Fig. 3

Circos map of the 11 significant scaffolds for Fusarium proliferatum DSM106835. The outer track represents the ideogram of the 11 scaffolds. The bin size of each track was 20 kbp. LTR, long terminal repeats.

Fig. 4

Functional annotation and Gene Ontology distribution for Fusarium proliferatum DSM106835. (a) Distribution of GO generated from the genome of F. proliferatum DSM106835; and (b) evidence code distribution for the obtained sequences. (c) EC classification for sequences present in the assembly; and (d) the number of secondary metabolite biosynthesis gene clusters identified from the first 11 scaffolds of the genome of F. proliferatum DSM106835. In (b), the distribution of evidence code for functional terms was obtained during the mapping step. GO, Gene Ontology; BP, biological process; MF, molecular function; CC, cellular component; EC, enzyme code.

Secondary metabolite biosynthesis gene clusters were also identified and counted (Fig. 4d). In general, the genome contained a variety of gene clusters, ranging from clinically relevant fumonisins to virulence-related ACT-Toxin II and phytotoxic destruxin A. Gene clusters of secondary metabolites were found to belong to the biosynthesis of fusaric acid, oxyjavanicin, gibberellin, bikaverin, ACT-Toxin II, koraiol, Fujikurin A, α-acorenol, NG-391 and Gibepyrone A (Table 3).

Table 3 List of secondary metabolite biosynthetic gene clusters identified from the genome of Fusarium proliferatum DSM106835 using antiSMASH.

Genome continuity and completeness analysis

The continuity analysis revealed that the assembly of Fp DSM106835 was more contiguous than those of the seven Fp strains retrieved from NCBI. The cumulative size of its first 10 scaffolds/contigs was 44.26 Mbp, which surpassed that of all other Fp strains, for which the values ranged from 12.19 Mbp in Fp Fp_A8 to 36.19 Mbp in Fp ET1 (Fig. 5a). The same genomes were compared for their completeness using BUSCO20, and Fp DSM106835 achieved a completeness rate of 99.2% for the 758 orthologous genes in the fungi_odb10 database, which is comparable to that of Fp Fp_A8 (99.3%; Fig. 5b).
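A BUSCO completeness rate is the percentage of genes in the lineage profile that are recovered as complete in the assembly. The short Python sketch below shows this calculation for the 758-gene fungi_odb10 profile. The count of 752 complete genes is a hypothetical value chosen only because it is consistent with the reported 99.2%; it is not taken from the BUSCO output of this study, and the function name completeness_percent is likewise an assumption.

```python
def completeness_percent(complete_genes, profile_genes):
    """Percentage of lineage-profile genes found complete in an assembly."""
    if profile_genes <= 0:
        raise ValueError("profile must contain at least one gene")
    if not 0 <= complete_genes <= profile_genes:
        raise ValueError("complete gene count out of range")
    return 100.0 * complete_genes / profile_genes


FUNGI_ODB10_GENES = 758       # profile size stated in the text
hypothetical_complete = 752   # assumed count, for illustration only

rate = completeness_percent(hypothetical_complete, FUNGI_ODB10_GENES)
print(f"{rate:.1f}%")  # 99.2%
```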

Fig. 5

Contiguity and completeness of the assembly of Fusarium proliferatum DSM106835. (a) Contiguity; and (b) completeness of the assembly of F. proliferatum DSM106835 compared with assemblies of seven F. proliferatum strains. In (a), only the 20 longest scaffolds are shown.