Background & Summary

The order Decapoda represents one of the largest taxa within the subphylum Crustacea, encompassing at least 180 extant families and 14,756 species1. The giant river prawn (Macrobrachium rosenbergii), native to the Indo-West Pacific from northwest India to Vietnam, the Philippines, New Guinea, and northern Australia, is the largest known palaemonid in the world2,3. M. rosenbergii inhabits tropical freshwater environments influenced by adjacent brackish water areas and is predominantly omnivorous throughout its life cycle4. As an active predator in freshwater ecosystems, M. rosenbergii has been used as an indicator species for assessing water quality and metal accumulation.

Certain decapod crustaceans, such as shrimp, crabs, and lobsters, are economically important aquatic species that contribute to global food production5. M. rosenbergii is one of the most important cultured decapod species, with global production in 2022 reaching 294 thousand tons5,6. However, the decline in M. rosenbergii production from 2006 to 2012 raised concerns about inbreeding depression and a reduction in effective population size7,8. Moreover, the emergence of viral diseases poses a significant challenge to the sustainable development of M. rosenbergii aquaculture9,10. Decapod iridescent virus 1 (DIV1), a recently reported member of the family Iridoviridae, has been highlighted by the World Organisation for Animal Health (WOAH) because of its pathogenic risk to a range of economically important crustaceans11,12,13. DIV1 has been reported to infect M. rosenbergii, leading to cumulative mortalities of over 80%14. Thus, breeding new strains is urgently needed for this important aquaculture species. Genome-wide association studies (GWAS), which rely on a high-quality genome assembly, aim to identify associations between phenotypic traits and genetic variants15. The trait-linked genetic variants identified by GWAS can be used directly as molecular markers in marker-assisted selection (MAS) and genomic selection (GS) of economically important species16. A whole-genome assembly therefore provides an essential genetic resource for developing molecular breeding programs and investigating virus-host interactions in M. rosenbergii.

Here we assembled a high-quality chromosome-level genome of M. rosenbergii by integrating Nanopore, Illumina, and high-throughput chromosome conformation capture (Hi-C) sequencing technologies. The assembled genome size is 3.18 Gb with a scaffold N50 of 62.73 Mb. Approximately 3.13 Gb (98.6%) of assembled sequences were anchored to 59 pseudo-chromosomes. A total of 17,436 protein-coding genes were annotated. Benchmarking Universal Single-Copy Orthologs (BUSCO)17 evaluation showed that the final assembly achieved a completeness of 94.5%, with annotation completeness reaching 91%. This genome assembly will provide valuable resources for breeding programs and evolutionary studies of this species.

Methods

Sample collection and sequencing

One adult male M. rosenbergii, collected from the experimental ponds of the Zhejiang Institute of Freshwater Fisheries in Huzhou, Zhejiang Province, China, was used for genome sequencing. High-quality DNA was extracted using a DNeasy Blood & Tissue Kit (Qiagen, Germany) in accordance with the manufacturer’s protocols. DNA quality and quantity were measured by standard agarose-gel electrophoresis and with a Qubit 3.0 fluorometer (Invitrogen, USA), respectively. Nanopore sequencing libraries were constructed and sequenced on the Nanopore PromethION platform (Oxford Nanopore Technologies, UK), generating a total of 296.45 Gb of Nanopore reads for genome assembly. For Illumina sequencing, short-insert paired-end (PE) DNA libraries were constructed in accordance with the manufacturer’s instructions and sequenced (2 × 150 bp) on the Illumina NovaSeq 6000 platform (Illumina, USA), yielding 161.13 Gb of paired-end sequencing data. Muscle samples of M. rosenbergii were collected to construct Hi-C libraries. The Hi-C library was constructed using a previously published approach18 and sequenced with 2 × 150 bp chemistry on the Illumina NovaSeq 6000 platform. In total, 277.46 Gb of Hi-C sequencing data were generated for chromosome scaffolding.

Hepatopancreas, heart, gill, eye, and muscle samples were collected from the M. rosenbergii specimen to construct libraries for strand-specific RNA sequencing (RNA-seq). Total RNA was extracted with TRIzol reagent (Invitrogen, USA). mRNA was enriched from total RNA using poly-T oligo-attached magnetic beads. rRNA was removed using a TruSeq Stranded Total RNA Library Prep Kit (Illumina, USA). A PE library was constructed using a VAHTS mRNA-seq V2 Library Prep Kit for Illumina (Vazyme, China) and sequenced (2 × 150 bp) on the Illumina NovaSeq 6000 platform (Illumina, USA).

Genome size estimation

Illumina reads that were of low quality (≥10% unidentified nucleotides and/or ≥50% nucleotides with a Phred score <5) or contaminated with sequencing adapters were filtered and trimmed with Fastp (v0.21.0)19. The genome size and heterozygosity were then estimated from the high-quality Illumina reads based on the k-mer frequency distribution. K-mer counts and the peak k-mer depth at k = 17 were obtained using Jellyfish (v2.3.0)20 with the -C setting. The Jellyfish results were then input into GenomeScope2 (v1.0.0)21 to estimate genome size and heterozygosity rate. The genome size of M. rosenbergii was estimated to be 3,042,399,425 bp, with a heterozygosity rate of 1.1% (Fig. 1a).
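The relationship between a k-mer histogram and genome size can be illustrated with a minimal sketch. It assumes a Jellyfish-style histogram (one "depth count" pair per line) and uses the simple estimate of total k-mers divided by peak depth. GenomeScope2 instead fits a mixture model that also accounts for heterozygosity and sequencing errors, so this is not the procedure used in this study. The function name, the min_depth error cutoff, and the toy histogram are illustrative assumptions.

```python
def estimate_genome_size(histo_lines, min_depth=5):
    """Estimate genome size as (total k-mers) / (peak k-mer depth).

    histo_lines: iterable of "depth count" strings (Jellyfish histo layout).
    min_depth:   hypothetical cutoff; k-mers below it are treated as errors.
    Returns (estimated size in bp, peak depth).
    """
    hist = {}
    for line in histo_lines:
        depth, count = line.split()
        hist[int(depth)] = int(count)

    # Ignore the low-depth tail, which is dominated by sequencing errors.
    kept = {d: c for d, c in hist.items() if d >= min_depth}
    if not kept:
        raise ValueError("no k-mers at or above min_depth")

    # Depth at which the largest number of distinct k-mers is observed.
    peak_depth = max(kept, key=kept.get)
    # Total number of k-mer occurrences among the retained depths.
    total_kmers = sum(d * c for d, c in kept.items())
    return total_kmers / peak_depth, peak_depth


# Toy histogram for illustration only; not data from this study.
toy = ["1 5000", "2 800", "18 100", "19 300", "20 600", "21 300", "22 100"]
size, peak = estimate_genome_size(toy)
print(f"peak depth = {peak}, estimated size = {size:.0f} bp")
```

In a heterozygous genome the tallest peak may correspond to heterozygous k-mers rather than the homozygous peak, which is one reason a model-based tool is preferable to this shortcut.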

Fig. 1

Genome assembly of M. rosenbergii. (a) Distribution of 17-mer frequency in the M. rosenbergii genome. The genome size was estimated to be 3.04 Gb and the heterozygosity rate 1.1%. (b) Circular plot of structural, functional, and evolutionary features of the M. rosenbergii genome. Tracks: a, GC content; b, gene density; c, simple sequence repeat (SSR) density; d, collinear regions detected within the genome.

Genome assembly

Low-quality Nanopore reads were filtered using a previously published Python script22. Three draft genome assemblies were then generated from the filtered Nanopore reads using Wtdbg2 (v2.5)23, Flye (v2.7)24, and Shasta (v0.4.0)25. The contigs of the three draft assemblies were error-corrected with the filtered Nanopore reads using Racon (v1.4.16)26 and subsequently polished with high-quality Illumina reads using Pilon (v1.22)27. The contig N50 values of the error-corrected Wtdbg2, Flye, and Shasta assemblies were 0.66 Mb, 1.21 Mb, and 1.22 Mb, respectively. The error-corrected contigs of the three assemblies were merged into longer sequences using quickmerge (v0.3)28.
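For readers unfamiliar with the N50 statistic used to compare the draft assemblies, the following minimal sketch shows how it is computed from a list of contig lengths. The function name and the toy lengths are illustrative and are not taken from this study.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must be non-empty")


# Toy contig lengths for illustration only.
toy_contigs = [2, 3, 4, 5, 6]
print(n50(toy_contigs))
```

The same function applies to scaffold lengths, which is how a scaffold N50 such as the one reported for the final assembly is defined.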

We used Hi-C data to correct misjoins, to order and orient contigs, and to merge overlaps. Low-quality Hi-C reads were filtered using Fastp (v0.21.0)19. Filtered Hi-C reads were aligned to the assembled contigs using Juicer (v1.5.7)29. Scaffolding was performed using the 3D-DNA pipeline (v180419)30. Juicebox (v2.16.0)29 was used to adjust the order and orientation of certain scaffolds in the Hi-C contact map and to help determine chromosome boundaries. Approximately 98.6% of the contig sequences were anchored to 59 chromosomes (Fig. 1b; Table 1). The longest and shortest chromosomes were 118 Mb and 0.73 Mb in length, respectively. The scaffold N50 reached 62.73 Mb for the final genome assembly (Table 1). BUSCO analysis with the arthropod (odb10) gene set showed that the assembled M. rosenbergii genome contained 94.5% complete single-copy orthologs17 (Table 2).

Table 1 Genome assembly statistics of M. rosenbergii.
Table 2 BUSCO evaluation of M. rosenbergii genome assembly.

Genome annotation

Repetitive elements in the M. rosenbergii genome assembly were identified using RepeatMasker (v4.1.0)31 through both de novo prediction and homology-based searches. For de novo prediction, RepeatModeler (v2.0.1)32 was used to build a de novo repeat library for M. rosenbergii, and sequences from the assembly were aligned to this library. For homology-based identification, the assembly was searched against known repeat databases. In total, 1.39 Gb (43.77%) of repetitive sequences were identified in the M. rosenbergii genome (Table 3). Long interspersed nuclear elements (LINEs) and long terminal repeat (LTR) retrotransposons were the largest classes of annotated transposable elements (TEs), accounting for 7.11% and 5.93% of the M. rosenbergii genome, respectively.

Table 3 Summary of annotated repeats in M. rosenbergii genome.

Protein-coding genes in the M. rosenbergii genome were predicted using an RNA-seq-based approach. RNA-seq reads of M. rosenbergii were aligned to the reference sequence using HISAT2 (v2.1.0)33. Gene models were predicted from the HISAT2 alignments using StringTie (v2.1.4)34, and coding regions were identified using TransDecoder (v5.5.0) (https://github.com/TransDecoder/TransDecoder). In total, 17,436 protein-coding genes were identified in the M. rosenbergii genome. Completeness of the predicted gene models was evaluated using BUSCO (v5.0.1)17 against the conserved Arthropoda dataset (odb10). In the predicted gene models, BUSCO analysis identified 91.0% complete conserved single-copy arthropod genes (Table 4). To assign functions to the predicted proteins, we aligned the M. rosenbergii protein models against the NCBI nonredundant (NR) amino acid sequences, Swiss-Prot, and Translated EMBL-Bank (TrEMBL) using Diamond (v0.9.24)35 with an E-value cutoff of 10⁻⁵. Protein models were also aligned against the InterPro database using InterProScan (v5.63)36. In total, 98.3% (17,140) of the predicted protein models were functionally annotated. Specifically, 15,735 protein models were annotated in the NCBI NR database, 11,766 in the Swiss-Prot database, 16,471 in the InterPro database, and 15,490 in the TrEMBL database.
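The E-value filtering step in the functional annotation can be sketched as follows. The example assumes the BLAST-style 12-column tabular layout (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore), keeps the lowest-E-value hit per query at or below the cutoff, and reports the fraction of proteins with at least one retained hit. The function name, identifiers, and toy records are hypothetical, and the study's own post-processing may differ.

```python
def best_hits(tabular_lines, max_evalue=1e-5):
    """Keep the lowest-E-value hit per query from BLAST-style tabular output.

    Assumes the 12-column layout:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    best = {}
    for line in tabular_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if evalue > max_evalue:
            continue  # hit does not pass the E-value cutoff
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best


# Toy records for illustration only; identifiers are made up.
toy_hits = [
    "gene1\tprotA\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-10\t150",
    "gene1\tprotB\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-30\t200",
    "gene2\tprotC\t40.0\t50\t30\t0\t1\t50\t1\t50\t1e-3\t35",
    "gene3\tprotD\t70.0\t80\t24\t0\t1\t80\t1\t80\t2e-6\t60",
]
hits = best_hits(toy_hits)
n_proteins = 3  # total number of query proteins in the toy example
print(f"{len(hits)} of {n_proteins} proteins annotated "
      f"({100 * len(hits) / n_proteins:.1f}%)")
```

Applied to the counts reported above, the same ratio gives 17,140 / 17,436, or about 98.3%.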

Table 4 BUSCO evaluation of predicted gene models of M. rosenbergii.

Data Records

Genomic Illumina sequencing data and transcriptomic sequences can be accessed in the NCBI Sequence Read Archive under accession numbers SRR30120401 (ref. 37) and SRR29314554 (ref. 38). The annotation file of the M. rosenbergii genome has been deposited at figshare (https://figshare.com/articles/dataset/Macrobrachium_rosenbergii_genome_assembly_and_gene_annotation/26068237)39. The final assembled M. rosenbergii genome has been deposited in NCBI GenBank under accession number GCA_040412425.1 (ref. 40).

Technical Validation

The completeness of the M. rosenbergii genome assembly was first evaluated using BUSCO (v5.1.0)17 against the conserved Arthropoda dataset (odb10). The BUSCO analysis indicated that 94.5% of conserved single-copy arthropod genes were captured in the M. rosenbergii genome (Table 2). Merqury (v1.3)41 was subsequently used to assess the quality of the assembly. The consensus quality value (QV) of the assembly was 30.29, suggesting good quality; the k-mer completeness is reported in Table 5. Lastly, the quality of the genome annotation was evaluated using BUSCO with the arthropoda_odb10 dataset (Table 4). This assessment showed that the final genome annotation encompassed 91% of the arthropoda_odb10 genes, indicating high completeness of the gene predictions.
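To make the QV figure easier to interpret, the sketch below converts between a Phred-scaled quality value and a per-base error rate using QV = -10 · log10(E). The function names are illustrative. The conversion itself is the standard Phred relationship and says nothing about how Merqury derives the error estimate from k-mers.

```python
import math


def qv_to_error_rate(qv):
    """Per-base error probability implied by a Phred-scaled quality value."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate):
    """Phred-scaled quality value implied by a per-base error probability."""
    return -10 * math.log10(error_rate)


error = qv_to_error_rate(30.29)
print(f"QV 30.29 -> error rate {error:.2e}, "
      f"about one error per {1 / error:.0f} bases")
```

On this scale, a QV of 30 corresponds to one expected error per 1,000 bases, so the reported value of 30.29 implies slightly fewer errors than that.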

Table 5 Statistics of assembly quality and completeness evaluation using Merqury.

To evaluate the reliability of the genome assembly and annotation of M. rosenbergii, a phylogenetic tree was constructed for M. rosenbergii and 11 other arthropod species. Protein sequences of the 11 species (Drosophila melanogaster, Daphnia magna, Bathynomus jamesi, Macrobrachium nipponense, Litopenaeus vannamei, Penaeus japonicus, Procambarus clarkii, Cherax quadricarinatus, Scylla paramamosain, Portunus trituberculatus and Callinectes sapidus) were downloaded for phylogenetic analysis. OrthoFinder (v2.5.4)42 was applied to identify and cluster gene families among the 11 reference species and M. rosenbergii. Single-copy orthologs in each gene cluster were aligned using MAFFT (v7.310)43. The alignments were trimmed using trimAl (v1.4.rev15)44. The trimmed alignments of single-copy orthologs were concatenated using PhyloSuite (v1.2.3)45. A maximum likelihood (ML) tree was generated from the concatenated alignments using IQ-TREE (v1.6.12)46, with D. melanogaster as the outgroup (Fig. 2). Branch support was estimated using both the SH-like approximate likelihood ratio test (SH-aLRT) and the ultrafast bootstrap approximation47,48. Divergence times were estimated using the MCMCTree module in the PAML (v4.9)49 package. The MCMCTree analysis used the ML tree constructed by IQ-TREE as a guide tree and was calibrated with divergence times obtained from the TimeTree database50 (minimum = 95 million years and soft maximum = 132 million years between P. japonicus and L. vannamei; minimum = 31 million years and soft maximum = 100 million years between P. trituberculatus and C. sapidus). M. rosenbergii and M. nipponense form a clade, with an estimated divergence time of approximately 120.63 million years ago (MYA) (CI: 57.58–201.66 MYA).
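The concatenation step that precedes tree inference can be illustrated with a minimal sketch. It joins per-gene alignments into a single supermatrix and records the coordinates of each gene as a partition. This is a generic illustration, not PhyloSuite's implementation; the function name, taxon labels, and toy sequences are hypothetical, and padding a missing taxon with gaps is an assumption made for robustness (single-copy orthologs are expected to be present in every species).

```python
def concatenate_alignments(alignments, species):
    """Join per-gene alignments into one supermatrix.

    alignments: list of dicts mapping species name -> aligned sequence.
    species:    taxa to include; a taxon missing from a gene is padded with gaps.
    Returns (supermatrix dict, list of 1-based inclusive (start, end) partitions).
    """
    matrix = {sp: [] for sp in species}
    partitions = []
    offset = 0
    for gene in alignments:
        lengths = {len(seq) for seq in gene.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one alignment must have equal length")
        width = lengths.pop()
        for sp in species:
            matrix[sp].append(gene.get(sp, "-" * width))
        partitions.append((offset + 1, offset + width))
        offset += width
    return {sp: "".join(parts) for sp, parts in matrix.items()}, partitions


# Toy alignments for illustration only; sequences are made up.
taxa = ["M_rosenbergii", "M_nipponense"]
genes = [
    {"M_rosenbergii": "MK-L", "M_nipponense": "MKVL"},
    {"M_rosenbergii": "GG", "M_nipponense": "GA"},
]
supermatrix, parts = concatenate_alignments(genes, taxa)
for name, seq in supermatrix.items():
    print(name, seq)
print(parts)
```

Keeping the partition coordinates alongside the supermatrix allows a tree-inference program to apply a separate substitution model to each gene if desired.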

Fig. 2

Species tree of 12 arthropod species with Drosophila melanogaster as the outgroup. Bootstrap values are listed in red next to each node. Estimated divergence times are listed beside each node, with 95% confidence intervals in parentheses. MYA, million years ago.