Background & Summary

Rhododendron L. is the largest genus in the Ericaceae, and is well known for its ornamental value, ecological significance and cultural importance. More than 600 species of Rhododendron have been recorded from China1,2, making it the largest genus of seed plants in the country3. However, despite the high Rhododendron species diversity in China, many Rhododendron species have recently been threatened due to climate change, habitat loss and disturbance from human activities4,5. According to The Red List of Rhododendrons, of the 665 taxa (including infraspecific taxa) native to China, 15 species are Critically Endangered (CR), 18 are Endangered (EN) and 180 are Vulnerable (VU). These threatened species account for 32.03% of Rhododendron species diversity in China, and there are further 183 species for which there are insufficient data to assess their conservation status (Data Deficient, DD) meaning that this genus is in urgent need of detailed investigation and conservation in China6.

Rhododendron vialii Delavay & Franch. is an evergreen shrub belonging to the subgenus Azaleastrum Planch. ex K. Koch (Fig. 1a), and it has the only red tubular-funnelform corolla of this subgenus. According to specimen records, R. vialii was once extremely abundant, but has become gradually rarer in recent years due to deforestation, overcollection and habitat destruction7. R. vialii was evaluated as VU in The Red List of Rhododendrons6, and it is also included on the 2021 edition of the List of Yunnan Protected Plant Species with Extremely Small Populations (PSESP), which is part of a project for rescuing the most threatened plant species in China8,9.

Fig. 1
figure 1

Morphological characters (a) and features of the haplotype-resolved genome assembly (b) of R. vialii. From the outer ring to the inner ring are the distributions of pseudochromosomes, class I TE density, class II TE density, protein-coding gene density, proportion of tandem repeat, GC content and syntenic blocks.

In this study, we sampled an ex-situ conserved R. vialii and successfully assembled a high-quality haplotype-resolved, nearly telomere-to-telomere genome by combining PacBio high-fidelity (HiFi) reads and high-throughput chromosome conformation capture sequencing (Hi-C) reads. The final assembled genome contains haplotype A (532.73 Mb) and haplotype B (521.98 Mb), with contig N50 length of 35.67 Mb and 34.70 Mb, respectively (Fig. 1b). Based on the karyotype (2n = 26)10, approximately 99.92% of the assembled data was anchored to pseudochromosomes, and our final assembly contains 14 gapless assembled chromosomes.

We ran a Benchmarking Universal Single-Copy Orthologs assessment (BUSCO11, v. 5.3.2) with the lineage dataset embryophyta_odb10. The complete BUSCOs (including single-copy and multi-copy) of the two haplotypes accounted for 98.5% and 98.1%, respectively, indicating good completeness of the genome. In addition, we annotated 551.06 Mb (52.19% of the whole genome) corresponding to repeat elements. A total of 60,926 protein-coding genes were identified, of which 93.82% could be functionally annotated. This is the first report of a R. vialii genome sequence, and we believe that it will provide an important resource allowing us to explore the mechanisms underlying threats to this species, as well as its evolutionary history and further utilization on ornamental horticulture.

Methods

Sampling

For genomic DNA extraction, fresh young leaves of R. vialii were collected from a single adult plant in Kunming Botanical Garden, Kunming Institute of Botany, Chinese Academy of Sciences. We also collected roots, branches, leaves, buds and fruits for transcriptome sequencing. These materials were frozen directly in liquid nitrogen and were then transferred to –80°C for preservation. The related sequencing services were performed by a commercial sequencing provider (Wuhan Benagen Technology Co. Ltd. Wuhan, China).

Genome sequencing

A modified CTAB methods was performed to extract total DNA from young R. vialii leaves12. The concentration of DNA was assessed using NanoDrop (NanoDrop Technologies, Wilmington, DE, USA) and a Qubit 3.0 fluorometer (Life Technologies, Carlsbad, CA, USA). 1% agarose gel electrophoresis was then used to assess the purity and integrity of the resulting DNA. The short-read library with a DNA-fragment insert size of 200–400 bp was prepared using 1 μg genomic DNA following the manufacturer’s instructions (BGI) and was subject to paired-end (PE) sequencing on a DNBSEQ-T7 platform (BGI lnc., Shenzhen, China) using a PE 150 model. This produced 97.39 Gb (~649 M reads) of raw data, meaning approximately 150× genome coverage (Supplementary Table 1).

Before long-read sequencing, the DNA was purified using a DNeasy Plant Mini Kit (Qiagen, Germantown, MD, USA), and the integrity of the DNA was evaluated with a Femto Pulse (Agilent Technologies, Santa Clara, CA, USA). Megaruptor 3 (Diagenode SA., Seraing, Belgium) was used to shear 8 μg genomic DNA, and these fragments were then concentrated using AMPure PB magnetic beads (Pacific Biosciences, Menlo Park, CA, USA). Each PacBio single molecule real-time (SMRT) library was prepared using a SMRT bell express template prep Kit 2.0 (Pacific Biosciences, Menlo Park, CA, USA), with insert sizes of 15 kb selected using BluePippin system (Sage Science, Beverly, MA, USA). The library was sequenced on a Pacific Bioscience Sequel II platform in CCS mode, and the raw data was converted into high-precision HiFi reads using the CCS workflow13 (v. 6.3.0) with the standard parameters. From this process, we obtained 32.88 Gb (~60×) of HiFi data with an average read length and N50 read length of about 18 kb (Supplementary Table 2).

Hi-C library construction and sequencing

Hi-C libraries were prepared following a modified protocol from Belton et al.14. Fresh leaf tissue was fixed in 2% formaldehyde solution, and the cross-linked DNA was then digested with DpnII. Biotin-labeled adapters were attached at the sticky ends of the digested fragments to form chimeric junctions, which were enriched and trimmed to fragments of about 450 bp for pair-end sequencing on a DNBSEQ-T7 platform. Approximately 75 Gb (~500 M reads) of Hi-C data was generated for subsequent pseudochromosome assist assembly (Supplementary Table 3).

Transcriptome sequencing

Materials for transcriptome sequencing were homogenized and total RNA was extracted using R6827 Plant RNA Kit (Omega Bio-Tek, Norcross, GA, USA) following the manufacturer’s instructions. Subsequently, SQK-PCS109 and SQK-PBK004 Kits (Oxford Nanopore Technologies, Oxford, UK) were combined to prepare the library, and the library was sequenced using a Nanopore PromethION sequencer. Finally, a total of 10.36 Gb (~10.89 M reads) full-length RNA-seq data were obtained for genome annotation (Supplementary Table 4).

Preliminary genome survey

The preprocessor, fastp15 (v. 0.19.3) was used to filter out the adapter sequences, overly short reads and low-quality reads from the next generation sequencing data using the default parameters. Jellyfish16 (v. 2.2.10) was then used to calculate the frequency distribution of the depth of clean data with 19-mers, and the basic features of the genome were estimated with the software GCE17 (v. 1.0.0). The estimated genome size of R. vialii is about 525.63 Mb, with a heterozygosity of 0.89% and a duplication rate of 43.47% (Supplementary Table 5 and Supplementary Fig. 1).

Chromosome-level genome assembly

HiFi reads and Hi-C short reads were used as a combined input for the genome assembler Hifiasm18 (v. 0.16.1-r375) and were assembled into a pair of haplotype-resolved assembly contigs (haplotype A and haplotype B) in Hi-C mode with the default parameters. Juicer19 (v. 1.5.6) was then used to map clean Hi-C reads to the contigs, and Hi-C-assisted initial chromosome assembly was conducted using the 3D-DNA20 (v. 180922) algorithm with the standard procedure. Chromosome boundaries were adjusted and the scaffold corrected using the manually operated Juicebox21 (v. 1.11.08) module, and the generated file was used as input for 3D-DNA for re-scaffolding by chromosome. Juicebox was used again for re-quality control and adjustment of mis-joins and orientation to generate chromosome scaffolds and un-anchored sequences. Additionally, TGS-GapCloser22 (v. 1.1.1) was employed to fill gaps of the genome based on the HiFi reads.

Because some of the telomere assemblies were incomplete or missing, Minimap223 (v. 2.24-r1122) was used to map the HiFi reads to the chromosome, and the reads aligned to the positions of the telomeres were assembled into contigs using Hifiasm. These contigs were then mapped to the chromosomes to extend the chromosomal ends. GetOrganelle24 (v. 1.7.5) was used to assemble the chloroplast and mitochondrial genomes. After the above steps were completed, the software Nextpolish25 (v. 1.3.1) was employed to polish the assembly based on the short reads from two iterations, and Redundans26 (v. 0.13c) was used to remove redundancies such as haplotigs and rDNA fragments. Overall, about 99.92% of the assembled data was anchored to pseudochromosomes in the two haplotypes (Supplementary Table 6), and the chromosome number was set based on the results of the karyotype analysis (2n = 26). Finally, we obtained a high-quality haplotype-resolved gapless genome of R. vialii.

The assembly (approx. 1.05 Gb) contained two complete haplotypes, haplotype A and haplotype B, with genome sizes of 532.73 Mb and 521.98 Mb, respectively (Table 1). The genome size previously estimated based on K-mers was similar to these assemblies, with the main deviations in rDNA array. The contig N50 length of haplotype A and haplotype B were 35.67 Mb and 34.70 Mb, respectively. The number of gaps was fewer than 10 in both haplotypes (Table 1), and 14 gapless chromosomes were assembled (Supplementary Table 6), indicating good continuity of assembly.

Table 1 Summary of the Rhododendron vialii genome assembly data.

Identification of repetitive elements

EDTA27 (v. 1.9.9; parameters: --sensitive 1 --anno 1) was used for de novo identification of transposable elements to generate the TE library, and RepeatMasker28 (v. 4.0.7) was employed to detect repetitive elements in the assembled genome of R. vialii with the default parameters. We identified a total of 1,534,208 repetitive sequences (~ 551.06 Mb), accounting for 52.19% of the assembled genome, of which long terminal repeats (LTRs) and terminal inverted repeats (TIRs) had the highest proportions, accounting for 27.18% and 17.25% of the genome, respectively (Supplementary Table 7).

Gene identification and functional annotation

Transcriptome assembly was based on multiple strategies: (i) next-generation RNA-seq data was downloaded from the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) database (SRR13338561)29 for de novo assembly using Trinity30 (v. 2.0.6); (ii) short-reads were aligned to the reference genome using Hisat231 (v. 2.1.0) and then assembled using StringTie32 (v. 1.3.5); (iii) Minimap2 was used to map the long-reads to the genome and StringTie was employed for further assembly. Ultimately, PASA33 (v. 2.4.1) was used to combine and optimize the transcriptomes obtained by the above methods and to generate a high-quality transcriptome with a total length of 122.34 Mb and 114,558 transcripts (Supplementary Table 8). The publicly available combined 285,362 non-redundant protein sequences (including R. griersonianum34, R. molle35, R. delavayi36, R. simsii37, R. williamsianum38, R. ovatum39, R. henanense subsp. lingbaoense40, Vaccinium macrocarpon41, Actinidia chinensis42, Camellia sinensis43, Camptotheca acuminata44, Coffea canephora45 and Vitis vinifera46) were used as homologous protein evidence for this gene annotation. The PASA process was used to annotate the structure of the genomic genes based on transcript evidence, and the full-length genes were detected by comparison with reference proteins. The parametric model of AUGUSTUS47 (v. 3.4.0) was trained with the full-length gene set for five rounds of optimization.

The MAKER248 (v. 2.31.9) pipeline was used for annotation based on ab initio prediction, transcript evidence and homologous protein evidence. Briefly, AUGUSTUS was used for ab initio protein-coding gene prediction after masking the repetitive regions of the genome with RepeatMasker. The transcripts were then aligned to the repeat-masked genome using BLASTN and TBLASTX, while BLASTX was employed to aligned protein sequences to the genome. The previous results were then optimized using Exonerate49 (v. 2.2.0) and the hints files were generated based on the results. Integration of the predicted gene models was conducted using AUGUSTUS and UTR annotations were added according to the EST evidence.

Considering the relatively low accuracy of annotation results from the MAKER2 process, we also used EVidenceModeler50 (EVM, v. 1.1.1) to integrate the predictive results obtained from the PASA and MAKER2 annotations. To avoid introducing TE coding regions, TEsorter51 (v. 1.4.1) was used to identify the TE protein domains on the genome and EVM was used to mask them. Additionally, PASA was used to optimize the results obtained by EVM, and UTR sequences and alternative splicing were added, and overly short and abnormal gene annotations were removed. For non-coding RNAs, we used tRNAScan-SE52 (v. 2.0.7) to identify the tRNAs and Barrnap (v. 0.9)53 to detect the rRNAs. RfamScan54 (v. 14.2) was used to annotate various non-coding RNAs. Finally, all annotation results were merged to remove the redundancy and a complete gene set was obtained.

Overall, a total of 60,926 protein-coding genes have been successfully predicted, with an average length of 5,378.1 bp. Among them, there were 79,618 coding sequences (CDS) and 460,225 exons, and the mean length was 1,275.7 bp and 328.4 bp, respectively (Table 2). In addition, we also identified 5,538 non-coding genes, which contained 3,039 rRNAs, 770 tRNAs and 1,729 other ncRNAs.

Table 2 Gene annotation statistics.

For the functional prediction of protein-coding genes, three strategies were used. The predicted genes were aligned with the eggNOG v. 5.0 homologous gene database using eggNOG-mapper55 (v. 2.0.1) for Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEEG) annotation. The protein sequences were matched to the protein databases, including Swiss_Prot, TrEMBL, NR (non-redundant protein) and the Arabidopsis database using DIAMOND56 (v. 2.0.4; Identity > 30%, E-value < 1e-5) to determine the best alignment of the genes. To obtain the conserved amino acid sequences, motifs and domains of the predicted proteins, InterProScan57 (v. 5.14–53.0) was used to search for similarity of domain according to the sub-databases PRINTS, Pfam, SMART, PANTHER and CDD of the InterPro database. Finally, 57,225 genes were functionally annotated in at least one of the above databases, accounting for 93.82% of the predicted protein-coding genes (Supplementary Table 9).

Chromosomal synteny analysis

Minimap2 was used to compare the assembly of R. vialii with the previously published R. griersonianum genome, which showed that they had essentially the same order of chromosomes (Supplementary Fig. 2a). Meanwhile, the dot-plot of syntenic blocks between the two haplotypes showed a similar result (Supplementary Fig. 2b).

To explore the differences between the two haplotypes, the NUCmer module embedded in MuMmer 458 was used to align the whole genome, and the command delta-filter was used to remove short and low-quality alignments. After the subprogram show-coords was used for format conversion, we ran SyRI59 (v. 1.6) to detect variations. A total of 3,695 syntenic regions (~458 Mb) were detected, indicating high similarity between the two haplotypes. However, we also identified many variations, including 3,152,185 SNPs and 214,391 small insertions/deletions (indels, < 500 bp). We additionally identified 2,157 and 1,486 duplications in haplotypes A and B, respectively. Other differences between the two assemblies included 1,108 translocations and 48 inversions. Notably, two relatively large inversions of more than 1 Mb were found on Chr1 and Chr10, respectively (Supplementary Fig. 3).

Data Records

The relevant data reported in this paper have been deposited in the National Genomics Data Center (NGDC)60,61, Beijing Institute of Genomics, Chinese Academy of Sciences/China National Center for Bioinformation, under the BioProject accession number PRJCA015878 that is publicly accessible at https://ngdc.cncb.ac.cn/gwh. BGI short-reads, PacBio HiFi long-reads, Hi-C reads and Iso-Seq data have been deposited in the Genome Sequence Archive (GSA) in NGDC under the accession number CRR71964762, CRR71964663, CRR71964564 and CRR71964865. The final chromosome assembly and annotation data were deposited in the Genome Warehouse (GWH) in NGDC under the accession number GWHCAXW0000000066. The sequence data were also deposited in the SRA database with accession number SRR2450194867 SRR2450194968 SRR2450194769 and SRR2450194670 under the BioProject accession number PRJNA971245. And the final genome assembly has also stored in the GenBank with accession number GCA_030253575.171 and GCA_030253555.172.

Technical Validation

Evaluation of the assembled genome

The BUSCO analysis with the lineage dataset embryophyta_odb10, which was conducted to assess genome completeness, showed that, for 1,614 expected genes from the embryophyta, the proportions of complete BUSCOs (including single-copy and multi-copy) of these two haplotypes were 98.5% and 98.1%, respectively, indicating good genomic completeness (Table 3). Short-reads and long-reads were mapped to the genome with BWA73 (v. 0.7.17-r1188) and Minimap2, and the short RNA-seq reads were mapped to the assembly using Hisat2. After filtering out non-primary alignments, the map ratio and coverage of reads were calculated. We found that different types of sequencing data had a high genome coverage (Table 4). The distribution of coverage depth of these sites on the genome matched the Poisson distribution and there was no obvious heterozygous peak. The BUSCO single-copy and multi-copy genes have approximately the same depth, indicating that the assembly had no redundancy (Fig. 2).

Table 3 Statistics of BUSCO evaluation of the two haplotypes and proteins.
Table 4 Statistics of map rate and coverage of different types of sequencing reads.
Fig. 2
figure 2

The distribution of coverage depth of the genome (left) and BUSCO core region (right) evaluated by the next-generation data (upper) and HiFi data (lower).

To assess single base error rate and heterozygosity, the next-generation reads were mapped to the genome using BWA, and the upstream data was input to bcftools74 (v. 1.11) for variant detection. The calculated heterozygosity was approximately 0.0038% based on the heterozygous sites, and the single base error rate was about 0.00021% based on the homozygous loci. There was no obvious guanine-cytosine (GC) bias in the coverage depth analysis based on second and third generation data under different GC content (Fig. 3). Hi-C reads were mapped to the final version of the assembly using Juicer. In the Hi-C heatmap, there were strong interactive signals of the 13 chromosomes around the diagonal, suggesting that the two assemblies were without obvious chromosome assembly errors (Fig. 4).

Fig. 3
figure 3

Coverage depth distribution of HiFi data (a) and the next-generation data (b).

Fig. 4
figure 4

Hi-C interactive heatmap (with a resolution of 100 kb and the minimum mapping quality was one) of the two haploids. The strength of the interaction was represented by the color from yellow (low) to red (high).

The repetitive sequences identified by RepeatMasker were mapped to the genome to determine the position of the telomeres and other characteristic sequences on the chromosomes. Most of the chromosomes assembled complete telomere sequences (TTTAGGG), and only a few telomeres were missing or incomplete. All chromosomes contained a high tandem repeat (TGGTACCGTATGGATGTACTCGTACGGTATTGTACCGTTTTGGTGTGGTT), which is probably the centromere. In addition, the 18-5.8-28S rDNA and 5S rDNA arrays were detected on Chr11 and Chr10 respectively (Fig. 5). In summary, this assembly can by described as a nearly telomere-to-telomere genome.

Fig. 5
figure 5

Bar-plot of density and distribution of several repeated elements on chromosome sequences of R. vialii.

Evaluation of the gene annotation

The annotated and integrated proteins were also evaluated using BUSCO with the lineage dataset embryophyta_odb10. Briefly the proportion of complete core gene coverage was 98.1% (including 4.3% single-copy genes and 93.8% duplicated genes) and there were only a few fragmented (0.4%) and missing (1.5%) genes (Table 3), indicating that this annotation was of high quality.