Background & Summary

Species of the genus Cotoneaster Medic. belong to the Malinae subtribe of the Rosaceae family1, and are primarily distributed in continental Eurasia, with a remarkable species diversity in the biodiversity hotspots of the Himalayas and the Hengduan Mountains (HDM)2. Taxonomic difficulties for this genus have been caused by various evolutionary events, including hybridization, polyploidization, and apomixis. A comprehensive phylogenetic analysis of this genus has been conducted using genome-skimming data, but with the genome of Eriobotrya japonica serving as the mapping reference3, which might introduce mapping errors, incorrect alignments, difficulties in identifying orthologous genes, and genome annotation issues.

Based on morphological characteristics and molecular evidence, two subgenera or sections have been proposed: Cotoneaster, characterized by predominantly red or pink flowers with erect petals, and Chaenopetalum, noted for its primarily white flowers with spreading petals2,3,4,5. Notably, only approximately 10% of Cotoneaster species are diploid2. Cotoneaster glaucophyllus, a representative member of the Chaenopetalum subgenus and a diploid species, has a distinct distribution in the southeastern Hengduan Mountains and on the Yunnan-Guizhou Plateau. It is a semi-evergreen shrub that blossoms in late summer, exhibiting dense, showy, fragrant white flowers, and bears long-lasting fruits in early winter, potentially making it an important ornamental plant2,6,7. With continuous advancements in sequencing technology, abundant genome resources for numerous Rosaceae species have been extensively documented8,9,10,11,12. However, the lack of whole-genome sequencing in Cotoneaster species has been a significant obstacle to further understanding the gene functions, evolutionary history, and conservation of this complicated genus (up to 370 species).

Using the Pacific Biosciences (PacBio) platform, we generated ~117 Gb of DNA continuous long reads (CLRs) and obtained ~48 Gb of full-length transcriptome sequences. Additionally, we sequenced ~104 Gb of DNA reads and ~10 Gb of RNA reads (2 × 150 bp) as well as ~62 Gb of high-throughput chromosome conformation capture (Hi-C) reads based on the Illumina HiSeq platform. With the aid of Hi-C technologies, we finally provided a high-quality genome sequence for the diploid species (2n = 2x = 34) of C. glaucophyllus (Fig. 1).

Fig. 1

Photographs taken from the sampled plant (a–d) of Cotoneaster glaucophyllus. (a) habit; (b) inflorescences with floral buds; (c) bloomed flowers, showing white filaments and purple anthers; (d) mature fruits.

Methods

Sample collections

Fresh leaves, fruits and roots were collected from an adult plant of C. glaucophyllus (Xiajinchang, Malipo County, Yunnan Province, China; 23°08′26.57″N, 104°48′34.54″E; a.s.l. 1959 m; Fan17545, SYS!). The samples were separately wrapped in foil paper on 28 September 2019 (Fig. 1a,d). Immediately thereafter, they were frozen in liquid nitrogen, preserved in Drikold, and sent to Novogene Bioinformatics Technology Co., Ltd (Beijing, China). On 15 June 2020, we collected flower tissue from the same plant (Specimen: Fan17951, SYS!) (Fig. 1b,c).

DNA and RNA extraction and genome sequencing

Total DNA was extracted from fresh leaves using the Plant Genomic DNA Kit (DP305, Tiangen Biotech Co., Ltd., Beijing, China). The qualified DNA was used to construct libraries for single-molecule real-time (SMRT) sequencing on the Pacific Biosciences system (Menlo Park, CA, USA), Illumina sequencing, and Hi-C sequencing. The 20 kb library was prepared following the manufacturer’s protocol13. For the Illumina DNA paired-end library, the NEBNext® Ultra™ DNA Library Prep Kit was used according to the provided instructions, with an insert size of 350 bp. The Hi-C library was prepared following standard procedures14.

Samples including fresh leaves, flowers, fruits, roots, and stems were pooled for total RNA extraction using the TIANGEN RNAPrep Pure Plant kit (DP432, Tiangen Biotech Co. Ltd., Beijing, China). Subsequently, the qualified RNA was used to synthesize full-length cDNAs with the SMARTer PCR cDNA Synthesis Kit (Biomarker, Beijing). Full-length transcriptome sequencing was performed on the PacBio Sequel platform. Additionally, short RNA-Seq reads (2 × 150 bp) from leaf samples only were generated and processed15 to facilitate the correction of the long-read RNA sequencing data and genome annotation.

PacBio long-read sequencing was performed using the PacBio Sequel system, while high throughput sequencing (2 × 150 bp) was carried out using an Illumina HiSeq sequencer. Both sequencing processes were conducted at Novogene Bioinformatics Technology Co., Ltd. (Beijing, China).

Pre-estimation of genomic characteristics

The generated Illumina sequencing data were first processed using the NGSQC Toolkit v2.3.316. This processing discarded reads with adaptor contamination, reads with more than 10% unknown nucleotides (N), and read pairs in which either read contained over 20% of bases with a quality score below 5. We then performed a genome survey using Jellyfish v.2.2.717 with the k-mer size set to 17 (Fig. 2). Using a k-mer-based statistical approach, GenomeScope v.2.018 was used to estimate genome heterozygosity, repeat content, and size. To initially assess the genomic complexity, we employed SOAPdenovo v.2.0.419 to generate a de novo draft assembly using a k-mer length of 41. The assembled contigs were then used to calculate the guanine-cytosine (GC) content. The estimated genome size was 625.87 Mb, with a heterozygosity rate of 0.55% and a repeat sequence proportion of 54.97%. The estimated GC content was 38.65%.
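The three read-filtering criteria described above can be illustrated with a minimal Python sketch. It is not the NGSQC Toolkit implementation: the function names, the tuple representation of a read, and the exact-substring adapter check are assumptions made purely for illustration, and the adapter string in the usage note is only an example.

```python
def fraction(count, total):
    """Return count/total, or 0.0 for an empty read."""
    return count / total if total else 0.0


def keep_pair(read1, read2, adapter, max_n_frac=0.10, min_qual=5, max_lowq_frac=0.20):
    """Decide whether a read pair passes the three filters described in the text.

    Each read is a (sequence, phred_scores) tuple (an illustrative representation).
    The pair is discarded if either read:
      * contains the adapter sequence (exact substring match, a simplification),
      * has more than 10% unknown bases (N), or
      * has more than 20% of bases with a quality score below 5.
    """
    for seq, quals in (read1, read2):
        seq = seq.upper()
        if adapter and adapter.upper() in seq:
            return False
        if fraction(seq.count("N"), len(seq)) > max_n_frac:
            return False
        low_quality = sum(1 for q in quals if q < min_qual)
        if fraction(low_quality, len(quals)) > max_lowq_frac:
            return False
    return True
```

For example, `keep_pair(("ACGT" * 5, [30] * 20), ("ACGT" * 5, [30] * 20), "AGATCGGAAGAGC")` keeps the pair, whereas a pair in which one read has 3 N bases out of 20 is dropped.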

Fig. 2

Frequency distribution of depth and K-mer numbers (A) and frequency distribution of depth and K-mer types (B).
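The kind of k-mer depth distribution shown in Fig. 2 can be reproduced on a toy scale with the following Python sketch. It is a simplified stand-in for Jellyfish and GenomeScope, not their actual algorithms: it counts k-mers on one strand only, and it estimates genome size with the common rule of thumb (total k-mers divided by the depth of the main peak) rather than by model fitting. All function names are hypothetical.

```python
from collections import Counter


def count_kmers(reads, k=17):
    """Count every k-mer (forward strand only) that contains no N."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return counts


def depth_histogram(kmer_counts):
    """Map each depth to the number of distinct k-mers observed at that depth."""
    return Counter(kmer_counts.values())


def estimate_genome_size(histogram, min_depth=1):
    """Rule-of-thumb estimate: total k-mers divided by the depth of the main peak.

    Depths below min_depth can be excluded to ignore likely sequencing errors.
    """
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(usable, key=usable.get)
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers // peak_depth
```

On real data the low-depth error peak would normally be excluded (for example with a larger `min_depth`), and a model-based tool remains preferable for heterozygous genomes.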

Genome assembly and quality assessment

The FALCON assembler20 was initially employed to perform self-correction of PacBio subreads. Subsequently, preassembled reads were assembled using the overlap-layout-consensus (OLC) algorithm, resulting in consensus contigs. To enhance the accuracy of the results, high-quality contigs were further corrected using Illumina short DNA reads through Pilon21. Leveraging the clean Hi-C data, the LACHESIS tool22 was utilized to scaffold the assembly, ultimately yielding a chromosome-level assembly. The de novo genome assembly was 563.3 Mb in length, with a contig N50 of ~6 Mb and a scaffold N50 of ~31 Mb (Table 1).
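For readers unfamiliar with the N50 statistic reported above, the following minimal Python sketch shows the standard definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. The function name is illustrative, and the example lengths in the usage note are invented.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the cumulative sum reaches half the total is the N50.
        if running >= half_total:
            return length
    raise ValueError("no sequence lengths supplied")
```

For instance, `n50([2, 3, 4, 5, 6])` returns 5, because the two longest sequences (6 + 5 = 11) already cover half of the total length of 20.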

Table 1 Statistics of genome assembly.

Among the 211 contigs, 124 were anchored to 17 pseudochromosomes (538.4 Mb, 95.59%) (Fig. 3, Table 2) and the remaining 87 were unanchored (24.9 Mb, 4.41%) (Table 2, Table S1). The GC content of these pseudochromosomes ranged from 37.90% to 39.13% (Table 2).
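Per-sequence GC content of the kind summarized in Table 2 can be computed as in the sketch below. Whether ambiguous bases (N) were excluded from the denominator in the original analysis is not stated; excluding them here is an assumption, and the function name is hypothetical.

```python
def gc_content(sequence):
    """Return GC content as a percentage of unambiguous bases (A, C, G, T)."""
    sequence = sequence.upper()
    gc = sequence.count("G") + sequence.count("C")
    at = sequence.count("A") + sequence.count("T")
    called = gc + at
    # Sequences consisting only of ambiguous bases are reported as 0.0.
    return 100.0 * gc / called if called else 0.0
```

Applied to each pseudochromosome sequence in turn, this would give one percentage per sequence.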

Fig. 3

Hi-C interaction heatmap within pseudochromosomes of Cotoneaster glaucophyllus.

Table 2 Summary of 17 pseudochromosomes and 87 contigs.

To comprehensively evaluate the reliability of the assembly, multiple assessments were performed in addition to considering the contig/scaffold N50 length. First, the completeness of the assembly was assessed by searching the assembled genome against the BUSCO (Benchmarking Universal Single-Copy Orthologs) database v2.023 (BUSCO, RRID:SCR_015008) and with CEGMA v2.524 (Core Eukaryotic Genes Mapping Approach, RRID:SCR_015055). The BUSCO database contains 1,440 conserved core genes in terrestrial plants, while CEGMA includes a subset of the 248 most highly-conserved Core Eukaryotic Genes (CEGs). Second, the consistency between the assembly and paired-end Illumina short reads was evaluated by calculating the mapping and coverage rates. The Burrows‒Wheeler Aligner (BWA) v0.7.1525 was used to align the 150 bp short reads to the assembly. Third, assembly accuracy was assessed by SNP calling with SAMtools v1.926 and BCFtools v1.9 (https://github.com/samtools/bcftools) based on the above mapping results. The rates of homozygous and heterozygous single-nucleotide polymorphisms (SNPs) were also determined.
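As an illustration of the third assessment, the sketch below tallies homozygous and heterozygous SNPs from the genotype (GT) field of single-sample VCF records and expresses them relative to a genome length. It is a simplified reading of the VCF format, not the SAMtools/BCFtools workflow itself; the function name, the diploid-only assumption, and the choice of denominator are illustrative assumptions.

```python
def snp_rates(vcf_lines, genome_length):
    """Count homozygous-alternate and heterozygous SNPs in single-sample VCF lines."""
    hom = het = 0
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4].split(",")
        # Keep only single-base substitutions (skip indels and placeholder alleles).
        if len(ref) != 1 or any(len(a) != 1 or a in {".", "*"} for a in alts):
            continue
        keys = fields[8].split(":")
        if "GT" not in keys:
            continue
        genotype = fields[9].split(":")[keys.index("GT")]
        alleles = genotype.replace("|", "/").split("/")
        if len(alleles) != 2 or "." in alleles:
            continue  # only fully called diploid genotypes are counted
        if alleles[0] != alleles[1]:
            het += 1
        elif alleles[0] != "0":
            hom += 1
    return {
        "homozygous": hom,
        "heterozygous": het,
        "homozygous_pct": 100.0 * hom / genome_length,
        "heterozygous_pct": 100.0 * het / genome_length,
    }
```

A low homozygous SNP rate is usually read as evidence of few base-level errors in the assembly, since reads from the same individual should rarely disagree with the consensus at both alleles.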

Genome annotation

We applied a combined strategy that used both de novo search and homology alignment to identify repeats. A de novo repetitive element database was generated using LTR_FINDER v.1.0.627, RepeatScout v.1.0.528, Piler-DF v2.429, and RepeatModeler v.2.0.130 with the default parameters. The raw transposable element (TE) library included all repeat sequences that were longer than 100 bp and had less than 5% “N” gaps. To obtain a nonredundant library, Repbase31 and the raw TE library were combined and clustered using uclust. Finally, RepeatMasker v.4.1.032 was employed for repeat identification using the nonredundant library. The homology-based approach used RepeatMasker v.4.1.032 and the Repbase31 library to identify known TEs. TE-related protein sequences were then identified by aligning the genome sequences against the TE protein database with RepeatProteinMask v.4.1.032. Tandem repeats were predicted using Tandem Repeats Finder v.4.0933. Repeat sequences accounted for 55.60% of the genome assembly, including 4.19% tandem repeat sequences and 50.33% long terminal repeat retrotransposons (LTR-RTs) (Table 3).
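The inclusion rule for the raw TE library (longer than 100 bp, less than 5% N) can be expressed as a short Python sketch. The function names and the dictionary representation of the library are hypothetical; the thresholds are those stated in the text.

```python
def keep_repeat(sequence, min_length=100, max_n_fraction=0.05):
    """Return True if a candidate repeat is longer than 100 bp and has < 5% N."""
    if len(sequence) <= min_length:
        return False
    n_fraction = sequence.upper().count("N") / len(sequence)
    return n_fraction < max_n_fraction


def filter_library(records):
    """Keep the entries of a {name: sequence} mapping that pass keep_repeat."""
    return {name: seq for name, seq in records.items() if keep_repeat(seq)}
```

Redundancy removal (clustering with uclust in the text) would be a separate step applied to the filtered library.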

Table 3 Summary of interspersed repetitive sequences.

Multiple approaches, including ab initio prediction, homology-based prediction, and full-length transcript evidence, were employed to annotate gene models. For ab initio gene prediction, GeneWise v.2.4.134, Augustus v3.2.335, Geneid v1.436, Genscan v3.137, GlimmerHMM v3.0438, and SNAP39 were used. Homologous protein sequences of Malus × domestica40, Fragaria vesca41, Rosa chinensis42, Prunus persica43, Pyrus betuleafolia44, and Eriobotrya japonica12 were downloaded from NCBI (https://www.ncbi.nlm.nih.gov/genome/) and aligned to the assembly using tBLASTn v2.2.2645 (E-value ≤ 1e-5). The matching proteins were then aligned to the homologous genome sequences with GeneWise v2.4.134 to obtain accurate spliced alignments. The IsoSeq pipeline (https://github.com/PacificBiosciences/IsoSeq) was employed to process the full-length transcriptome sequencing data. The generated reads were aligned to the C. glaucophyllus assembly using HISAT v.2.0.446 with the default parameters, and the alignments were further processed with StringTie v.1.3.347. The nonredundant reference gene set was created by merging the genes predicted as described above with EVidenceModeler v1.1.148, using PASA49 (Program to Assemble Spliced Alignments) terminal exon support and including masked transposable elements as an input for gene prediction. Furthermore, gene structure and gene elements, including average transcript length, average CDS length, and average exon and intron length, were compared between Cotoneaster glaucophyllus and the six related species listed above.
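The gene-element comparison mentioned at the end of this paragraph rests on simple coordinate arithmetic. The sketch below shows one way to derive average exon and intron lengths from exon coordinates; the input layout (a mapping from transcript ID to 1-based inclusive exon intervals, as in GFF) and the function names are assumptions for illustration, not the authors' scripts.

```python
def exon_intron_lengths(exons):
    """Return (exon_lengths, intron_lengths) for one transcript.

    exons is a list of (start, end) pairs in 1-based inclusive coordinates.
    """
    ordered = sorted(exons)
    exon_lengths = [end - start + 1 for start, end in ordered]
    # An intron spans the gap between the end of one exon and the start of the next.
    intron_lengths = [
        nxt_start - cur_end - 1
        for (_, cur_end), (nxt_start, _) in zip(ordered, ordered[1:])
    ]
    return exon_lengths, intron_lengths


def mean_lengths(transcripts):
    """Average exon and intron length over a {transcript_id: exons} mapping."""
    all_exons, all_introns = [], []
    for exons in transcripts.values():
        exon_lengths, intron_lengths = exon_intron_lengths(exons)
        all_exons.extend(exon_lengths)
        all_introns.extend(intron_lengths)
    mean_exon = sum(all_exons) / len(all_exons) if all_exons else 0.0
    mean_intron = sum(all_introns) / len(all_introns) if all_introns else 0.0
    return mean_exon, mean_intron
```

Single-exon transcripts contribute to the exon average only, since they have no introns.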

The tRNAs were predicted using the tRNAscan-SE50 program (http://lowelab.ucsc.edu/tRNAscan-SE/). As rRNAs are highly conserved, we selected reference rRNA sequences from closely related species and used BLAST to predict rRNA sequences. Additionally, other ncRNAs, such as miRNAs and snRNAs, were identified by searching against the Rfam51 database using Infernal v1.134 with the default parameters. We annotated 35,856 coding genes (Table 4) and 3,276 noncoding genes, including 1,401 miRNAs, 655 tRNAs, 425 rRNAs, and 795 snRNAs (Table 5).

Table 4 Statistics of gene structure prediction.
Table 5 Statistics of noncoding genes.

Gene functions were assigned by aligning the protein sequences to Swiss-Prot52 using Blastp53, with a threshold of E-value ≤ 1e−5, and the best match was retained. Motifs and domains were annotated using InterProScan v5.3154, which involved searching against publicly available databases, including ProDom55, PRINTS56, Pfam57, SMART58, PANTHER59, and PROSITE60. Gene Ontology (GO) IDs were assigned to each gene based on the corresponding InterPro entry. Protein function predictions were made by transferring annotations from the closest BLAST hit (E-value ≤ 1e−5) in the Swiss-Prot database51 and the closest DIAMOND v0.8.2261 hit (E-value ≤ 1e−5) in the NR database. Additionally, we mapped the gene set to KEGG pathways and identified the best match for each gene. The functions of 34,967 genes (97.52%) were predicted (Table 6). Comparative analysis of gene elements among Rosaceae-related species revealed that the genome assembly of Cotoneaster glaucophyllus exhibits a shorter average exon length (229.78 bp) and a longer average intron length (508.51 bp) than those of the other species considered (Fig. 4, Table 7).
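The best-match rule with an E-value cutoff can be sketched as follows for BLAST-style tabular output (the standard 12-column layout with the E-value in column 11 and the bit score in column 12). How ties were broken in the original analysis is not stated; ranking by lowest E-value and then highest bit score is an assumption, and the function name is hypothetical.

```python
def best_hits(tabular_lines, max_evalue=1e-5):
    """Return {query: subject} keeping the best hit per query with E-value <= cutoff."""
    best = {}
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        query, subject = cols[0], cols[1]
        evalue, bitscore = float(cols[10]), float(cols[11])
        if evalue > max_evalue:
            continue
        # Lower E-value wins; a higher bit score breaks ties (assumed ranking).
        rank = (evalue, -bitscore)
        if query not in best or rank < best[query][0]:
            best[query] = (rank, subject)
    return {query: subject for query, (_, subject) in best.items()}
```

Genes without any hit at or below the cutoff are simply absent from the result, which is how unannotated genes would be distinguished from annotated ones.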

Table 6 Summary of gene function annotations.
Fig. 4

Comparative analysis of gene elements among Rosaceae-related species.

Table 7 Comparative analysis of gene elements among Rosaceae-related species.

Data Records

The raw data of Hi-C short reads, Illumina DNA short reads, PacBio DNA long reads, RNA short reads, and PacBio RNA long reads have been deposited in the National Center for Biotechnology Information (NCBI) Sequence Read Archive database with accession numbers SRR2593387962, SRR2593387863, SRR2593387764, SRR2593387665, and SRR2593387566 under BioProject accession number PRJNA1012579. The genome assembly has been deposited at GenBank under the WGS accession JAVVNS00000000067. Additionally, the genome assembly, predicted transcripts and protein sequences, functional annotation files (gff files), and NR and KEGG annotation files have been deposited in Figshare68.

Technical Validation

Multiple parameters were employed to assess the quality of the genome assembly. The BUSCO evaluation indicated that among the 1,440 BUSCO genes, 62.9% (906) were identified as complete and single-copy, while 30.3% (436) were complete but duplicated. Additionally, 1.1% (16) were fragmented, and 5.7% (82) were missing. Analysis of the 248 most highly-conserved Core Eukaryotic Genes (CEGs) revealed the presence of 238 complete genes (95.97%) and 6 incomplete genes (2.42%). The evaluation of the consistency between the assembly and paired-end DNA short reads indicated that the overall mapping and coverage rates were 94.61% and 99.99%, respectively. The rates of homozygous and heterozygous single-nucleotide polymorphisms (SNPs) were 0.001413% (798) and 0.288695% (163,081), respectively. Furthermore, we mapped the DNA continuous long reads (CLRs) to the genome using minimap269, and calculated the sequencing depth and coverage for each pseudochromosome (Table 2). These results collectively demonstrate a genome assembly of high quality, completeness, and accuracy.
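The BUSCO percentages quoted above follow directly from the four category counts, as the short Python check below illustrates. The function name and the output keys are hypothetical; only the counts come from the text.

```python
def busco_summary(single, duplicated, fragmented, missing):
    """Convert BUSCO category counts into percentages rounded to one decimal."""
    total = single + duplicated + fragmented + missing
    counts = {
        "complete": single + duplicated,
        "single_copy": single,
        "duplicated": duplicated,
        "fragmented": fragmented,
        "missing": missing,
    }
    summary = {name: round(100.0 * count / total, 1) for name, count in counts.items()}
    summary["total"] = total
    return summary
```

Calling `busco_summary(906, 436, 16, 82)` gives a total of 1,440 genes with 62.9% single-copy, 30.3% duplicated, 1.1% fragmented, and 5.7% missing, which matches the reported values and implies 93.2% complete genes overall.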