Background & Summary

The prickly nightshade, Solanum rostratum Dunal (Solanales: Solanaceae), is an annual plant and an invasive alien malignant weed that is classified as an “agricultural weed”, an “environmental weed”, and a “noxious weed” in the Global Compendium of Weeds1. In China, it is listed as an entry quarantine pest and a key management alien invasive species. This species has a fast growth rate and strong reproductive ability, with seed production reaching 78,500 seeds per plant2. Its high competitiveness for light, water, nutrients, ecological niche, and other resources results in reduced agricultural land production, loss of native species’ competitive advantage, and decreased biodiversity. In addition, the long, narrow prickles that densely cover the surface of the stem, leaf, calyx, and fruit can become mixed with fodder and injure the oral cavity and gastrointestinal tract of livestock. Moreover, the neurotoxin solanine present in whole plants can cause livestock poisoning3. It is also a host of the Colorado potato beetle Leptinotarsa decemlineata4, which is the most destructive pest of potatoes, the tomato golden mottle virus5, and the tomato severe leaf curl virus6. Thus, the invasion of S. rostratum seriously threatens the local ecological environment, agricultural production, grassland animal husbandry, biodiversity, and human health (Fig. 1).

Fig. 1

Morphological characteristics of Solanum rostratum (a) habits in the grassland, (b) habits in the corn field, (c) infested by Colorado potato beetle in the field, (d) damage to livestock, (e) whole plant, (f) seedling, (g) flower, (h) root, (i) stem, (j) leaf, and (k) fruit.

Its extremely strong ecological adaptability (it can survive in wasteland, grasslands, overgrazed pastures, roadsides, garbage dumps, orchards, courtyards, irrigation ditches, and river beaches)7 and stress resistance (to barren soil, drought, wet conditions, and salt8) enable S. rostratum to spread and establish itself in a new environment as a dominant species. Native to North America9, S. rostratum is widely distributed in 34 countries and regions across North America, Asia, Africa, Europe, and Oceania10. In China, since its first detection in 1981 in Chaoyang City, Liaoning Province11, it has spread to nine provinces and 54 counties within 30 years through water flow, wind, livestock trade, sand transportation, and other vectors.

Alien invasive plants can usually adapt to a new ecological environment and establish and expand populations within a short period12, which can severely harm the local ecosystem. High-quality reference genomes help to identify the genetic basis and variations associated with important traits and with adaptation to different ecological and environmental conditions. Technological advances, including long-read sequencing by Pacific Biosciences (PacBio) or Oxford Nanopore Technologies (ONT), the chromosome conformation capture technique (Hi-C), and BioNano optical maps, have facilitated genome sequencing, assembly, and annotation, leading to the rapid expansion of the quantity and quality of public plant genomes in the past 20 years13,14. For the nightshade family, Solanaceae, which comprises approximately 90 genera and 3,000–4,000 species15, a total of 170 genomes from 46 species in 7 genera have been reported. However, all previous genomic studies have focused on horticultural crops and their related wild species (for example, the cultivated tomato Solanum lycopersicum16 and the wild relative Solanum pimpinellifolium17, potato Solanum tuberosum18, hot pepper Capsicum annuum19, and eggplant Solanum melongena20), model plant organisms (tobacco Nicotiana tabacum21), ornamental flowers (Petunia inflata and Petunia axillaris22), and herbs (Datura stramonium23 and Lycium barbarum24). So far, no genome of a solanaceous malignant weed has been sequenced. Therefore, a chromosome-level reference genome of S. rostratum is an essential resource for further elucidating the pathways and genes involved in adaptation to different environmental stresses, solanine biosynthesis, and the host shift of the Colorado potato beetle from its native host, prickly nightshade, to potato, among other questions, by integrating comparative genomics, functional genomics, metagenomics, and population genomics.

In this study, we constructed and annotated a high-quality chromosome-level reference genome using integrated sequencing data (Fig. 2). We performed an initial de novo assembly into a contig-level genome with Hifiasm25 using PacBio high-fidelity (HiFi) long reads. Valid Illumina Hi-C paired-end reads were used to generate the chromosome-level assembly using the HiC-Pro pipeline26. After repeat sequences were masked, three strategies were integrated by EVidenceModeler (EVM)27 to annotate gene structure: homologous prediction against closely related species, transcriptome-based prediction using transcripts generated from PacBio Isoform-Sequencing (Iso-seq) long reads and Illumina paired-end RNA-seq short reads with the Program to Assemble Spliced Alignments (PASA) pipeline28, and ab initio prediction based on the characteristics of the genomic sequence data. After the functions and protein domains of the protein-coding genes were annotated against the relevant databases, the completeness and quality of the genome assembly and annotation were evaluated by Benchmarking Universal Single-Copy Orthologs (BUSCO)29 analysis and by the mapping and coverage rates of Illumina paired-end short reads. These results indicate that the present genome assembly and annotation are contiguous and accurate. Furthermore, comparative genomic analysis was conducted with nineteen other solanaceous species to provide insight into their phylogenetic relationships, divergence times, whole-genome duplication (WGD) events during solanaceous speciation, and genomic evolutionary history. Thus, the present S. rostratum genomic resource will be a foundation for subsequent research on this weed.

Fig. 2

Genomic landscape of the chromosome-scale assembly of Solanum rostratum. The Circos plot, from the outer to the inner layers, represents the following: (1) lengths of the 12 pseudo-chromosomes at the Mb scale; (2) GC content per Mb; (3) repeat density per Mb; (4) Copia (blue) and Gypsy (purple) LTR retroelement density per Mb; (5) gene density per Mb; and (6) center: intra-genomic syntenic blocks of S. rostratum.

Methods

Plant material collection and preparation

Healthy mature plants of S. rostratum were collected from the wasteland in Chaoyang City, Liaoning Province, China (120.504360° E, 41.604752° N) in August 2021. After washing with deionized water, the roots, stems, leaves, flowers, and fruits were harvested. All the tissues were put into liquid nitrogen immediately and preserved in an ultra-low temperature freezer until use.

DNA library construction and genome sequencing

High molecular weight genomic DNA was extracted from tender leaves with a modified 2 × cetyltrimethylammonium bromide (CTAB) method30. Approximately 200 mg of tender leaves were ground to powder in liquid nitrogen and then added to 800 μL of CTAB lysis buffer in a 2.0-mL tube. After incubation at 65 °C for 60 min, 800 μL of phenol/chloroform/isopentanol (25:24:1) was added, and the tube was centrifuged at 12,000 rpm for 10 min. The supernatant was transferred to another 2.0-mL tube containing an equal volume of chloroform/isopentanol (24:1). After mixing by gentle inversion, the tube was centrifuged at 12,000 rpm for 10 min. The supernatant was transferred to another tube containing 0.6 volumes of precooled (−20 °C) isopropanol. After being held at −20 °C for over 2 h, the tube was centrifuged at 12,000 rpm for 10 min. The pellet was washed twice with 75% ethanol and dissolved in 50 µL of DNase- and RNase-free water for further study.

For Illumina whole-genome shotgun sequencing, the genomic DNA was randomly fragmented, and a library with an average insert size of 350 bp was constructed using the Illumina TruSeq Nano DNA Library Prep Kit (Illumina, USA) following the manufacturer’s instructions. The library was sequenced on the NovaSeq 6000 platform in PE150 mode, generating a total of 103.47 Gb of raw data. After filtering with fastp v0.12.431 under default parameters to remove low-quality and short reads and to trim adapters and polyG tails, 102.14 Gb (113.69×) of clean data were retained for genome size estimation (Table 1).
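As a quick check on the depth figure above, sequencing depth can be approximated as the clean data volume divided by the genome size. The sketch below is illustrative only: the function name is hypothetical, and it assumes that the 898.42 Mb assembly size reported in this paper is the denominator, which the text does not state explicitly.

```python
def sequencing_depth(data_gb: float, genome_mb: float) -> float:
    """Approximate fold-coverage: total bases sequenced / genome size."""
    if genome_mb <= 0:
        raise ValueError("genome size must be positive")
    return (data_gb * 1000.0) / genome_mb  # convert Gb to Mb before dividing


# Assumption: the 898.42 Mb assembly size is used as the genome size.
illumina_depth = sequencing_depth(102.14, 898.42)
print(f"Illumina clean data depth: {illumina_depth:.2f}x")
```

Under this assumption the result rounds to the 113.69× reported for the Illumina clean data.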

Table 1 Statistics of sequencing data for Solanum rostratum genome assembly and annotation.

The PacBio Sequel II System, based on single-molecule real-time (SMRT) sequencing technology in Circular Consensus Sequencing (CCS) mode, was used for whole-genome sequencing. The DNA template was sheared with a g-TUBE (Covaris, USA) to an average size of 15–20 kb, and the target DNA fragments were obtained using the BluePippin™ Size-Selection System (Sage Science, USA). The library was constructed using the SMRTbell Template Prep Kit 1.0 (Pacific Biosciences, USA) following the manufacturer’s procedure and loaded onto the PacBio Sequel™ System for sequencing. Finally, approximately 366.02 Gb of subreads were obtained with an average length of 13.59 kb and an N50 length of 15.25 kb after removing adaptors from polymerase reads (Table 1).

RNA library construction and transcriptome sequencing

Total RNA was isolated separately from the roots, stems, leaves, flowers, and fruits using the standard TRIzol protocol (Invitrogen, USA)32. Approximately 100 mg of tissue was ground to powder in liquid nitrogen, and then 1000 μL of TRIzol was added in a 2.0-mL tube. After the solution had stood for approximately 5 min, 200 μL of chloroform was added, and the tube was shaken vigorously for 30 s and allowed to stand for 3 min. After centrifugation at 12,000 rpm for 15 min at 4 °C, the upper aqueous phase was transferred to a 1.5-mL tube containing 500 μL of isopropanol and mixed by gentle inversion. After standing for approximately 10 min, the tube was centrifuged at 12,000 rpm for 10 min. The supernatant was removed, and the pellet was washed twice with 75% ethanol and dissolved in 50 µL of DNase- and RNase-free water for further study.

For Illumina paired-end sequencing, the mRNA was reverse-transcribed to cDNA, and five libraries were constructed with an insert size of 350 bp using a TruSeq RNA library preparation kit (Illumina, USA) following the manufacturer’s instructions. Sequencing was performed on the NovaSeq 6000 platform in PE150 mode. In total, 32.91 Gb of clean data were generated from the RNA-seq libraries after filtering using fastp31 (Table 1).

For Iso-seq in CCS mode, the RNA samples extracted from roots, stems, leaves, flowers, and fruits were mixed in equal amounts for sequencing. cDNA was synthesized using a Clontech SMARTer PCR cDNA Synthesis Kit (Takara Biotechnology, China). Then, the SMRTbell library (cDNA length over 4 kb) was constructed using the Pacific Biosciences SMRTbell template prep kit (Pacific Biosciences, USA) and sequenced on the Pacific Biosciences Sequel II platform. A total of 19.81 Gb of subreads were obtained with an average length of 2,562 bp and an N50 length of 3,005 bp after removing adaptors from polymerase reads (Table 1). The exported subreads were analyzed using packages of SMRT Link v10.1, including highly accurate consensus sequence calling using ccs v6.0.0 (https://github.com/PacificBiosciences/ccs), primer removal and demultiplexing using lima v2.1.0 (https://github.com/pacificbiosciences/barcoding/), removal of polyA tails and artificial concatemers using isoseq3 v3.4.0 (https://github.com/PacificBiosciences/IsoSeq), and clustering and polishing using isoseq3 v3.4.0. Finally, approximately 387.83 Mb of high-quality consensus isoform sequences were generated with an average length of 3,843 bp.

Contig-level genome assembly

The built-in High-Quality Region Finder (HQRF) was used to identify the longest high-quality region of each exported subread according to the signal-to-noise ratio (SNR). HiFi reads were then generated from the filtered subreads using the CCS mode of SMRT Link v10.1 with the following parameters: --maxLength = 50000, --minPasses = 3, and --minPredictedAccuracy = 0.99. The BAM file was converted to fastq.gz format using bam2fastx v1.3.1 (https://github.com/pacificbiosciences/bam2fastx/). In total, 25.83 Gb (28.75×) of CCS reads were obtained with an average length of 15.34 kb and an N50 length of 15.78 kb (Table 1). Then, Hifiasm v0.16.025 was used to assemble the genome into contigs with default parameters. To check for potential contaminant sequences, the assembled contigs were classified using Kraken2 against a custom database33. Four contigs were identified as bacterial (904,041 bp, 0.10%) and were flagged and removed from the final assembly. After removal, the final contig-level assembly was submitted to the independent NCBI contamination check to confirm the result. The final contig-level genome was 898.42 Mb in size, consisting of 113 contigs with an N50 length of 62.00 Mb (Table 2).
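The N50 values quoted for reads and contigs follow the usual definition: the length L such that sequences of length L or longer together contain at least half of all bases. A minimal sketch of that calculation is given below. The function name and the toy lengths are hypothetical and are not taken from the actual assembly.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total length."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy contig lengths (hypothetical, in Mb), not the real assembly.
toy_contigs = [10, 8, 6, 4, 2]
print(n50(toy_contigs))
```

Sorting from longest to shortest and stopping at the first sequence that pushes the running sum past half of the total keeps the calculation linear after the sort.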

Table 2 Statistics of the Solanum rostratum genome assembly.

Hi-C library construction and pseudo-chromosome anchoring

Tender leaves were cut into pieces of approximately 2 cm² for cellular protein cross-linking in 2% formaldehyde. The isolated DNA was purified, digested with the DpnII restriction enzyme, tagged with biotin-14-dCTP, sheared into 300–600 bp fragments, and blunt-end-repaired. Then, the Hi-C library was sequenced on the Illumina NovaSeq platform, which generated 100.16 Gb of filtered clean data (113.63×) for anchoring contigs into pseudo-chromosomes (Table 1). The cleaned Hi-C sequencing data were aligned to the contig assembly using bowtie2 v2.2.534 to obtain uniquely mapped paired-end reads with the following parameters: --very-sensitive -L 20 --score-min L,-0.6,-0.2 --end-to-end --reorder --rg-id BMG --phred33-quals -p 5. Quality control of read alignment and pairing was conducted using HiC-Pro v2.7.826 to discard low-quality alignments, singletons, multiple hits, and invalid pairs. A total of 156,223,644 valid paired-end reads were used to build the interaction matrices and scaffold the primary contig assembly into chromosome-scale scaffolds (pseudo-chromosomes). A total of 869.69 Mb of the contig-level assembled sequences (96.80% anchoring rate) were anchored and oriented onto 12 pseudo-chromosomes, consistent with the karyotype (2n = 24) analysis35, with lengths ranging from 63.15 to 92.28 Mb (Table 3). In summary, the pseudo-chromosome-level S. rostratum genome was 869.69 Mb in size, with 212 unanchored contigs (total length 28.73 Mb) and a contig N50 of 72.15 Mb (Table 2). To validate the pseudo-chromosome anchoring result, the pseudo-chromosomes were divided into equal-sized 50-kb bins to construct genome-wide interaction matrices based on the interaction signals between each pair of bins. The interaction matrix heatmap was visualized using HiCPlotter v0.6.636 (Fig. 3).
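The binning step used for validation can be pictured as counting valid read pairs per pair of 50-kb bins. The sketch below is a simplified illustration for a single pseudo-chromosome and is not the HiC-Pro implementation. The function name, the 0-based position convention, and the toy read pairs are all assumptions made for the example.

```python
def contact_matrix(pairs, chrom_length, bin_size=50_000):
    """Count read pairs per (bin_i, bin_j) on a single pseudo-chromosome.

    pairs: iterable of (pos1, pos2), the 0-based mapping positions of the
    two mates of a valid Hi-C pair. Returns a symmetric list-of-lists matrix.
    """
    n_bins = (chrom_length + bin_size - 1) // bin_size  # ceiling division
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for pos1, pos2 in pairs:
        i, j = pos1 // bin_size, pos2 // bin_size
        matrix[i][j] += 1
        if i != j:
            matrix[j][i] += 1  # keep the matrix symmetric
    return matrix


# Hypothetical toy input: a 200 kb chromosome and four valid pairs.
toy_pairs = [(1_000, 40_000), (10_000, 120_000),
             (60_000, 130_000), (199_999, 5_000)]
m = contact_matrix(toy_pairs, chrom_length=200_000)
for row in m:
    print(row)
```

In a correctly anchored assembly, counts are expected to be highest near the diagonal of such a matrix and to decay with distance, which is the pattern a heat map of the matrix makes visible.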

Table 3 Statistics of Solanum rostratum genome assembly result by Hi-C.
Fig. 3

Heat map of genome-wide Hi-C intra-chromosome interactions in Solanum rostratum. The interaction density is measured by the number of supporting Hi-C reads and illustrated by the color bar from dark red (high density) to light pink (low density).

Genome annotation and functional prediction

Identifying repeat sequences

The repeat sequences in the genome were identified using a combination of homologous sequence prediction and ab initio prediction. For homologous sequence prediction, RepeatMasker v1.32337 and RepeatProteinMask v1.3638 were used to identify sequences homologous to known repeat sequences in the RepBase database39. For ab initio prediction, RepeatModeler open-1.0.840 was used to establish a de novo repeat sequence database, and RepeatMasker v1.32337 was used for prediction. Tandem Repeats Finder (TRF) v4.07b41 was used to find tandem repeat sequences in the genome. Combining these results, 649.21 Mb of repeat sequences were identified, accounting for 72.26% of the S. rostratum genome. The four predominant categories were long terminal repeats (LTR) (accounting for 46.06% of genome size), long interspersed nuclear elements (LINE) (3.62%), DNA elements (3.14%), and short interspersed nuclear elements (SINE) (0.22%) (Table 4).
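The text does not describe how the outputs of the different tools were combined into a single total. One common approach, assumed here purely for illustration, is to take the union of overlapping intervals so that bases reported by more than one tool are counted once. The function name, the coordinate convention, and the toy intervals below are hypothetical.

```python
def merged_length(intervals):
    """Total bases covered by a set of possibly overlapping intervals.

    intervals: iterable of (start, end) with 0-based, half-open coordinates
    on a single sequence.
    """
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Disjoint from the running interval: close it and start anew.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent: extend the running interval.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered


# Hypothetical repeat calls from several tools on a 1,000 bp sequence.
toy_calls = [(0, 100), (50, 150), (300, 400), (400, 450), (900, 950)]
repeat_bp = merged_length(toy_calls)
print(repeat_bp, f"{100 * repeat_bp / 1000:.2f}%")
```

The same ratio applied to the reported totals (649.21 Mb of repeats in an 898.42 Mb assembly) gives the stated 72.26%.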

Table 4 Statistics of repeat elements in the genome of Solanum rostratum.

Identifying non-coding RNA (ncRNA) genes

Rfam42 was used to predict ribosomal RNAs (rRNAs), small nuclear RNAs (snRNAs), and micro RNAs (miRNAs) by comparison with known non-coding RNA libraries. Transfer RNAs (tRNAs) were predicted using tRNAscan-SE v1.3.143. In total, 3,588 ncRNAs were annotated in the S. rostratum genome, including 547 miRNAs, 1,288 tRNAs, 1,110 rRNAs, and 643 snRNAs (Table 5).

Table 5 Statistics for non-coding RNA genes in the genome of Solanum rostratum.

Gene structure prediction

Three strategies were applied to predict gene structure from the repeat-masked genome. The first strategy was homologous prediction. BLAST v2.2.2844 with an E-value cutoff of 1e-5 and GeMoMa v1.645 were used to predict gene structure by comparison with seven closely related species (C. annuum19, Solanum chilense46, Solanum commersonii47, S. lycopersicum16, S. melongena20, Solanum pennellii48, and S. tuberosum18). The second strategy was based on transcriptome data. The filtered Illumina RNA-seq reads from the five libraries were assembled into transcripts using Trinity v2.11.049 with default parameters. Then, the Trinity RNA-seq assemblies and full-length cDNAs were aligned to the soft-masked genome assembly using GMAP v2014-10-250 and BLAT Src3551. Candidate gene structures were extracted with the PASA v2.128 pipeline based on open reading frames (ORFs). The third strategy was ab initio prediction based on the characteristics of the genomic sequence data. Using Augustus v3.352, SNAP v3892653, and GeneMark v4.3354, 29,485, 33,190, and 26,142 protein-coding genes were identified, respectively. Finally, EVM v1.1127 with default weights integrated the results of the three strategies into a non-redundant gene set. Overall, 29,694 protein-coding genes were obtained, with an average gene length of 4,308 bp, CDS length of 1,172 bp, exon length of 237 bp, and intron length of 795 bp (Table 6).
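Summary statistics such as the exon and intron lengths above are derived from the exon coordinates of each gene model. The following sketch shows one way to obtain them for a single transcript. It assumes 1-based inclusive coordinates (as in GFF3), and the function name and toy coordinates are hypothetical.

```python
def exon_intron_lengths(exons):
    """Return (exon_lengths, intron_lengths) for one transcript.

    exons: list of (start, end) with 1-based inclusive coordinates.
    Introns are the gaps between consecutive exons in genomic order.
    """
    ordered = sorted(exons)
    exon_lengths = [end - start + 1 for start, end in ordered]
    intron_lengths = [
        next_start - prev_end - 1
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]
    return exon_lengths, intron_lengths


# Hypothetical three-exon transcript.
toy_exons = [(101, 300), (501, 700), (1001, 1100)]
exon_lens, intron_lens = exon_intron_lengths(toy_exons)
print(exon_lens, intron_lens)
```

Averaging these per-transcript values over all predicted genes would yield figures of the kind reported in Table 6.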

Table 6 Summary of gene structure prediction by three strategies of Solanum rostratum.

Gene function annotation

Functional annotation of the protein-coding genes was carried out via BLAST44, with an E-value cutoff of 1e-5, against public databases, including the Non-redundant protein database (NR) (ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz), the nucleotide sequence database (NT) (https://www.ncbi.nlm.nih.gov/nucleotide/), the SwissProt protein database (SwissProt)55, the Kyoto Encyclopedia of Genes and Genomes (KEGG)56, and Eukaryotic Orthologous Groups of proteins (KOG)57, and using eggNOG-mapper v2.1.0-158. Protein domains were predicted by searching against the Protein Families Database (Pfam)59 using HMMER v3.1b160 with default settings. Gene Ontology (GO)61 terms were obtained based on the corresponding InterPro62 or Pfam59 entry. A total of 28,154 genes (94.81%) were annotated using at least one public database (Table 7).
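The "at least one database" figure is a set union over the per-database hit lists. A minimal sketch of that bookkeeping is shown below. The function name, gene identifiers, and hit lists are hypothetical and serve only to illustrate the calculation.

```python
def annotation_rate(hits_by_database, all_gene_ids):
    """Return (number, percentage) of genes with a hit in >= 1 database."""
    all_genes = set(all_gene_ids)
    annotated = set()
    for gene_ids in hits_by_database.values():
        # Ignore identifiers that are not part of the gene set.
        annotated |= set(gene_ids) & all_genes
    return len(annotated), 100.0 * len(annotated) / len(all_genes)


# Hypothetical toy gene set and per-database hits.
toy_genes = ["g1", "g2", "g3", "g4", "g5"]
toy_hits = {
    "NR": ["g1", "g2", "g3"],
    "SwissProt": ["g2"],
    "Pfam": ["g4"],
}
count, percent = annotation_rate(toy_hits, toy_genes)
print(count, f"{percent:.2f}%")
```

Applied to the reported totals, 28,154 annotated genes out of 29,694 corresponds to the stated 94.81%.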

Table 7 Statistics for the Solanum rostratum functionally annotated protein-coding genes.

Solanaceous orthology identification, phylogenetic tree construction, and divergence time estimation

Twenty solanaceous species were selected for comparative genomic analysis, with Ipomoea trifida as the outgroup. The longest transcript of each gene, extracted using TBtools v1.10663, was used as the gene set for the following analysis. Orthogroups and orthologs were identified using OrthoFinder v2.5.464 with the parameters -S diamond, -M msa, and -T fasttree. As a result, 824,030 genes (93.10% of all genes) were assigned to 56,426 orthogroups among the 21 species, including 7,963 orthogroups shared by all species and 799 shared single-copy orthogroups. Among the 29,694 genes in S. rostratum, 28,514 were clustered into 17,237 orthogroups, with 12,096 genes in single-copy orthologs, 16,418 genes in multiple-copy orthologs, and 1,065 genes in 298 species-specific orthogroups.
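The distinction between orthogroups shared by all species and single-copy orthogroups can be made from a table of gene counts per orthogroup and species. The sketch below illustrates the classification on toy data. The function name, orthogroup identifiers, and counts are hypothetical, and the input format is an assumption rather than the exact OrthoFinder output.

```python
def classify_orthogroups(gene_counts, species):
    """Split orthogroups into those shared by all species and the
    single-copy subset (exactly one gene in every species).

    gene_counts: {orthogroup_id: {species_name: number_of_genes}}
    """
    shared, single_copy = [], []
    for og_id, counts in sorted(gene_counts.items()):
        per_species = [counts.get(sp, 0) for sp in species]
        if all(n >= 1 for n in per_species):
            shared.append(og_id)
            if all(n == 1 for n in per_species):
                single_copy.append(og_id)
    return shared, single_copy


# Hypothetical toy counts for three species (not the real 21-species data).
species = ["Sros", "Slyc", "Smel"]
toy_counts = {
    "OG0001": {"Sros": 1, "Slyc": 1, "Smel": 1},  # single-copy in all
    "OG0002": {"Sros": 2, "Slyc": 1, "Smel": 1},  # shared, multi-copy
    "OG0003": {"Sros": 3},                        # species-specific
    "OG0004": {"Sros": 1, "Slyc": 1},             # missing in Smel
}
shared, single_copy = classify_orthogroups(toy_counts, species)
print(shared, single_copy)
```

Only the single-copy subset is suitable for a concatenated alignment, because each species contributes exactly one sequence per orthogroup.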

A phylogenetic tree was constructed from the concatenated alignment of the 799 single-copy orthogroup genes generated using OrthoFinder64. The maximum-likelihood software raxmlHPC v8.2.1265 was run with the parameters -m PROTGAMMAJTT, -f a, and -# 100. The solanaceous tree recovered the monophyly of 3 subfamilies, 5 tribes, and 6 genera with support values of 100 at all nodes, revealing a sister-group relationship between S. rostratum and S. melongena + S. aethiopicum (Fig. 4a).

Fig. 4

Comparative genomic and evolutionary analysis of solanaceous species. (a) Phylogenetic topology constructed based on shared single-copy genes, and divergence time estimation of solanaceous species with Ipomoea trifida as an outgroup. All nodes have bootstrap support values of 100. The blue bars on the nodes represent the 95% confidence intervals of the divergence times (million years ago, Mya). The scale below represents the geologic time divisions, covering the Cretaceous (K), Paleogene (Pg), Neogene (N), and Quaternary (Q). (b) Whole-genome duplication events revealed by synonymous substitution rate (Ks) analysis. The Ks frequency density distributions of syntenic orthologous or paralogous block pairs within and between the genomes of Solanum rostratum (Sros) and Vitis vinifera (Vvin), Ipomoea trifida (Itri), Solanum lycopersicum (Slyc), and Solanum melongena (Smel) are shown. (c) Whole-genome synteny between S. rostratum and two other closely related Solanum species (S. lycopersicum and S. melongena). Conserved syntenic blocks are highlighted in grey and correspond to the twelve pseudo-chromosomes, indicating that visible genome rearrangements occurred during evolution among Solanum species.

Four-fold degenerate synonymous sites (4DTv) were extracted from the single-copy orthogroup genes to estimate the divergence times among Solanaceae using MCMCTree in PAML v4.10.366 with the following parameters: clock = correlated rates, model = HKY85, alpha = 0.5, burn-in = 100,000, sample frequency = 2, and sample number = 1,000,000. Two calibrations obtained from TimeTree67 were set: the divergence time between Solanaceae and Convolvulaceae (59.1–83.9 million years ago [Mya]) and the divergence time between S. lycopersicum and S. tuberosum (6.1–9.0 Mya). The results revealed that S. rostratum split from the common ancestor ca. 49.26 Mya (Fig. 4a).
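A four-fold degenerate site is a third codon position at which any of the four bases encodes the same amino acid. The sketch below illustrates how such sites could be pulled from a pair of aligned coding sequences under the standard genetic code. It is a simplified illustration, not the authors' extraction script: the function name and toy sequences are hypothetical, and it makes the conservative assumption that both codons must share the same first two bases.

```python
# Codon prefixes (first two bases) whose third position is four-fold
# degenerate in the standard genetic code: Leu (CT), Val (GT), Ser (TC),
# Pro (CC), Thr (AC), Ala (GC), Arg (CG), Gly (GG).
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}


def fourfold_sites(cds_a, cds_b):
    """Return the third-position bases (as two strings) at codons where both
    aligned, in-frame, gap-free coding sequences share a four-fold
    degenerate prefix."""
    if len(cds_a) != len(cds_b) or len(cds_a) % 3 != 0:
        raise ValueError("sequences must be aligned and a multiple of 3 long")
    sites_a, sites_b = [], []
    for i in range(0, len(cds_a), 3):
        codon_a = cds_a[i:i + 3].upper()
        codon_b = cds_b[i:i + 3].upper()
        if codon_a[:2] == codon_b[:2] and codon_a[:2] in FOURFOLD_PREFIXES:
            sites_a.append(codon_a[2])
            sites_b.append(codon_b[2])
    return "".join(sites_a), "".join(sites_b)


# Hypothetical toy alignment: codons GCT/GCC, ATG/ATG, GGA/GGT, CTA/CTG.
toy_a = "GCTATGGGACTA"
toy_b = "GCCATGGGTCTG"
print(fourfold_sites(toy_a, toy_b))
```

Because substitutions at these positions do not change the protein, they are often treated as evolving close to neutrally, which is why they are a common input for divergence time estimation.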

WGD analysis

To investigate the WGD history of Solanaceae, the synonymous substitution rate (Ks) frequency density distributions of syntenic orthologous block pairs between genomes and syntenic paralogous block pairs within genomes were calculated using wgd v1.1.268 for S. rostratum (Sros), Vitis vinifera69 (Vvin), Ipomoea trifida70 (Itri), S. lycopersicum16 (Slyc), and S. melongena20 (Smel). For the one-versus-one ortholog Ks distributions, the module dmd was used to extract orthologs by all-versus-all blastp using the diamond71 algorithm with the parameters --nostrictcds -e 1e-10. The module ksd66 was then used to construct the one-versus-one ortholog Ks distributions. For the whole-paranome Ks distributions, the module dmd was used to extract paralogs and cluster gene families using the Markov cluster (MCL)72 algorithm. Then, the module ksd66 was used to construct the whole-paranome Ks distributions with the parameter -mp 1000. Finally, the module syn was used to identify and extract paralogs in intra-genomic collinear blocks using i-ADHoRe v3.073. A shared peak was detected within Solanaceae at a Ks of approximately 0.68, which occurred after the divergence peak with V. vinifera and before the Solanaceae speciation peak, indicating that an ancient WGD occurred in the ancestor of the Solanaceae. However, no subsequent WGD was detected after species differentiation within the Solanaceae. Within Solanaceae, S. rostratum diverged first from S. lycopersicum at a Ks of 0.14 and then from S. melongena at a Ks of 0.03 (Fig. 4b).

Whole-genome synteny

To understand the extent of genomic rearrangement in S. rostratum during evolution, whole-genome synteny analysis was conducted between S. rostratum (Sros) and S. lycopersicum16 (Slyc), and between S. rostratum and S. melongena20 (Smel). The protein sequences of Sros and Slyc, and of Sros and Smel, were compared using blastp with the parameter -evalue 1e-5. Syntenic blocks were identified by MCScanX74 with the parameter -s 15 (number of genes required to call a collinear block) and visualized by jcvi v1.2.875 with the parameter --minspan=30. The complex pattern of conserved syntenic blocks among the twelve pseudo-chromosomes indicates that visible genome rearrangements occurred during evolution among Solanum species (Fig. 4c).

Data Records

All raw sequencing data have been deposited in the NCBI Sequence Read Archive (SRA) (Table 1) under BioProject number PRJNA932047, including the genomic Illumina sequencing data (SRR23354532)76, genomic PacBio HiFi sequencing data (SRR23354533)77, transcriptome Illumina sequencing data (SRR23354526-SRR23354530)78,79,80,81,82, Hi-C sequencing data (SRR23354531)83, and transcriptome PacBio Sequel II sequencing data (SRR23354525)84.

The final chromosome-level assembled genome sequences were deposited in the NCBI Assembly database under Accession Number JARACL00000000085.

The genome annotation results, including repeated sequences, gene structure, and functional predictions were deposited in the Figshare database (https://doi.org/10.6084/m9.figshare.22016024)86.

Technical Validation

Evaluation of the quality of genomic DNA and RNA

The purity, concentration, and integrity of the DNA template were determined using a NanoDrop 8000 Spectrophotometer (Thermo Fisher Scientific, USA), Qubit Fluorometers (Thermo Fisher Scientific, USA), and an Agilent 4200 Bioanalyzer (Agilent Technologies, USA), respectively. The DNA met all the criteria required for the 15-kb insert library of the PacBio Sequel sequencing platform: (1) DNA content ≥ 10 μg, (2) DNA concentration ≥ 80 ng/μL, (3) DNA peak size greater than 20 kb (the observed peak size was 32.59 kb), and (4) DNA absorbance ratios of 1.8 ≤ OD260/280 ≤ 2.0 and 1.6 ≤ OD260/230 ≤ 2.5.

The purity, concentration, and integrity of the RNA template were determined using a NanoDrop 8000 Spectrophotometer (Thermo Fisher Scientific, USA), an Agilent 2100 Bioanalyzer (Agilent Technologies, USA), and an Agilent RNA 6000 Nano Kit (Agilent Technologies, USA), respectively. The RNA met all the criteria required for Iso-seq library construction: (1) RNA content ≥ 4 μg, (2) RNA concentration ≥ 250 ng/μL, (3) RNA Integrity Number (RIN) ≥ 6.0, and (4) RNA absorbance ratios of 2.0 ≤ OD260/280 ≤ 2.2 and 1.6 ≤ OD260/230 ≤ 2.1.

Evaluating the completeness and quality of the genome assembly and annotation

Flow cytometry analysis

Flow cytometry analysis87 with a FACSCalibur flow cytometer (BD Biosciences, USA) was conducted to estimate the S. rostratum genome size with three replicates, and ModFit software v5.0 (Verity Software House, USA) was used to analyze the results. The genome size of the internal reference standard Glycine max is 978.4 Mb (1 pg DNA = 0.978 Gb)88. The 2C DNA content in pg of S. rostratum was calculated according to the following formula89: \(S.\;rostratum\;2\mathrm{C}\;\mathrm{DNA}\;\mathrm{content}=\frac{\mathrm{G1}\;\mathrm{peak}\;\mathrm{mean}\;\mathrm{of}\;S.\;rostratum\times G.\;max\;2\mathrm{C}\;\mathrm{DNA}\;\mathrm{content}}{\mathrm{G1}\;\mathrm{peak}\;\mathrm{mean}\;\mathrm{of}\;G.\;max}\). The peak values of G. max in the three replicates were 104.29, 106.54, and 103.70, and the corresponding peak values for S. rostratum were 94.32, 97.10, and 94.28. The genome size of S. rostratum was estimated to be approximately 885.36–892.20 Mb, which was very close to the size of the pseudo-chromosome-level assembly (898.42 Mb).
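The peak-ratio calculation can be reproduced approximately from the values given above. The sketch below applies the same ratio directly to the 978.4 Mb reference genome size instead of to the 2C DNA content in pg, which is an assumption made for simplicity. The function name is hypothetical.

```python
G_MAX_GENOME_MB = 978.4  # reference standard size stated in the text


def estimate_genome_size(sample_peak, reference_peak,
                         reference_size_mb=G_MAX_GENOME_MB):
    """Scale the reference genome size by the ratio of G1 peak means."""
    if reference_peak <= 0:
        raise ValueError("reference peak must be positive")
    return sample_peak / reference_peak * reference_size_mb


# Peak values for the three replicates, as reported in the text.
g_max_peaks = [104.29, 106.54, 103.70]
s_rostratum_peaks = [94.32, 97.10, 94.28]

estimates = [
    estimate_genome_size(sample, reference)
    for sample, reference in zip(s_rostratum_peaks, g_max_peaks)
]
for replicate, size_mb in enumerate(estimates, start=1):
    print(f"replicate {replicate}: {size_mb:.2f} Mb")
```

Under these assumptions the three estimates fall at roughly 885 to 892 Mb. Small differences from the reported range may reflect rounding or the exact reference value used.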

Mapping Illumina data to the genome

Illumina paired-end reads were mapped back to the draft genome using Burrows-Wheeler Aligner (BWA) v0.7.9a90. Then, the depth, mapping rate, and coverage at each position were calculated using samtools v0.1.1991. The results showed that 99.04% of read pairs were mapped to the genome with an average depth of 105.24× and a coverage rate of 97.54%, indicating high single-base concordance.
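Average depth and coverage rate are both summaries of the per-position depth values produced by the mapping step. The sketch below shows the arithmetic on a toy depth vector. It assumes that "coverage rate" means the share of positions covered by at least one read, which the text does not define, and the function name and input values are hypothetical.

```python
def depth_summary(depths):
    """Return (mean_depth, coverage_percent) for per-position read depths.

    coverage_percent is the share of positions with depth >= 1.
    """
    if not depths:
        raise ValueError("need at least one position")
    mean_depth = sum(depths) / len(depths)
    covered = sum(1 for depth in depths if depth > 0)
    return mean_depth, 100.0 * covered / len(depths)


# Hypothetical per-position depths for a 10 bp toy region.
toy_depths = [0, 10, 20, 30, 0, 40, 50, 10, 20, 20]
mean_depth, coverage = depth_summary(toy_depths)
print(f"mean depth {mean_depth:.2f}x, coverage {coverage:.2f}%")
```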

BUSCO assessment

The completeness of the contig-level genome, the Hi-C pseudo-chromosome-level genome, and the predicted gene dataset was further evaluated with BUSCO v5.1.229 (default parameters) based on the ortholog database embryophyta_odb10 (1,614 genes). The results were visualized with the Python script generate_plot.py of BUSCO and showed a high level of completeness, with 99.4%, 99.5%, and 91.3% complete genes found in the contig-level genome, the Hi-C pseudo-chromosome-level genome, and the predicted gene dataset, respectively (Fig. 5).

Fig. 5

Benchmarking of genome completeness of Solanum rostratum genome assembly and annotation, evaluated by BUSCO based on embryophyta_odb10 database which includes 1,614 genes. C: the number of complete genes, S: the number of complete and single-copy genes, D: the number of complete and duplicated genes, F: the number of incomplete genes, M: the number of missing genes.

Comparison of protein-coding genes with closely related species

To determine the prediction accuracy and reliability, the distribution of mRNA length, CDS length, exon length, intron length, and exon number in S. rostratum and other closely related species (C. annuum19, S. chilense46, S. commersonii47, S. lycopersicum16, S. melongena20, S. pennellii48, and S. tuberosum18) were determined. The consistent distribution tendency among all species further supported an ideal annotated gene dataset in S. rostratum (Fig. 6).

Fig. 6

Comparison of the distributions of (a) mRNA length, (b) CDS length, (c) exon length, (d) intron length, and (e) exon number of annotated genes in Solanum rostratum and other closely related species. The x-axis represents the length or number and the y-axis represents the density of genes.

Hence, an S. rostratum genome of high completeness and accuracy was assembled and annotated in the present study.