Background & Summary

Hypericum perforatum L. is an herbaceous perennial plant native to Europe, Asia, and North Africa, which is used in traditional medicine worldwide since ancient times1,2. Extracts of H. perforatum are used in the treatment of various stages of depression3. H. perforatum contains unique secondary metabolites, namely hypericin and hyperforin4,5, which are the two most studied extensively for their pharmacological properties6,7. Although the medicinal application of H. perforatum in the pharmaceutical industry has already reached a multi-billion dollar market, the pathways leading to the biosynthesis of bioactive metabolites are not understood adequately8. On the other hand, this species is considered as a toxic and invasive weed in some parts of the world9,10.

Comparative transcriptome analyzes using next generation sequencing techniques were utilized widely to understand biosynthetic pathways and other aspects of plant biology. Several attempts have been made to sequence and analyze the transcriptome of H. perforatum de novo to predict the unigenes involved in the biosynthesis of secondary metabolites11,12. However, genes implicated in biosynthesis pathway of hypericin and hyperforin are not characterized so far. Although whole genome data of H. perforatum was reported recently13, a complete annotated genome is not yet available in public for any of the Hypericum species.

The advent of single-molecule real-time sequencing (SMRT) has the potential to decipher the complex transcriptomes of plants, animals and microbes14,15,16,17,18,19. The advantage of SMRT technology lies in the ultra-long sequences, which in principle correspond to the full length of transcript sequences subjected to sequencing. In recent days, SMRT sequencing has been widely used to enrich the reference genome in expression studies of plant species or cultivars without a complete reference genome20,21,22,23. SMRT sequencing is also widely used in studies of alternative splicing, sequence repeats, long noncoding RNA, fusion genes, and gene discovery14,20,24. Genes involved in flavonoid and terpenoid biosynthesis were identified using SMRT sequencing in Ginkgo biloba14. Using SMRT technology, the complete sequences of unique genes involved in the synthesis of triterpene saponins were identified in Panax ginseng25. Similarly, full-length sequences of genes representing the enzymes for the berberine biosynthesis pathway in Berberis koreana were identified using SMRT sequencing26.

In the present study, we constructed a reference library for H. perforatum of PacBio Iso-Seq long reads using the SMRT sequencing approach. By aligning Illumina-based RNAseq data from plantlets and cell suspension cultures, we validated the library by obtaining the expression levels of several genes involved in secondary metabolism as these cultures differ greatly in the accumulation of secondary metabolites27. The transcriptome reference library for H. perforatum generated in this study will be a valuable tool as a reference library for comparative transcriptomic analyzes in H. perforatum

Methods

Plant material

In the present study, the H. perforatum cultivar ‘Helos’ (Richters Seeds, Ontario, Canada) was used, which is tetraploid (DNA ploidy, 2n~4x~32 chromosomes) and its reproduction mode is facultative apomixis, as revealed by flow cytometric analysis28,29,30.

Surface sterilized seeds were aseptically germinated and grown seedlings were transferred to Murashige and Skoog’s (MS - Duchefa Biochemie, Haarlem, The Netherlands) liquid medium supplemented with 0.1 mg/L α-naphthaleneacetic acid (NAA - Sigma-Aldrich, USA) and 1 mg/L 6-benzyladenine (BA - Sigma-Aldrich, USA) to establish shoot cultures containing plantlets (Fig. 1A,B). H. perforatum cell suspension cultures were established as reported before1. Cell suspension cultures were maintained in Erlenmeyer flasks (250 ml) containing MS and 0.5 mg/L NAA on an Orbi shaker (Benchmark Scientific, USA) at 110 rpm in a growth chamber with a photoperiod of 16/8 (day/night), an irradiance of 80 µmol m−2 s−1, 70% relative humidity, and a temperature of 25 °C. To keep the cells in the growth phase, 10 ml of the grown culture was subcultured into 70 ml of fresh medium once in 7 days (Fig. 1C,D).

Fig. 1
figure 1

Images of H. perforatum plantlets (A), close-up of a leaf with hypericin glands (B), cell suspension culture (C), and microscopic view of cell suspension culture (D) used to extract RNA for Iso-Seq analysis.

RNA extraction

The biomass of the cell suspension culture and plantlets was freshly harvested and ground into a fine powder in a sterile mortar and pestle with liquid nitrogen. Total RNA was isolated from 100 mg of biomass using the Sigma Spectrum RNA extraction kit (Sigma-Aldrich, USA). To remove any DNA contamination on column DNase treatment (Sigma-Aldrich, USA) was performed to purify the RNA according to the manufacturer’s instructions. RNA quantity was measured using the NanoDrop™ OneC, and RNA integrity was measured using the Agilent Bioanalyzer 2100. RNA samples with an RNA integrity number (RIN) greater than 7 were used for Illumina sequencing, and samples used for Iso-Seq library preparation had an RIN value of at least 8.

High-throughput sequencing

Iso-Seq library preparation and sequencing were performed using the full-length PacBio cDNA libraries and sequencing kit according to the manufacturer’s protocol (Pacific Biosciences of California, Inc., Menlo Park, CA, USA) by the sequencing service provider (Novogene, Beijing, China) on the Sequel system. For RNA sequencing, the mRNA was randomly fragmented and the cDNA was synthesized using random hexamers primers. The second strand was synthesized using custom second-strand synthesis buffer (Illumina), dNTPs, RNase H, and DNA polymerase I. The double-stranded cDNA was enriched by PCR after size selection, and Illumina sequencing was performed in 150 bp paired-read mode (Novogene, Beijing, China).

Construction of a reference library

Processing of raw sequencing data was performed by a standard Iso-seq protocol in SMRT® Link v. 9.0 adhering to the guidelines provided by the manufacturer (https://www.pacb.com/wp-content/uploads/SMRT_Tools_Reference_Guide_v90.pdf), consisting of generating consensus sequences from raw data, removing primers, removing noise, clustering, polishing, and filtering high-quality transcripts. In processing, the parameter “minPasses 1” was set, and only sequences with both 5′ and 3′ adapters and a Poly(A) tail were selected to obtain the set of full-length transcripts. Annotation of transcripts was done in OmicsBox ver. 2.0.10 (https://www.biobam.com/omicsbox) by blast against nr protein database restricted to the data from taxon Magnoliopsida (e-value cut-off 1e-10) and other tools from the functional analysis suite in that software, e.g., Gene Ontology terms and KEGG pathway assignment. Completeness of the transcriptome was assessed in Busco v. 3.0 using lineage data for Eudicotyledons, with default parameters, at www.cyverse.org31.

Data analysis

To validate the library, transcripts expression was assessed based on mRNA Illumina sequencing of 4 samples from cell suspension cultures and 3 samples from plantlets. Quantifiication was done using Salmon v. 0.12.032 in mapping mode with the set of PacBio-based transcripts as reference sequences. Repeat sequences in transcripts were identified using RepeatMasker ver. 4.0.9 (http://www.repeatmasker.org) with RepeatMasker-RepBase33 Sequence Database (species maize, file RMRBSeqs.embl rel. 20181026, www.girinst.org), at public server usegalaxy.org34. Prediction of coding/noncoding sequences was done using the lncFinder ver. 1.1.4 package in R (ver 4.1.0) (using the “wheat” model) and PLEK35 (with parameter minlength = 1). Alternative splicing (AS) was investigated by gapped mapping of Illumina reads in sets of PacBio transcripts using TopHat ver. 2.1.1 (parameters: number of mismatches 1,–no-discordant), and analysis of coordinates of discovered exon junctions (corresponding to introns retained in PacBio transcripts); only AS events supported by data from all biological replications and by coverage of more than 50 Illumina reads in each replication were selected. Statistical characteristics and tests (χ2 test for comparison of frequencies, Mann-Whitney test for comparison of distributions) were computed in Genstat ver. 1936. Statistical graphs were made using GraphPad Prism software version 9 (GraphPad, La Jolla, CA, USA).

Quantitative real-time PCR

To further validate the library, specific fragments of genes putatively involved in hypericin biosynthesis such as polyketide synthase1 (PKS1), polyketide synthase 2 (PKS2), isochorismate synthase 2 (ICS2), chorismate synthase (CHOSYN), polyketide cyclase (PKC), berberine bridge enzymes (BBE7, BBE13, BBE15 and BBE17)5,9 were analyzed with the SYBR Green-based Q-RT-PCR assay using a Lightcycler 480 real-time PCR system (Roche, Switzerland). The assay was performed as in a previous study37. Briefly, each 10 μL Q- RT-PCR reaction consisted of 4.2 μL diluted cDNA template (0.1 μg), 5 μL SensiFAST SYBR No- ROX (Bioline, UK), 0.4 μL forward primer (10 μM), and 0.4 μL reverse primer (10 μM). Amplification was performed under the following conditions: initial denaturation at 95 °C for 5 min, followed by 35 cycles of denaturation at 95 °C for 10 s, annealing at 60 °C for 15 s, and extension at 72 °C for 25 s. Relative expression was calculated using the 2−ΔΔCt method38.

Data Records

Sequence data were deposited into the functional genomics data collection (ArrayExpress) of European Bioinformatics Institute and are available with accession numbers E-MTAB-1142339 and E-MTAB-1132540. Annotation data, protein representation data and Q-RT-PCR assay data were deposited into figshare and made available to public41.

Technical Validation

Transcript characteristics

Long cDNA sequence reads of RNA isolated from cell suspension cultures and plantlets of H. perforatum were obtained using SMRT sequencing technology39,42. The length of the transcripts varied from 72 bp to 9041 bp in the cell suspension culture and 268 bp to 7986 bp in the plantlets (Table 1). In terms of a mutual blast (identity >95%), 83.19% of the transcripts from the cells and 68.21% of the transcripts from the plantlets matched. Thus, 5585 (16.81%) and 17606 (31.79%) of the transcripts were specific to cells and plantlets, respectively.

Table 1 Characteristics of Iso-Seq sequencing data and of obtained high-quality transcripts.

Transcript annotation

From the cell suspension culture and plantlet PacBio data, 97.22 and 98.17% of the transcripts were matched to the nonredundant (nr) protein database, respectively (Table S1). Of all transcripts, 89.37% of transcripts from cell suspension culture and 90.9% of transcripts from plantlets were annotated with known proteins (Table 2). A total of 78297 unigenes from both data sets were annotated with 17254 known proteins41,43.

Table 2 Fractions of blasted transcripts annotated to protein groups.

Proteins were represented by 1 to 320 transcripts. Of all proteins, 3536 (20.49%) were found only in cells and 7465 (43.27%) were found only in plantlets, whereas 6253 (36.24%) proteins were observed in both data sets. The transcripts were annotated with protein sequences from 24 plant species, including Hevea brasiliensis (12,244 annotations), Jatropha curcas (10,494 annotations), Manihot esculenta (8422 annotations), Ricinus communis (7733 annotations), H. perforatum (1736 annotations), etc. (Table 3).

Table 3 Taxonomy of origin of best hits in annotation to proteins (species with number of transcripts >500).

In total, the transcripts were assigned to 6103 GO terms and 1320 enzymes (Table S2). The annotations mainly related to the data sets are shown in Table S3 and Table S4. The BUSCO tool identified 48.7% and 65.2% of the genes in single copy (for the eudicotyledons) in the transcriptomes of the cell suspension cultures and plantlets, respectively (Table S2).

Repeat Sequences in H. perforatum Transcript

The frequency of repeat sequences in transcripts was determined. Sequence data from cell suspension cultures contained 2.3 Mb of repeat sequences in 14619 transcripts. Similarly, plantlet data had 2.1 Mb of repeat sequences in 25729 transcripts (Table 4).

Table 4 Characteristics of repetitiveness in transcripts.

A total of 15833 trinucleotide sequence repeats (SSRs) and 11848 dinucleotide SSRs were identified. TCT, TCC, GAA, CTT, CTC, CCT, and AAG were the most frequently observed trinucleotide SSRs, and four simple repeats, namely TC, GA, CT, and AG, represented 92% dinucleotide SSRs (Fig. 2).

Fig. 2
figure 2

Frequency of trinucleotide and dinucleotide SSR motifs.

Among the identified transposon elements, long interspersed nuclear elements (LINE) and long terminal repeat (LTR) retrotransposons were predominantly represented, with frequencies higher in cell sequences (Table 5). LTR/copia and LTR/gypsy were the highly represented LTR in H. perforatum sequence data. Additionally, other TEs, including hAT, Mutator-like elements (MULE), P instability factor (PIF), and helitron, also were abundant in the data.

Table 5 Repeats with frequency significantly different from expected under the hypothesis of homogeneity between cells and plantlets (Standardized residuals obtained in the chi-square test for equality of frequencies in cell and plantlets).

Coding prediction

The selection of coding/non-coding RNAs was done by two tools: lncFinder and PLEK. Their predictions agreed for 77.08% of transcripts (70.06% as “coding”, 7.02 as “non-coding”) (Table 6).

Table 6 Results of prediction of coding/noncoding RNAs.

Functional Annotation of transcripts

Functional classification of transcripts was carried out using Gene Ontology (GO) terms analysis. In total, 27516 transcripts from cells and 46902 transcripts from plantlets were assigned to GO terms (Fig. 3)41,42. Most dominant subclasses in biological process were metabolic process and cellular process.

Fig. 3
figure 3

Gene Ontology classification of H. perforatum transcripts.

In addition, Kegg’s pathway mapping was performed to identify genes involved in secondary metabolism41,42. Using the enzyme codes, 30900 transcripts from plantlets and 18953 transcripts from cells linked to 150 pathways were predicted after manual curation out of which, 6147 transcripts from plantlets and 3964 transcripts from cells were associated with biosynthesis of secondary metabolites (phenylpropanoid, anthocyanin, flavonoid, isoflavonoid, alkaloid, polyketide, stilbenoid, terpenoid, brassinosteroid etc) pathways (Fig. 4).

Fig. 4
figure 4

KEGG classification of H. perforatum transcripts. A- Metabolism; B- Genetic Information Processing; C- Environmental Information Processing; D- Cellular Processes.

Identification of Alternative Splicing Isoforms

In the current study, 494 alternative splicing events in 376 clusters (transcripts) were identified based on gapped mapping of Illumina reads in transcripts. In this, 130 clusters were in cell suspension culture sequence data, and 245 were in sequence data of plantlets. Insertions (intron retentions) of the size of 100–200 bp were predominantly observed (Fig. 5A). Most of the transcripts with alternative splicing events had one isoform, while a maximum of 7 isoforms was observed in one transcript (Fig. 5B).

Fig. 5
figure 5

Alternative splicing in H. perforatum transcriptome.

Expression data

Illumina sequencing of 7 samples yielded data sets with 45.0–52.4 million and 42.7–48.3 million read pairs for cells and plantlets, respectively. Expression levels (mean across all biological replicates) were calculated and transcripts were classified as “expressed” if the mean number of mapped reads was greater than 10 (Table S5).

A similar expression analysis was performed at the level of groups of transcripts assigned to the same protein, called “protein groups” (Table S5). The expression level for each protein group (in samples representing cells and plantlets) was determined as mean expression over all transcripts in the group. Protein groups were classified as “expressed” if the mean expression was greater than 10

To confirm the quality of the expression data, 9 genes were validated for their differential expression via quantitative real-time PCR (Q-RT-PCR) analysis44 in plantlets and cells. (Fig. 6).

Fig. 6
figure 6

Quantitative Real-time PCR analysis of the PKS = Polyketide synthase, ICS = isochorismate synthase 2, CHOSYN = chorismate synthase, PKC = Polyketide cyclase and BBE = berberine bridge enzymes) in plantlets compared to cell suspension culture.