Background & Summary

Herbal medicine has long been used effectively for disease treatment in East Asia, notably in Korea and China1. These medicines exemplify combination therapy, employing multiple compounds to treat diseases2. Such therapy is effective in multifactorial diseases as it addresses multiple targets3, and offers the advantage of fewer side effects such as drug resistance4. Tonifying herbal medicine (THM), a form of combination therapy, not only targets diseases directly but also aims to activate immunity5. Thus, THMs have the capacity to: (1) treat multifactorial diseases, (2) minimize drug resistance, and (3) manage conditions challenging to treat with conventional drugs by bolstering the immune system. This study therefore proposes a new paradigm in disease treatment.

Despite its numerous benefits, research on THM faces limitations, particularly in accurately identifying the mechanism of action (MOA) due to the involvement of multiple compounds. One solution is drug-induced transcriptome analysis6,7, a method of measuring relative expression levels of mRNA after treating a cell line with a specific drug. This method reveals the MOA of drugs through complex pharmacodynamic processes as reflected in mRNA expression8, providing insights into the therapeutic effects of both single drugs and compound combinations. Furthermore, it facilitates the identification of transcriptomic signatures responsive to these treatments, confirming the therapeutic mechanism of each approach. Therefore, transcriptome analysis, by elucidating the MOA of combination therapies, is pivotal in revealing the therapeutic mechanisms of herbal medicine prescriptions beyond individual herbs.

To uncover the therapeutic MOA of herbal medicines using transcriptome analysis, it is important to consider a variety of factors. One such factor is the choice of cell line9. As the dominantly expressed genes vary across cell lines, analysis across diverse cell lines is essential for confirming the MOA of herbal medicines on multiple targets. Therefore, generating transcriptome data from various cell lines is an efficient strategy that not only validates the therapeutic effects of herbal medicines, but also elucidates their mechanisms of action in various organs. Another critical factor is drug concentration, which can significantly influence the MOA of drugs10. For example, finasteride, a well-known treatment for benign prostatic hyperplasia, demonstrates different effects at varying doses, acting on prostate cancer at higher doses (5 mg/day) and delaying hair loss at lower doses (1 mg/day)11. This variability demonstrates the importance of dose determination in confirming the MOA of herbal medicines. Additionally, the choice of solvent plays a pivotal role7, as natural products comprise both hydrophilic and hydrophobic compounds. Therefore, the fractions of compounds extracted with water and ethanol differ, indicating different mechanisms of action. To reveal the therapeutic mechanisms of herbal medicine, multiple variables must be considered. Transcriptome data produced in consideration of these variables are important in elucidating the MOA of the drug.

In this Data Descriptor, we introduce KORE-Map 1.0 (Korean Medicine Omics Resource Extension-Map), featuring THM-derived transcriptome data available on the NCBI Gene Expression Omnibus (GEO) platform. The data were generated from four THMs commonly used in clinical practice, along with the 10 herbs constituting them. The transcript expression information was derived from both water and ethanol extracts of THMs and herbs, prepared at three different concentrations and applied to four representative human-derived cell lines (A549, HepG2, HT29, and SW1783). The use of the MGIEasy RNA directional library prep kit and the MGISEQ-2000 sequencing system, both widely used worldwide, not only facilitates easy data reuse but also ensures excellent compatibility with other transcriptome datasets. The THM-derived transcriptome data produced in this study could serve multiple purposes, such as aiding in the identification of therapeutic MOAs involving multiple compounds, which is crucial for understanding therapeutic mechanisms in multifactorial or comorbid conditions. In addition, the transcriptome data, spanning various cell lines and concentrations, hold potential for applications in drug repositioning, side effect detection, and more, by enabling the simultaneous evaluation of the effects of multiple compounds on multiple targets and organs.

Methods

Preparation of herbs

Dried medicinal plants, conforming to the Korean Pharmacopoeia standards, were provided by Kwangmyung-dang Medicinal Herbs Co., located in Ulsan, Republic of Korea. These samples underwent an organoleptic examination by Dr. Choi Goya, a herbal medicine organoleptic examination expert appointed by the Korea Food and Drug Administration. The identification to species level was accomplished through DNA barcode region sequencing. Voucher specimens have been stored at the Korean Herbarium of Standard Herbal Resources, within the Herbal Medicine Resources Research Center, at the Korea Institute of Oriental Medicine in Naju, Republic of Korea (Table 1). All herbs and extracts, which were sourced from the Oriental Medicine Resources Research Center (KIOM), are available online at https://oasis.kiom.re.kr/herblib.

Table 1 Herbal Medicine Information and Yields.

Preparation method of hot water and 70% ethanol extracts of herbs and THMs

Hot water and 70% ethanol extracts of each plant were prepared and supplied by KOC Biotech Co., located in Daejeon, Republic of Korea. Initially, dried plants (1,000 g) were pulverized and extracted in 10 L of hot distilled water for 3 h using a reflux extraction system (MS-DM609; MTOPS, Seoul, Republic of Korea), or in 10 L of 70% ethanol for 1 h using an ultrasonication system (VCP-20, Lab Companion, Daejeon, Republic of Korea) twice. The resulting extract solutions were filtered through a 5 µm cartridge filter, concentrated using a rotary evaporator (Ev-1020, SciLab, Seoul, Republic of Korea), and finally lyophilized in a freeze dryer (LP-20, Ilshin-Bio-Base, Dongducheon, Republic of Korea) to produce the final extracts. These extracts were then finely homogenized and packaged in glass bottles with desiccant silica gel. THMs were prepared by blending and homogenizing these extracts in accordance with the composition ratios and extract yields of the individual medicinal herbs, according to the Korean Pharmacopoeia (Table 2). For in vitro applications, extracts (100 mg) were vigorously vortexed for 30 min in 10 mL of phosphate-buffered saline (PBS; Thermo Fisher Scientific, Rockford, IL, USA) containing 2% DMSO. This mixture was then sterilized by filtration through a 0.22 µm membrane to obtain a stock solution (10 mg/mL), which was divided into small aliquots and stored at −80 °C until use.

Table 2 Tonifying Herbal Medicine Mixing Information.

Cell culture

All cell lines were purchased from the American Type Culture Collection (Manassas, VA, USA) and were cultured in a basal medium enriched with 10% heat-inactivated fetal bovine serum, 100 IU/mL penicillin, and 100 µg/mL streptomycin, all within a humidified incubator (Table 3). Cell confluence levels between 80–90% prompted the replacement of the growth medium every 3–4 days to maintain optimal growth conditions. To ensure the absence of mycoplasma contamination, the MycoAlert PLUS mycoplasma detection kit (Lonza, Rockland, ME, USA) was employed for regular testing.

Table 3 Information of Cell Lines.

Drug treatment and total RNA preparation for RNA sequencing (RNA-seq) analysis

To determine the appropriate treatment drug concentrations, we performed cell cytotoxicity tests to investigate drug doses that maintained 80% cell viability (IC20s), which were then adopted as the maximal doses for RNA-seq data collection. For drugs whose IC20s could not be determined, the highest treatment concentrations were capped at 500 µg/mL for extracts, considering both their solubility and relevance for clinical application. To confirm the influence of concentration, cells were treated with three different concentrations using 1/5 serial dilutions, thereby exposing them to low, medium, and high doses. Positive control drugs such as wortmannin (Sigma, W1628), LY294002 (Sigma, L9908), and Thioridazine (Sigma, T9025) were incorporated into the assay for comparative analysis against the connectivity map (CMap) data. Cells treated with a 2% DMSO/PBS solution served as the vehicle control. One day before drug administration, cells were seeded into 6-well culture plates with 3 mL of growth medium. Following a 24 h treatment period, the cells were washed with ice-cold PBS, and total RNA was isolated using QIAzol RNA isolation reagents (Thermo Fisher Scientific) in accordance with the manufacturer’s instructions.
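The dose scheme described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `dose_series` and its parameters are hypothetical, and it assumes that the high dose equals the IC20 when one was determined and the 500 µg/mL cap otherwise, with each lower dose obtained by a further fivefold dilution.

```python
def dose_series(ic20_ug_per_ml=None, cap_ug_per_ml=500.0,
                dilution_factor=5.0, n_doses=3):
    """Return [low, medium, high] treatment concentrations in µg/mL.

    Assumption (not stated verbatim in the text): the high dose is the
    IC20 when available, otherwise the cap; lower doses are successive
    1/5 dilutions of the high dose.
    """
    high = cap_ug_per_ml if ic20_ug_per_ml is None else ic20_ug_per_ml
    # high, high/5, high/25 ... then reverse so the list runs low -> high
    doses = [high / dilution_factor ** i for i in range(n_doses)]
    return doses[::-1]


# Extract with no determinable IC20: capped at 500 µg/mL
capped = dose_series()
# Hypothetical extract with an IC20 of 250 µg/mL
with_ic20 = dose_series(ic20_ug_per_ml=250.0)
```

Under these assumptions, an extract without a determinable IC20 would be tested at 20, 100, and 500 µg/mL.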

RNA-seq data generation and preprocessing

Total RNA (over 500 ng) from each sample was processed for the mRNA sequencing library using the MGIEasy RNA directional library prep kit (MGI Tech Co., Ltd., China), following the manufacturer’s instructions. The library concentration was quantified using the QuantiFluor® ssDNA System (Promega Corporation, WI, USA). The prepared DNA nanoball was sequenced on an MGISEQ system (MGI Tech Co., Ltd., China) employing 100 bp paired-end reads. The RNA-seq data quality was assessed using FastQC (v0.11.9). To remove common MGISEQ adapter sequences, TrimGalore (v0.6.6) was utilized. Trimmed reads were then mapped to the human reference genome assembly GRCh38 (hg38) using the STAR aligner (v2.7.3a) with default settings12. Gene transcript abundance, including expected read counts and transcripts per million (TPM), was quantified using RSEM (v1.3.3), with the gene annotation GRCh38.84 (ref. 13). The raw sequence data (FASTQ files) and the preprocessed expression values for each gene have been deposited in the GEO under accession numbers GSE244687, GSE244707, GSE244694, and GSE245912.

Differential gene expression analysis

Using the gene symbols of protein-coding genes, we utilized the collapseRows function from the WGCNA package (v.1.72-1)14, specifically designed to merge expression data for genes represented by multiple probes. This approach effectively reduces redundancy and potential noise, enhancing the clarity of subsequent analyses. Additionally, the filterByExpr function from the genefilter package (v.1.78.0)15 was utilized to exclude genes that failed to meet predetermined expression criteria across samples. This filtering ensured that only genes most likely to provide reliable and relevant signals were retained for analysis.

For evaluation of each set of treatment conditions (encompassing four cell lines, 14 herbs and THMs, two extraction methods, and three concentration levels), we conducted differential gene expression (DGE) analysis against the corresponding control samples. This analysis was performed using the Wald test statistic as implemented in the DESeq2 package (v.1.36.0)16. Differentially expressed genes (DEGs) were determined based on a fold-change threshold of 1.5 and an adjusted P-value of less than 0.05.
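The DEG criteria can be illustrated with a short sketch. It is not the authors' pipeline (which used DESeq2 in R); it only shows how the two thresholds might be applied to a table of results. The function name `call_degs`, the input layout, and the gene names are hypothetical, and the sketch assumes the fold-change threshold is applied symmetrically on the log2 scale (|log2FC| ≥ log2 1.5).

```python
import math


def call_degs(results, fc_threshold=1.5, padj_cutoff=0.05):
    """Split genes into up- and down-regulated DEGs.

    results: dict mapping gene -> (log2_fold_change, adjusted_p).
    adjusted_p may be None, mirroring NA values in DESeq2 output.
    Assumption: the fold-change threshold applies in both directions.
    """
    min_abs_lfc = math.log2(fc_threshold)
    up, down = [], []
    for gene, (lfc, padj) in results.items():
        if padj is None or padj >= padj_cutoff:
            continue  # not significant, or not testable
        if lfc >= min_abs_lfc:
            up.append(gene)
        elif lfc <= -min_abs_lfc:
            down.append(gene)
    return sorted(up), sorted(down)


# Hypothetical results for one treatment condition
example = {
    "GENE_A": (1.2, 0.001),    # up, significant
    "GENE_B": (-0.9, 0.01),    # down, significant
    "GENE_C": (0.3, 0.0001),   # significant but below the fold-change threshold
    "GENE_D": (2.0, 0.2),      # large change but not significant
    "GENE_E": (-1.5, None),    # no adjusted P-value
}
up_genes, down_genes = call_degs(example)

# Number of treatment conditions implied by the design described above
n_conditions = 4 * 14 * 2 * 3
```

The design described above implies 4 × 14 × 2 × 3 = 336 treatment conditions, each compared against its corresponding control.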

Clustering analysis

The fold-change values derived from the DGE analysis across all treatment conditions were clustered using the t-distributed stochastic neighbor embedding (t-SNE) algorithm. This machine learning technique, designed for dimensionality reduction, excels in visualizing high-dimensional datasets, making it a valuable tool for interpreting complex gene expression patterns. The analysis was conducted utilizing the Rtsne package (v.0.16), with the perplexity parameter set to 10 (ref. 17).

Comparisons with connectivity map transcriptome data

Connectivity Map data were obtained from the Clue.io platform (clue.io/data/CMap2020#LINCS2020). For our analysis, we selected level 5 gene expression signatures with high reproducibility, defined by moderated z-scores that met specific criteria (distil_cc_q75 > 0.5 and pct_self_rank_q25 > 0.05), to compare with our RNA-seq data. The R package cmapR (v1.8.0) was used to manipulate the level 5 GCTX file (level5_beta_trt_cp_n720216x12328.gctx). Because our RNA-seq data and the L1000 assays9 used in CMap rely on different gene expression profiling methods, their expression values follow distinct distributions, which makes a direct comparison of gene expression values difficult. To navigate this, we employed gene set enrichment analysis (GSEA) as an alternative method to explore the genome-wide perturbing effects of treatments such as wortmannin at the pathway level18. We utilized 2,229 gene sets from several databases (Hallmark, Biocarta, KEGG, REACTOME, PID, and Wikipathways) available through MSigDB (https://www.gsea-msigdb.org/gsea/msigdb). The analysis involved performing GSEA on all genes, ranked according to their Wald test statistics or level 5 z-scores. To obtain the MSigDB gene sets and conduct GSEA, we utilized the R packages msigdbr (v7.5.1) and fgsea (v3.18). From the GSEA results, we defined the pathway activity score (PAC) as ‘sign(enrichment score) × −log10(p-value)’ to quantify the significance level. PAC vectors of equal lengths (n = 2,229) were generated for both our dataset and the CMap dataset. Subsequently, we determined the Pearson correlation coefficient to assess the relationship between the PACs from our samples and those from CMap (Fig. 1).
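The PAC definition and the subsequent correlation step can be sketched as below. This is a minimal illustration using only the Python standard library, not the authors' R code; the function names and the three-pathway example values are hypothetical, and p-values are assumed to be strictly greater than zero.

```python
import math


def pathway_activity_score(enrichment_score, p_value):
    """PAC = sign(enrichment score) * -log10(p-value); requires p_value > 0."""
    sign = (enrichment_score > 0) - (enrichment_score < 0)
    return sign * -math.log10(p_value)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have equal length of at least 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    denom = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denom == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return sum(a * b for a, b in zip(dx, dy)) / denom


# Hypothetical (enrichment score, p-value) pairs for the same three pathways,
# once from an RNA-seq signature and once from a CMap signature.
ours = [pathway_activity_score(es, p)
        for es, p in [(0.6, 0.001), (-0.4, 0.01), (0.1, 0.5)]]
cmap = [pathway_activity_score(es, p)
        for es, p in [(0.5, 0.01), (-0.7, 0.001), (0.2, 0.3)]]
similarity = pearson(ours, cmap)
```

In the study, each vector would hold 2,229 entries (one per gene set) in the same pathway order for both datasets.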

Fig. 1

Introduction to transcriptomic data production protocols and herbal drugs used. (A) Overview of standard operating procedures for producing standardized transcriptome data. (B) List of herbal drugs processed for transcriptome data production.

Data Records

All raw transcriptome data were uploaded to GEO in FASTQ format as paired-end sequencing files, with each sample provided as two fq.gz files. Essential details such as the production method, cell line, and dosage information were included in the metadata accompanying the GEO submission19,20,21,22 (Table 4). The dataset submitted to GEO comprised 1,092 RNA sequencing samples across 21 batches (Table 5, Supplementary Tables 1–6). Transcript samples were derived from four distinct cell lines: 270 from A549 (accession number GSE244687; ref. 19), 270 from HepG2 (accession number GSE244687; ref. 20), 273 from HT29 (accession number GSE244687; ref. 21), and 279 from SW1783 (accession number GSE244687; ref. 22). The difference in data volume between the HT29 and SW1783 cell lines can be attributed to two factors: (1) a discrepancy in the drugs used as positive controls, and (2) variations in the number of transcript production batches due to technical issues. Wortmannin, known for its anti-inflammatory properties, served as a universal positive control across all cell lines. Further, LY294002 in HT29 cells, and LY294002 and thioridazine in SW1783 cells, served as additional positive controls, contributing three and six samples, respectively. Consequently, six batches were specifically allocated for the SW1783 cell line, with an inclusion of three extra samples.
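As a quick consistency check on the figures quoted above, the per-cell-line sample counts can be tallied as follows. The variable and function names are illustrative only; the numbers are taken directly from the text.

```python
# Sample counts per cell line as stated in the text
samples_per_cell_line = {"A549": 270, "HepG2": 270, "HT29": 273, "SW1783": 279}


def total_samples(counts):
    """Total number of RNA-seq samples across all cell lines."""
    return sum(counts.values())


def extra_over_baseline(counts, baseline="A549"):
    """Samples each cell line has beyond the baseline cell line."""
    return {name: n - counts[baseline] for name, n in counts.items()}


total = total_samples(samples_per_cell_line)
extras = extra_over_baseline(samples_per_cell_line)
```

The total reproduces the 1,092 samples reported, and the surplus of nine samples in SW1783 matches the six additional positive-control samples plus the three extra samples mentioned above.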

Table 4 GEO accession number.
Table 5 Number of transcriptome data samples and batches per cell line.

Technical Validation

RNA quality and integrity

To ensure the suitability of samples for downstream sequencing, RNA quality and integrity were thoroughly evaluated. The optical density at 260 and 280 nm was measured using the Trinean Dropsense™96 micro-volume reader. The A260/A280 ratio serves as an estimate of RNA purity, with values exceeding 1.8 indicating relatively high purity. Our analysis revealed that the RNA samples typically exhibited a ratio close to 1.8, signifying a substantial proportion of pure RNA (Fig. 2a). Furthermore, the 28S/18S rRNA ratio and the RNA integrity number (RIN) were measured using an Agilent Bioanalyzer DNA chip to assess the extent of RNA degradation. All RNA samples demonstrated a 28S/18S ratio approximately equal to 2 and an RIN value of 7 or above, reflecting high RNA quality and integrity (Fig. 2b). These results suggest that the RNA is of suitable quality for downstream RNA sequencing23.
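The two numeric thresholds mentioned above can be expressed as a simple filter. This is only a sketch under stated assumptions: the function name and sample values are hypothetical, and treating the 1.8 purity threshold as inclusive is an assumption, since the text reports that samples were typically close to 1.8.

```python
def passes_rna_qc(a260_a280, rin, min_purity_ratio=1.8, min_rin=7.0):
    """Return True when a sample meets both the purity and integrity thresholds.

    Assumption: both thresholds are inclusive (ratio >= 1.8, RIN >= 7).
    """
    return a260_a280 >= min_purity_ratio and rin >= min_rin


# Hypothetical measurements: (A260/A280 ratio, RIN)
samples = {
    "sample_1": (1.95, 9.2),
    "sample_2": (1.60, 9.0),   # fails purity
    "sample_3": (1.90, 6.5),   # fails integrity
}
qc_passed = {name: passes_rna_qc(ratio, rin)
             for name, (ratio, rin) in samples.items()}
```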

Fig. 2

Quality assessment of RNA samples. (A) The A260/A280 ratio for individual samples grouped by four cell lines. The dotted line represents the minimum value widely recognized as indicative of high-purity RNA. (B) The 28S/18S rRNA ratio (left) and RNA integrity number (right) for individual samples grouped by four cell lines. Each dotted line represents a minimum value widely known to reflect high RNA quality and integrity.

Quality of RNA-seq data

The quality of the raw RNA-seq data was assessed using FastQC (v0.11.9), a software that generates a detailed report, including metrics such as per-base quality scores and GC content distribution. A representative FastQC report indicated that the overall read quality was high (Fig. 3a). Similar quality metrics were observed in all other FastQC reports, qualifying them for further analysis. To obtain clean data, adapter sequences and low-quality bases (Phred score below 20) were removed using TrimGalore (v0.6.6). As a result, a high percentage of reads, with a median of 96.67%, were successfully and uniquely mapped to the human reference genome GRCh38 (hg38) (Fig. 3b)24.

Fig. 3

Quality evaluation of RNA sequencing data. (A) Representative FastQC report showing per sequence quality scores (left) and GC content (right) for A549 cell line treated with dimethyl sulfoxide (DMSO). (B) Summary of unmapped, multiple-mapped, and uniquely mapped reads against the reference genome for each cell line.

Biological and technical reproducibility

To ensure the reproducibility of our RNA-seq data, we quantified biological and technical batch effects by analyzing expression levels (TPM values for 19,826 protein-coding genes). Initially, the biological reproducibility was assessed through the analysis of three independent biological replicates for each treatment condition (cell line, treatment, and dose). Each replicate involved separate RNA extraction, RNA-seq library preparation, and sequencing processes. We calculated the pairwise Pearson’s correlation coefficient among replicates to quantify their similarities. This revealed a high degree of biological reproducibility, with an average correlation coefficient of 0.994 across all conditions. Furthermore, 97.8% of the conditions exhibited an average expression level correlation exceeding 0.95 across the three replicates (Fig. 4a).
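The replicate comparison described above can be sketched as the mean of all pairwise Pearson correlations within one treatment condition. This is an illustrative standard-library implementation, not the authors' code; the function names and toy TPM values are hypothetical, and the sketch correlates the values as given because the text does not state whether TPM values were transformed beforehand.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have equal length of at least 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    denom = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denom == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return sum(a * b for a, b in zip(dx, dy)) / denom


def mean_pairwise_correlation(replicates):
    """Average Pearson correlation over all pairs of replicate profiles."""
    pairs = list(combinations(replicates, 2))
    if not pairs:
        raise ValueError("at least two replicates are required")
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


# Hypothetical TPM values for four genes in three biological replicates
replicate_tpm = [
    [10.0, 200.0, 35.0, 0.5],
    [11.0, 190.0, 36.0, 0.4],
    [9.0, 210.0, 33.0, 0.6],
]
condition_score = mean_pairwise_correlation(replicate_tpm)
reproducible = condition_score > 0.95  # threshold quoted in the text
```

With three replicates, the average is taken over three pairs; in the study, each profile would contain the 19,826 protein-coding genes.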

Fig. 4

Replicability of RNA-seq profiles. (A) Distribution of Pearson’s correlation coefficients for replicates (yellow) versus different samples (gray). (B) Heatmap of Pearson’s correlation coefficients among replicate samples across various sequencing batches.

Technical reproducibility was subsequently evaluated to address potential batch effects arising from sequencing. Since a single sequencing lane can accommodate up to 60 samples, we distributed samples from the same cell line across six different sequencing batches (A to F). Control samples treated with the vehicle were included and sequenced in all six batches, to assess batch effects. The correlation coefficients between control samples from different batches were calculated, indicating minimal batch effects. Notably, all control samples exhibited high correlation coefficients (>0.99) with samples sequenced in different batches (Fig. 4b).

Comparisons with external drug-induced transcriptome data

To verify the reliability and reproducibility of our RNA-seq data, we compared our drug-induced transcriptome profiles to those generated by the CMap (ref. 9), a comprehensive database featuring gene expression profiles of human cell lines treated with various bioactive compounds. We chose wortmannin, an established positive control that is also included in the CMap dataset, as a benchmark for our analyses across three cell lines: A549, HepG2, and HT29.

To facilitate direct comparison between transcriptome data generated from different platforms, we converted gene-level expression values to pathway-level scores. This approach aggregates the expression changes across genes within 2,229 well-defined biological pathways, providing a more robust and interpretable measure of pathway activation or inhibition.

We then compared the pathway-level scores resulting from wortmannin treatment in our study with those generated by CMap. The pathway-level scores from our wortmannin treatment analysis showed a high correlation with those obtained from CMap (Fig. 5). This notable concordance serves as strong evidence of the reliability and reproducibility of our RNA-seq data, affirming its ability to capture drug-induced changes in cellular pathways.

Fig. 5

Transcriptome data comparisons with CMap. Distribution of Pearson’s correlation coefficients for samples under the same treatment condition (yellow) versus different conditions (gray).