Introduction

Colorectal liver metastasis (CRLM) is the most prevalent form of metastasis in colorectal cancer and remains a central challenge in its clinical treatment1. Tumor metastasis is the discontinuous implantation of a primary tumor that unequivocally marks it as malignant2. However, the process of tumor metastasis and the reasons why tumors preferentially metastasize to certain organs remain poorly understood3. An in-depth understanding of why colon cancer tends to metastasize to the liver might aid the treatment of colon cancer liver metastasis. In general, the origin of metastases is determined from their pathological morphology and immunohistochemical features; for instance, CDX2, an intestine-specific gene, is used as a biomarker for metastatic cancer originating from the intestines4,5. Recently, a study reported the presence of some hepatocyte gene programs and the absence of colon gene programs in tumor cells from liver metastases of colon cancer, which conflicts with the general understanding that metastatic tumor cells retain the characteristics of their tissue of origin6. Therefore, screening for metastasis-specific genes may further elucidate why colon cancer is prone to liver metastasis and provide a research direction for future therapies that may block liver metastasis of colon cancer.

Machine learning based on big data has achieved notable results in healthcare7,8. Clinical big data refer to a large volume and variety of data originating from complex sources9,10. Machine learning refers to a discipline of research and algorithms that focuses on finding patterns in data and using those patterns to make predictions11. The combination of clinical big data and machine learning can be highly productive10,12.

In this study, several machine learning algorithms were used to identify prognostic genes with discriminatory efficacy for colorectal liver metastasis and primary tumors, defined as colorectal liver metastasis-specific genes and exemplified by SPP1. Furthermore, we verified the cancer-promoting properties of SPP1 using cytological experiments. The results of enrichment analysis and immune infiltration analysis suggested that the infiltration level of M2 macrophages, which are recognized by scavenger receptors (CD206)13,14, is highly correlated with SPP1 and elicits extracellular matrix (ECM) remodeling15.

Materials and methods

Data collection and screening of colorectal liver metastasis-specific genes

RNA sequencing data and clinical information of colorectal cancer samples were obtained from the Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm.nih.gov/geo/), including four cohorts: GSE41258, GSE49355, GSE68468, and GSE81558. GSE41258 contains 390 samples, including 186 cases of primary colorectal tumors and 47 cases of colorectal liver metastases, and was used as the training set for the present study. Three independent GEO cohorts (GSE49355, GSE68468, and GSE81558) and The Cancer Genome Atlas (TCGA) database (https://portal.gdc.cancer.gov/) were used as validation sets.

Differential expression analysis between primary tumors and liver metastases was performed in R (version 4.2.2) with the ‘limma’ package. The threshold for differentially expressed genes (DEGs) was set at |log2FC| > 1 and p-value < 0.05. We referred to these DEGs as liver metastasis-specific genes. The process flow diagram for this study is shown in Fig. 1.
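The thresholding step can be illustrated with a minimal sketch. It is written in Python rather than with the ‘limma’ R package used in the study, and it only shows the filter applied after a differential expression table has been computed. The `ProbeResult` type, the `filter_degs` helper, and the example values are hypothetical and are not taken from the study data.

```python
from typing import NamedTuple


class ProbeResult(NamedTuple):
    """One row of a differential expression table (hypothetical layout)."""
    gene: str
    log2_fc: float   # log2 fold change, metastasis vs. primary tumor
    p_value: float


def filter_degs(results, lfc_cutoff=1.0, p_cutoff=0.05):
    """Split rows passing |log2FC| > lfc_cutoff and p < p_cutoff into up/down lists."""
    up = [r.gene for r in results
          if r.log2_fc > lfc_cutoff and r.p_value < p_cutoff]
    down = [r.gene for r in results
            if r.log2_fc < -lfc_cutoff and r.p_value < p_cutoff]
    return up, down


# Hypothetical values for illustration only.
example = [
    ProbeResult("GENE_A", 2.3, 0.001),   # passes both cutoffs, upregulated
    ProbeResult("GENE_B", -1.8, 0.01),   # passes both cutoffs, downregulated
    ProbeResult("GENE_C", 0.4, 0.0005),  # fold change too small
    ProbeResult("GENE_D", 1.5, 0.20),    # not significant
]
up_genes, down_genes = filter_degs(example)
```

Both conditions must hold for a gene to be retained, so a large fold change with a non-significant p-value is excluded, as is a significant but small change.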

Figure 1

Flow diagram of the current study. CRC, Colorectal cancer; CRLM, Colorectal liver metastasis; PPI, Protein–protein interaction; ROC, Receiver operating characteristic curve; DCA, Decision curve analysis; TCGA, The Cancer Genome Atlas; DEGs, Differentially expressed genes; EdU, 5-Ethynyl-2’-deoxyuridine.

Functional enrichment analysis and construction of protein–protein interactions (PPI) network

Gene Ontology (GO) analysis summarizes a scattered list of liver metastasis-specific genes into coherent biological themes. We performed GO analysis with the ‘clusterProfiler’ and ‘org.Hs.eg.db’ packages16. Pathways were considered statistically significant at a nominal p-value < 0.05 and an FDR q-value < 0.25.

A PPI network was established using the STRING database (http://string-db.org)17. The liver metastasis-specific genes were submitted to the ‘Multiple proteins’ module of the STRING database for analysis. In addition, we used the ‘cytoHubba’ plugin for hub gene analysis with the ‘Maximal Clique Centrality (MCC)’ and ‘Edge Percolated Component (EPC)’ algorithms.

Machine learning algorithm

Four different machine learning models were applied for the dichotomous classification of liver metastases and primary tumors, including ‘Random Forest (RF),’ ‘Support Vector Machine Recursive Feature Elimination (SVM-RFE)’, ‘Least Absolute Shrinkage and Selection Operator (Lasso)’ and ‘eXtreme Gradient boosting (XGboost).’

The RF model, a supervised machine-learning algorithm based on classification trees proposed by Breiman and Cutler in 2001, improves the prediction accuracy of the model by aggregating a large number of classification trees. We performed the RF algorithm using the ‘randomForest’ R package18 (ntree = 500).

SVM-RFE is a feature selection algorithm based on the embedded method and is widely used in pattern recognition and machine learning. We performed the SVM-RFE algorithm using the ‘e1071,’ ‘kernlab’ and ‘caret’ R packages19.

Lasso is a penalized regression method that shrinks coefficients and thereby simplifies the model to avoid overfitting. The R package ‘glmnet’ was used for the analysis in this study20.

In data modeling, the XGboost algorithm combines hundreds of tree models with low classification accuracy into a predictive model with high accuracy. In our study, the ‘xgboost,’ ‘tidyverse’ and ‘caret’ R packages were used for the XGboost algorithm21.

We then identified the intersecting genes of the four machine learning models as hub genes for colorectal liver metastasis-specific genes.
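The intersection step amounts to a set operation over the genes retained by each model. The sketch below is in Python rather than R and assumes each model's output is available as a plain list of gene symbols; the `intersect_gene_sets` helper and the placeholder gene names are hypothetical.

```python
def intersect_gene_sets(selected_by_model):
    """Return the sorted genes that every model selected."""
    gene_sets = [set(genes) for genes in selected_by_model.values()]
    if not gene_sets:
        return []
    return sorted(set.intersection(*gene_sets))


# Placeholder gene lists for illustration only; not the study's results.
selected = {
    "RF": ["GENE_A", "GENE_B", "GENE_C"],
    "SVM-RFE": ["GENE_B", "GENE_A", "GENE_D"],
    "Lasso": ["GENE_A", "GENE_B", "GENE_E"],
    "XGboost": ["GENE_F", "GENE_B", "GENE_A"],
}
hub_candidates = intersect_gene_sets(selected)
```

Requiring agreement across all four models is a conservative choice: a gene dropped by any single algorithm is excluded, regardless of its rank in the others.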

The discrimination and generalization performance of liver metastasis-specific genes

We performed univariate and multivariate logistic regression analyses to assess the discrimination performance of hub colorectal liver metastasis-specific genes and develop a model. The regression coefficients of the genes were used to calculate linear predictors for each sample. Furthermore, the predicted probability is based on linear predictors calculated using the following formula:

$$ \begin{aligned} Linear\_predictors = & Intercept + \sum_{i = 1}^{n} Expression_{gene\_i} \times Logistic\_coefficient_{gene\_i} \\ Predicted\_probability = & \frac{e^{Linear\_predictors}}{1 + e^{Linear\_predictors}} \\ \end{aligned} $$
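As a worked illustration of the formula, the following Python sketch computes the linear predictor and the predicted probability for one sample. The intercept, coefficients, and expression values are hypothetical placeholders, not the fitted values reported in the supplementary tables.

```python
import math


def linear_predictor(intercept, coefficients, expression):
    """Intercept plus the sum of expression x logistic coefficient over genes."""
    return intercept + sum(coefficients[g] * expression[g] for g in coefficients)


def predicted_probability(lp):
    """Logistic transform; algebraically equal to e^lp / (1 + e^lp)."""
    return 1.0 / (1.0 + math.exp(-lp))


# Hypothetical coefficients and expression values for illustration only.
coefs = {"GENE_A": 1.2, "GENE_B": -0.8}
sample = {"GENE_A": 2.0, "GENE_B": 3.0}
lp = linear_predictor(0.5, coefs, sample)   # 0.5 + 2.4 - 2.4
prob = predicted_probability(lp)
```

A linear predictor of 0 corresponds to a probability of 0.5, so positive values favor classification as liver metastasis and negative values favor primary tumor.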

Model correlation tests were performed in R based on the ‘rms’ and ‘ResourceSelection’ R packages. Variance inflation factors (VIF) were calculated to detect multicollinearity in the model. Model fit was assessed using a likelihood ratio test and calibration curve. C-index and receiver operating characteristic (ROC) curves were used to evaluate the predictive power of the model. The decision curve analysis (DCA) curve was used as an indicator of clinical effectiveness. Three independent cohorts, GSE49355, GSE68468, and GSE81558, were used to validate the generalization performance of the model.
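For a binary outcome, the C-index equals the area under the ROC curve, which can be read as the probability that a randomly chosen metastasis sample receives a higher score than a randomly chosen primary tumor sample. The Python sketch below computes it by direct pairwise comparison; it is an illustration of the metric, not the ‘rms’ implementation used in the study, and the scores are hypothetical.

```python
def roc_auc(scores_pos, scores_neg):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical predicted probabilities for illustration only.
metastasis_scores = [0.9, 0.8, 0.4]
primary_scores = [0.7, 0.3, 0.2, 0.1]
auc = roc_auc(metastasis_scores, primary_scores)   # 11 of 12 pairs ranked correctly
```

The pairwise form is quadratic in sample size, which is acceptable for cohorts of a few hundred samples; rank-based formulas give the same value more efficiently.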

Analysis of immune cell infiltration and GSEA enrichment analysis

CIBERSORT and ssGSEA were used to quantify relative immune cell infiltration levels. We performed the CIBERSORT algorithm using the ‘e1071,’ ‘preprocessCore’ and ‘BiocManager’ R packages and the ssGSEA algorithm using the ‘GSVA’ R package. The Wilcoxon rank-sum test and Spearman’s correlation analysis were used to clarify the correlation between the expression of colorectal liver metastasis-specific genes and the level of immune cell infiltration. GSEA was performed to investigate the hallmarks of the SPP1-positive group using the ‘clusterProfiler’ R package16,22.

Cell culture and siRNA transfection

The human colon cancer cell line SW620, provided by Zhejiang Meisen Cell Technology Co., LTD, was cultured in DMEM with 10% FBS (Cegrogen) and 1% penicillin–streptomycin (Gibco). All cells were maintained at 37 °C in a 5% CO2 atmosphere.

For knockdown studies of SPP1, siRNA and GP-transfect-mate were purchased from GenePharma (Shanghai, China). The siRNA sequences used for knockdown are detailed in Supplementary Table 1.

Quantitative real-time PCR (qRT-PCR) was performed to detect the knockdown efficiency using SYBR Green Supermix (TIANGEN Biotech) on an ABI7500 thermocycler. The ΔΔCt method was used to calculate the relative expression levels of the target mRNA. Primers for the internal reference gene and target gene were designed by Sangon Biotech (Shanghai, China) (Supplementary Table 2).
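The ΔΔCt calculation can be sketched as follows. The sketch assumes the common 2^(−ΔΔCt) form with roughly 100% amplification efficiency, which the text does not state explicitly; the function name and Ct values are hypothetical.

```python
def relative_expression(ct_target_treated, ct_ref_treated,
                        ct_target_control, ct_ref_control):
    """Relative target mRNA level by the 2^(-ddCt) method (assumed form)."""
    delta_treated = ct_target_treated - ct_ref_treated    # dCt, knockdown group
    delta_control = ct_target_control - ct_ref_control    # dCt, negative control
    delta_delta_ct = delta_treated - delta_control
    return 2.0 ** (-delta_delta_ct)


# Hypothetical Ct values for illustration only.
fold = relative_expression(
    ct_target_treated=26.0, ct_ref_treated=18.0,   # si-SPP1 group
    ct_target_control=24.0, ct_ref_control=18.0,   # si-NC group
)
```

With these placeholder values the target appears two cycles later after normalization, which corresponds to one quarter of the control expression.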

Migration or invasion assays

For transwell assays, cells were plated in the upper chamber of a transwell (Corning Cat). Cells in the upper chamber were placed in serum-free medium and medium containing 30% FBS was added to the lower chamber. To test invasive ability, Matrigel (LYNJUNE) was evenly spread on the upper chamber surface of the small chamber membrane to form a thin film.

For the wound healing assay, cells were plated in a six-well plate; once the cells had adhered, the monolayer was scratched with a sterile 200 μL pipette tip, and the detached cells were washed away with PBS. Serum-free medium was then added to continue the culture, and the width of the scratch was recorded at 0, 24, and 48 h.

Cell proliferation assays

For the EdU assay, cells were plated in six-well plates and incubated with EdU working solution (APExBIO). After fixation and permeabilization, the EdU reaction complex was added and incubated at room temperature for 30 min, protected from light. Nuclei were counterstained with Hoechst 33342 dye and imaged using fluorescence microscopy.

Cell Counting Kit-8 (CCK-8) was purchased from MedChemExpress. Cells were diluted to a suspension of 2 × 10^3 cells/100 μL, and 100 μL of the cell suspension was added to each well of a 96-well plate, with five replicate wells per group. At 0, 24, 48, and 72 h, 10 μL of CCK-8 was added, the plate was incubated for 1.5 h, and the absorbance at 450 nm was measured using a microplate reader (ThermoFisher).

For the colony formation assay, 500 cells per well were plated in a 6-well plate. The medium was changed every 3–4 days. Fixation and staining were performed after 14 d.

Cell apoptosis assays

For the cell apoptosis assay, flow cytometry (Beckman, FC500) was performed to detect apoptotic cells using an Annexin V-FITC/PI Apoptosis Kit (APExBIO).

Statistical analysis

R was used for all statistical analyses. To analyze the differences between two groups, we first performed normality tests on the data. For samples that passed the normality test, a paired sample t-test was chosen, and for samples that failed the test, the Wilcoxon rank-sum test was chosen. Statistical significance was set at a two-tailed p-value < 0.05. Spearman’s correlation test was used for the co-expression analysis of genes and the immune infiltration correlation analysis.
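Spearman’s correlation is the Pearson correlation of the ranks of two variables, which makes it robust to monotonic but non-linear relationships such as those between gene expression and infiltration scores. The following Python sketch implements the coefficient with average ranks for ties; it does not compute a p-value, and the input values are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical expression and infiltration values for illustration only.
expression = [1.0, 2.0, 3.0, 4.0, 5.0]
infiltration = [0.1, 0.4, 0.3, 0.8, 0.9]
rho = spearman_rho(expression, infiltration)
```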

Informed consent

Informed consent was obtained from all individual participants included in the study.

Results

Screening of colorectal liver metastasis-specific genes

To investigate the factors that contribute to the progression of primary colorectal cancer to metastases, we analyzed the differences in expression profiles between primary tumors and liver metastases in the GSE41258 cohort. We found 148 upregulated and 214 downregulated probes with statistical significance in liver metastases (Fig. 2A). A total of 204 differentially expressed genes (DEGs), covering 91 upregulated genes and 113 downregulated genes, were obtained after removing duplicate values and multiple annotated probes (Fig. 2B). We referred to these DEGs as colorectal liver metastasis-specific genes.

Figure 2

Screening of colorectal liver metastasis-specific genes. (A) Volcano plot of differentially expressed genes (the red dots are 148 up-regulated probes and the blue dots are 214 down-regulated probes). (B) Heat map of 204 differentially expressed genes (DEGs), covering 91 up-regulated genes (upper part) and 113 down-regulated genes (bottom part). GO enrichment analysis on 91 up-regulated (C) and 113 down-regulated liver metastasis-specific genes (D). Protein–protein interaction network was constructed by STRING based on 204 metastasis-specific genes (E), of which hub genes were screened by the ‘cytoHubba’ plugin, involving the ‘MCC’ (F) and ‘EPC’(G) algorithms.

Potential biological function of liver metastasis-specific genes

We performed GO enrichment analysis of 91 upregulated and 113 downregulated liver metastasis-specific genes. The upregulated liver metastasis-specific genes were enriched in ‘Vesicle lumen (GO:0031983)’, ‘Endoplasmic reticulum lumen (GO:0005788)’, and ‘Secretory granule lumen (GO:0034774)’ (Fig. 2C and Supplementary Table 3). In contrast, the downregulated liver metastasis-specific genes were significantly associated with extracellular matrix-related processes, such as ‘Collagen-containing extracellular matrix (GO:0062023)’, ‘Extracellular matrix organization (GO:0030198)’ and ‘Extracellular structure organization (GO:0043062)’ (Fig. 2D and Supplementary Table 4).

Next, we constructed a PPI network for colorectal liver metastasis-specific genes, which contained 190 nodes and 214 edges, with an average node degree of 2.25 (Fig. 2E). The screening of hub genes was performed with the ‘cytoHubba’ plugin using the ‘MCC’ and ‘EPC’ algorithms (Fig. 2F–G). Based on these two scores, APOH, FGG, AMBP, and FGA were selected as the central genes of the PPI network.
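The reported average of 2.25 is consistent with the usual definition of average node degree in an undirected graph, 2E/N, since each edge contributes to the degree of two nodes. A short Python check using the node and edge counts given above (the function name is illustrative):

```python
def average_node_degree(n_nodes, n_edges):
    """Average degree of an undirected graph: each edge touches two nodes."""
    return 2 * n_edges / n_nodes


# Node and edge counts as reported for the PPI network.
avg_degree = average_node_degree(n_nodes=190, n_edges=214)
```

Rounded to two decimals, the result matches the value reported for the network.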

Identification of the hub liver metastasis-specific genes based on machine learning

Four machine-learning algorithms were applied to identify hub liver metastasis-specific genes. The results of the Lasso algorithm suggested that the model fit best at lambda.min, which retained 15 feature genes (Fig. 3A). The genes retained in the Lasso model included SPP1, ZG16, PIGR, and PRKAR2B. For the SVM-RFE algorithm, the root mean square error (RMSE) of the ten-fold cross-validation was minimized when the number of characteristic genes was 19, and the screened genes included CYP2E1, SPP1, ZG16, and PIGR (Fig. 3B). For the RF algorithm, we set the number of trees in the random forest classifier to 500 so that the out-of-bag (OOB) error was stable and less than 0.05 (Fig. 3C). Twelve genes, including SPP1, PRKAR2B, and ZG16, were identified as liver metastasis-specific genes using the RF algorithm. For the XGboost algorithm, the histogram of the importance of feature genes suggested that C9 ranked first, followed by PRKAR2B, CYP2E1, SPP1, and P2RY14 (Fig. 3D).

Figure 3

Identification of the hub liver metastasis-specific genes based on machine learning. (A) The Lasso algorithm determined 15 feature genes based on lambda.min values. (B) The SVM-RFE algorithm determined 19 feature genes based on RMSE. (C) The RF algorithm determined 12 feature genes based on OOB error and MDG. (D) The XGboost algorithm determined 18 feature genes and ranked them by importance. (E) Venn diagram overlapping the results of the four algorithms described above, yielding nine genes. Lasso, Least absolute shrinkage and selection operator; SVM-RFE, Support vector machine-recursive feature elimination; RMSE, Root mean square error; RF, Random forest; OOB, Out of bag; MDG, Mean decrease Gini.

Finally, nine intersecting genes were identified: CYP2E1, PRKAR2B, PIGR, C5, TPSAB1, P2RY14, C9, SPP1, and ZG16 (Fig. 3E and Table 1). To obtain the optimal model, we randomly split GSE41258 into a training set and validation set in a ratio of 7:3, and stepwise logistic regression was performed. Four genes (PRKAR2B, P2RY14, SPP1, and ZG16) were identified as hub colorectal liver metastasis-specific genes.

Table 1 Identification of the hub liver metastasis-specific genes based on machine learning.

Model development and validation for identifying colorectal liver metastasis

Univariate logistic regression analysis showed good discrimination between liver metastases and primary colorectal tumors by hub colorectal liver metastasis-specific genes. Increased expression of SPP1 (OR = 1.554, 95%CI = 1.187–2.035, P = 0.001) was a risk factor for the development of colorectal liver metastases, while ZG16 (OR = 0.425, 95%CI = 0.292–0.619, P < 0.001), P2RY14 (OR = 0.052, 95%CI = 0.018–0.147, P < 0.001), and PRKAR2B (OR = 0.016, 95%CI = 0.004–0.070, P < 0.001) showed the opposite trend (Table 2).
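Odds ratios in logistic regression are the exponentiated coefficients, and an OR above 1 indicates a risk factor while an OR below 1 indicates a protective association. The Python sketch below shows the conversion and, as an assumption not stated in the text, treats the reported interval as a Wald interval (β ± 1.96 × SE). The coefficient and standard error are back-calculated from the reported SPP1 values purely for illustration.

```python
import math


def odds_ratio_with_ci(beta, se, z=1.96):
    """Odds ratio and Wald-type 95% CI from a logistic coefficient and its SE."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


# Back-calculated from the reported SPP1 result (OR = 1.554, 95% CI 1.187-2.035),
# assuming a Wald interval; these are not values taken from the model output.
beta_spp1 = math.log(1.554)
se_spp1 = (math.log(2.035) - math.log(1.187)) / (2 * 1.96)
or_spp1, ci_low, ci_high = odds_ratio_with_ci(beta_spp1, se_spp1)
```

Under this assumption the reconstructed interval agrees with the reported one to within rounding, which suggests the reported bounds are symmetric on the log-odds scale.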

Table 2 Univariate and multivariate logistic regression for predicting CRLM with genetic markers.

We further performed multivariate logistic regression analysis (Table 2) and developed a model. The regression coefficients of the genes (Supplementary Table 5) were used to calculate the linear predictors of each sample to identify colorectal liver metastasis. The model is shown in Fig. 4A, and its performance is detailed in Supplementary Table 6. The VIF of all four genes included in the model was less than 3, suggesting that there was no serious multicollinearity (VIF: SPP1 = 1.363, ZG16 = 1.674, PRKAR2B = 1.358, and P2RY14 = 2.412).

Figure 4

Model development and validation for identifying colorectal liver metastasis. (A) Nomogram for identifying colorectal liver metastasis based on four hub liver metastasis-specific genes by stepwise logistic regression. (B–G) Validation of the efficacy of identifying colorectal liver metastasis by calibration curve, ROC, and DCA curve in GSE41258 (B, training set, AUC = 0.987; C, validation set, AUC = 0.968; D, all samples, AUC = 0.983), GSE49355 (E, AUC = 0.942), GSE81558 (F, AUC = 0.961), and GSE68468 (G, AUC = 0.873); all AUC values refer to the linear predictors. ROC, Receiver operating characteristic; DCA, Decision curve analysis.

The chi-square value for the likelihood ratio test was 124.8 (P value < 0.001). The Hosmer–Lemeshow goodness-of-fit test showed no significant lack of fit (Chi-square = 1.2479, P = 0.9961). The apparent line was similar to the bias-corrected line and close to the ideal line (Fig. 4B). The C-index of the model was 0.987 (0.973–1.000), and the ROC and DCA curves suggested that the efficacy and net benefit of using the model to identify colorectal liver metastasis were better than those of any single hub gene (AUC: linear predictors = 0.987, PRKAR2B = 0.934, P2RY14 = 0.929, SPP1 = 0.682, ZG16 = 0.815). The validation set and full set, as well as the independent cohort validation results of GSE49355, GSE81558, and GSE68468, suggested that the model fit well and retained high discriminatory efficacy (Fig. 4C–G; AUC of linear predictors: validation set = 0.968, full set = 0.983, GSE49355 = 0.942, GSE81558 = 0.961, GSE68468 = 0.873).

Prognostic value of the liver metastasis-specific genes

As survival data from the GSE41258 cohort were incomplete, TCGA data were selected for prognostic analysis. Based on the univariate Cox regression analysis (Fig. 5A,B) and the level of differential expression between tumor and normal tissues (Fig. 5C), 204 liver metastasis-specific genes were filtered to obtain nine liver metastasis-specific prognostic genes. Five prognostic hub genes were identified by LASSO regression analysis (Fig. 5D), and risk scores were constructed (Fig. 5E, Supplementary Table 7). This risk score demonstrated a favorable prognostic value in the TCGA cohort (Fig. 5F) and the independent external validation cohort GSE39582 (Fig. 5G).

Figure 5

Prognostic value of the liver metastasis-specific genes. (A) Univariate Cox regression analysis was performed to screen prognostic liver metastasis-specific genes. (B) KM curve for predicting OS based on prognostic gene expression grouping. (C) Expression of prognostic liver metastasis-specific genes in unpaired normal and tumor samples from TCGA. (D) A total of 5 prognostic hub genes were identified by Lasso regression analysis and risk scores were constructed. (E) Distributions of risk scores, survival status and expression of 5 prognostic hub genes. KM curves of OS between low-risk and high-risk groups in TCGA (F, Training set) and GSE39582 (G, Testing set). (H) Prognostic genes with discriminatory efficacy for colorectal liver metastasis and primary tumor were defined as colorectal liver metastasis-specific genes, which were SPP1, ZG16 and P2RY14.

Eventually, we identified the prognostic genes with discriminatory efficacy for colorectal liver metastasis and primary tumors, defined as the hub colorectal liver metastasis-specific genes SPP1, ZG16, and P2RY14 (Fig. 5H).

Association of the hub liver metastasis-specific genes with immune infiltration and pathway enrichment analysis

We analyzed the relative infiltration of 22 immune cells in tumor samples from the GSE41258 cohort using the CIBERSORT algorithm. The levels of M0 and M2 macrophage (R = 0.359, P < 0.001) infiltration were higher in SPP1-positive tumors (Fig. 6A). Data from TCGA database similarly showed that the expression of SPP1 was positively correlated with the level of infiltration of macrophages (R = 0.792, P < 0.001, ssGSEA, Fig. 6B).

Figure 6

Association of the hub liver metastasis-specific genes with immune infiltration and pathway enrichment analysis. Correlations between SPP1 and tumor-infiltrating immune cells by CIBERSORT algorithm (A) and ssGSEA algorithm (B). (C) Association between SPP1 and M2 macrophage surface markers. Association between SPP1 and macrophage secretions, including VEGF (D) and TGF-β (E). (F, G) Ridge map of GSEA for DEGs from SPP1-positive tumors compared with SPP1-negative tumors.

Furthermore, we assessed the relationship between SPP1 and M2 macrophage surface markers (Fig. 6C) and found that the expression levels of CD163, CD115(CSF1R), PDL2(PDCD1LG2), and CD206(MRC1) positively correlated with the expression level of SPP1.

Moreover, we evaluated the relationship between SPP1 and M2 macrophage secretion, including VEGF and TGF-β (Fig. 6D,E). The expression levels of most M2 macrophage secretions (including VEGF-B, VEGF-C, VEGF-D, TGF-β1, TGF-β2, and TGF-β3) were higher in SPP1-positive tumors.

In addition, GSEA was performed to investigate the potential biological mechanisms by which SPP1 affects tumor progression and metastasis. Immunomodulatory and extracellular matrix remodeling-related pathways were significantly enriched in SPP1-positive tumors (Fig. 6F,G).

Cytologic experimental validation of hub colorectal liver metastasis-specific genes (SW620/SPP1)

The above bioinformatic analysis showed that SPP1 was significantly elevated in patients with colorectal liver metastasis and correlated with poor prognosis. SW620 cells, which originated from a lymph node metastasis, are highly invasive and metastatic and can rapidly spread to the liver. We therefore performed cytological experiments using SW620 cells to explore the role of SPP1 in tumor metastasis and progression.

We knocked down SPP1 by using siRNA with the highest knockdown efficiency (Supplementary Fig. 1). Transwell and wound healing assays showed that the downregulation of SPP1 significantly attenuated the migratory capacity of SW620 cells23,24 (Fig. 7A–C). The results of the EdU (Fig. 7D, Supplementary Fig. 2), CCK-8 (Fig. 7E), and colony formation assays (Fig. 7F) were consistent, suggesting that SPP1 promoted the proliferation of SW620 cells. The flow cytometry results (Fig. 7G) provided evidence that SPP1 knockdown led to increased apoptosis.

Figure 7

Cytologic experimental validation of hub colorectal liver metastasis-specific genes (SW620/SPP1). The cellular migration assay of SW620 cells treated with si-NC and si-SPP1 by transwell assays (A) and wound healing assay (B, C). The cellular proliferation assay of SW620 cells treated with si-NC and si-SPP1 by EdU assay (D), CCK-8 assay (E) and colony formation assay (F). The cellular apoptosis assay of SW620 cells treated with si-NC and si-SPP1 by flow cytometry (G).

Discussion

There remain challenges in the diagnosis and management of colorectal cancer liver metastases25. Clinical diagnosis of colorectal liver metastasis is usually made with the help of postoperative pathologic results, including the pattern of metastasis and the expression of CDX2 and SATB2 detected by immunohistochemistry5,26,27,28. If CDX2 and SATB2 are positive, there is a high confidence that the metastases originate from the colorectum4,29. In other words, liver metastases from colorectal cancer retain specific features of colorectal tumors.

In recent years, numerous studies have suggested reprogramming as a hallmark of tumor progression and metastasis, such as transcriptional and metabolic reprogramming30,31,32. Transcriptional reprogramming is the phenomenon of overall changes in gene expression caused by transcription factors33,34. In colorectal liver metastasis, FOXA2 and HNF1A promote liver metastasis by combining with epigenetic enhancers in liver metastatic cells, thereby inducing liver-specific transcription6. Metabolic reprogramming refers to the mechanism by which cells promote cell proliferation and growth by altering their metabolic patterns in order to satisfy their energy needs, which not only helps the cells to resist external stress but also confers new functions on the cells35,36. The Warburg effect in tumorigenesis is one of the most widely studied cellular metabolic reprogramming modes, whereby tumor cells rapidly proliferate by altering their energy metabolism to adapt to the microenvironment of hypoxia, acidity, and nutrient deficiencies37. In colorectal liver metastasis, MTA1 mediates mitochondrial metabolic reprogramming and drives liver metastasis by increasing oxidative phosphorylation and ATP production38,39. Differences in creatine metabolism between metastatic and primary colorectal tumors mediate SMAD2/3 phosphorylation, which promotes liver metastasis40. Hence, the concept of reprogramming suggests that colorectal liver metastases have a characteristic gene expression pattern different from that of primary tumors.

To clarify this seemingly paradoxical phenomenon, we screened and identified colorectal liver metastasis-specific genes using a machine-learning approach to investigate the mechanism of metastasis. In this study, we screened 204 DEGs between metastatic and primary tumors from the GSE41258 cohort. Nine colorectal liver metastasis-specific genes were identified using four machine learning methods. GSE41258 was divided into training and validation sets in a ratio of 7:3, and based on stepwise logistic regression analysis, four hub liver metastasis-specific genes, SPP1, P2RY14, ZG16, and PRKAR2B, were finally identified. We developed a CRLM identification model based on the four hub colorectal liver metastasis-specific genes, and the model efficacy was validated in the validation set, the full set, and the external cohorts GSE49355, GSE81558, and GSE68468 using ROC, calibration, and DCA curves. We explored the prognostic value of colorectal liver metastasis-specific genes in TCGA database because patient survival information in the GEO database was incomplete. Five key liver metastasis-specific prognostic genes, SPP1, ZG16, P2RY14, CYP2E1, and C5, were identified using the LASSO-COX algorithm, and a risk score was constructed. The risk score was validated in the GSE39582 cohort.

Three genes, ZG16, P2RY14, and SPP1, showed both discriminatory efficacy for liver metastasis and prognostic value. Evidence in the literature suggests that ZG16 is associated with distant and lymphatic metastasis, reduces PD-L1 expression, and may serve as a biomarker of immunotherapy benefit41, consistent with the bioinformatics analysis in this study. Furthermore, ZG16 inhibits stemness in colorectal cancer cells, thereby delaying colorectal cancer progression42. P2RY14 acts as a cancer suppressor in various tumors, such as ovarian cancer43, endometrial cancer44, lung cancer45 and colorectal cancer46,47, but most of these studies were based on bioinformatics analysis. In vitro experiments have demonstrated that activation of the P2Y14 receptor reduces IL-6 production, resulting in the inhibition of glioma progression48. SPP1 has been implicated in the PI3K/AKT pathway49 and β-catenin50 and mediates the epithelial-to-mesenchymal transition process in in vitro experiments of colorectal cancer. However, the cell lines selected for these studies, such as DLD1 and SW480, were non-metastatic. Therefore, the highly invasive and metastatic SW620 cell line was selected for the in vitro experiments on SPP1. The results indicated that knockdown of SPP1 expression repressed the proliferation and migration of SW620 cells, suggesting that SPP1 might act as a cancer-promoting factor in colorectal cancer.

To investigate the mechanism of SPP1 in the development of colorectal liver metastases, we performed immune infiltration analysis. The results showed a positive correlation between SPP1 and macrophage infiltration, especially M0 and M2 macrophages. Macrophages are a diverse group of leukocytes, with M0 macrophages forming M1 macrophages in response to activation by LPS and Th1 cytokines, such as IFN-γ and TNF-α, and M2 macrophages in response to activation by IL-4, IL-13, IL-10, IL-33, and TGF-β51,52. M1 and M2 macrophages infiltrating solid tumor tissues are referred to as tumor-associated macrophages (TAM), in which M2 macrophages with pro-tumorigenic activity are dominant53,54. Macrophages are extremely plastic, and M2 macrophages can be polarized into M1 macrophages with anti-tumor activity when stimulated by external conditions55. Based on this theoretical background, Egeblad reported an anti-tumor approach by reprogramming TAM56. Surface markers of macrophages are important for distinguishing M1 macrophages from M2 macrophages52. We further investigated the relationship between SPP1 and the surface markers of M2-type macrophages and found that the expression levels of CD163, CD115 (CSF1R), PDL2 (PDCD1LG2), and CD206 (MRC1) were positively correlated with the expression levels of SPP1. These results preliminarily validated the correlation between SPP1 and M2 macrophages. M2 macrophages can be divided into four cell subpopulations, M2a, M2b, M2c, and M2d, depending on the factors activated, in which VEGF secreted by cells of the M2d subpopulation plays a role in promoting angiogenesis and tumor progression51. To investigate whether SPP1 is involved in the pro-carcinogenic process of M2 macrophages, we evaluated the relationship between SPP1 and the VEGF family. The results showed that VEGF-B, VEGF-C, and VEGF-D had higher expression levels in SPP1-positive tumors. 
Moreover, we found a positive correlation between TGF-β secreted by M2c macrophages involved in apoptosis and the expression of SPP1. The results provide evidence that SPP1 may be involved in the pro-carcinogenic effects of M2 macrophages, which in turn affects tumor progression and metastasis.

Interestingly, the GSEA results demonstrated a correlation between SPP1 and scavenger receptor-associated immunoregulatory hallmarks, which was consistent with the immune infiltration analysis. Significant enrichment of SPP1 in extracellular matrix (ECM) remodeling-related pathways was also demonstrated by GSEA. Tumor-associated ECM degradation is the main cause of normal tissue destruction by tumor cells and infiltrative growth of tumor tissues57. The ECM also plays an extremely important role in tumor immunity, and tumor-associated ECM has the potential to improve tumor immunotherapy58. Not all patients with CRC respond well to immune checkpoint therapy in clinical care59,60, and nutritional61,62 and metabolic63 status are important factors that influence the prognosis of these patients. For example, the Royal Marsden Hospital score, a prognostic tool that combines nutritional and inflammatory markers, has shown its potential to predict survival in colorectal cancer and many other cancer types64.

To summarize, we hypothesized that SPP1-positive colorectal cancer promotes M2 macrophage polarization and recruitment to the tumor microenvironment, leading to ECM remodeling, which in turn promotes the development of CRLM (Fig. 8). This study has some limitations. First, we designed this study as a retrospective analysis based on data from TCGA and GEO databases. It may not be possible to homogenize the collection institutions and sequencing methods of the samples. Second, only one hub colorectal liver metastasis-specific gene (SPP1) was validated in in vitro studies. The results of bioinformatics analysis of SPP1-promoted M2 macrophage polarization in the progression of colorectal liver metastasis require further experimental validation.

Figure 8

The model for SPP1 in carcinogenesis of CRC.

Conclusion

In conclusion, machine-learning-based screening of colorectal liver metastasis-specific genes is feasible and accurate, and the models built using this method have good discriminatory efficacy. The hub CRLM-specific gene SPP1 can help determine the diagnosis, prognosis, and immune infiltration of patients with CRLM. However, understanding how SPP1 recruits macrophages in colon cancer and leads to ECM remodeling to mediate tumor metastasis may be a direction for future research.