Introduction

Heart failure (HF) represents the mid-to-late stage of various cardiovascular diseases and is characterized by high incidence, mortality, and hospitalization rates. It is a significant global public health issue, leading to decreased quality of life and increased mortality among patients1. Notably, HF predominantly affects the elderly, with incidence rising markedly after the age of 602. The aging process induces degenerative changes in cardiac structure and function, resulting in myocardial cell hypertrophy, increased cardiac afterload, and subsequent inadequate systemic perfusion, leading to symptoms such as dyspnea and fatigue3. Furthermore, owing to factors such as an aging population, widespread risk factors for cardiovascular diseases, and improved evidence-based treatments for HF patients, HF incidence and readmission rates continue to rise. Therefore, improving the quality of life of elderly HF patients, seeking effective diagnostic and therapeutic measures to reduce readmission rates, and alleviating the associated socioeconomic burden have become imperative for global public health reform.

As the understanding of the pathophysiological mechanisms of HF deepens, significant progress has been made in its treatment, with continuous advancements in both pharmacological and device-based therapies. Despite these advancements, there are still approximately 26 million cases of HF globally, with an estimated annual cost of $108 billion, accounting for 2% of the global healthcare budget, and this expenditure is projected to continue rising4. Furthermore, repeated hospital readmissions due to HF pose a significant challenge for cardiovascular physicians and patients worldwide, imposing enormous psychological, economic, and physical burdens. Patients often feel discouraged, anxious, and helpless, and may even experience depression, leading to a vicious cycle of repeated hospitalizations5. Research indicates that approximately 18.2% of patients are readmitted within 30 days, and 31.2% within 90 days6. Moreover, more than 50% of these patients experience at least one readmission within 1 year7. Evidence-based medicine suggests that about a quarter of HF readmissions are preventable, and that part of the economic burden on HF patients can be avoided by reducing readmission rates8. Consequently, researchers and policymakers have focused on efforts to reduce readmissions and avoid excessive healthcare expenditures. For instance, the United States introduced the Hospital Readmission Reduction Program (HRRP), which initially targeted HF among other conditions. This program measures hospital quality based on readmission rates and aims to lower them by financially penalizing hospitals with higher-than-average readmission rates9.

However, the incidence, prevalence, and survival rates of HF vary significantly across different countries and regions due to differences in genetics, cultural backgrounds, healthcare systems, and other factors. Additionally, there is a considerable variation in the levels of HF diagnosis and treatment among hospitals of diverse capabilities and resources10. For example, ischemic heart disease is predominant in North America and Europe, while valvular heart disease is more prevalent in the Asia–Pacific and East Asia regions11. Predictive tools for HF developed in other regions may have limited applicability. Consequently, it is crucial to develop precise 1-year all-cause readmission rate risk prediction models tailored to Chinese HF patients and explore their influencing factors. This is essential for determining clinically targeted interventions and developing personalized treatment plans to effectively reduce the readmission rate of elderly HF patients in China.

Assessing the risk of patient readmission is a challenging task that involves processing a large amount of clinical data and feature information. Traditionally, model features have been selected by experts, such as the widely used LACE model for predicting the 30-day readmission or mortality risk for HF patients12. In recent years, with the advancement of computer science, ML and artificial intelligence have demonstrated unique advantages in the prevention, diagnosis, and prognosis prediction of cardiovascular diseases due to their powerful ability to capture non-linear relationships13,14,15. For instance, Sharma compared ML models and the LACE score in predicting the 30-day unplanned readmission risk for HF patients. The study revealed that the XGBoost method can uncover more predictive factors, showing significant advantages over non-machine learning methods such as the LACE score16. In another study, Okabe et al.17 constructed a Cox proportional hazards model, indicating that worsening renal function (WRF) combined with elevated BNP at discharge is associated with increased cardiovascular mortality and HF readmission rates within 1 year. Furthermore, Chinese scholars developed an easy-to-use nomogram using a Cox regression model in a prospective cohort study of HF patients aged 65 and above, aiming to identify factors predicting 30-day and 1-year readmission or mortality for elderly HF patients in China18. Tong et al.19 compared the performance of traditional multivariate Cox regression models with three ML models at different time stages based on AUC value and Brier score. They found that models based on random survival forest and gradient boosting outperformed the traditional linear Cox model in terms of discriminative ability. Additionally, Rahman et al.20 conducted a study using electronic health record (EHR) data and trained a stacking ML model using results from three top-performing models. 
The model achieved an AUC of 0.881, indicating the effectiveness of the proposed model in predicting emergency readmissions for HF patients.

While current research demonstrates the widespread application of ML in medical research, there remains a notable gap in the evidence-based application and interpretability of many ML risk prediction models for clinical disease prognosis15,21. Additionally, relying solely on ML algorithms for feature selection requires considerable computational time and may overlook the clinical expertise and specialized knowledge of cardiology experts. Conversely, relying solely on human experts for feature selection may result in the loss of valuable information. Furthermore, within the traditional statistical framework, high-dimensional non-linear relationships between clinical data and outcomes cannot be effectively identified22. To address these challenges, this study compares models constructed from feature sets selected by human experts and by ML algorithms. By combining the strengths of both approaches in feature selection, a human–machine collaboration set is created to predict readmission within one year for elderly patients with HF. This method helps to clarify the respective advantages and limitations of ML-selected and expert-selected features in constructing predictive models for readmission in elderly HF patients. Additionally, based on previous experience, this study applies SHAP values to the human–machine collaboration framework, providing intuitive explanations for the predictive results. This approach helps clinical practitioners assess prognostic factors for the disease more accurately, thereby offering crucial insights for reducing patient readmission rates.

Methods

Research design

Initially, we identified elderly patients with HF admitted to the Affiliated Hospital of North Sichuan Medical College from January 1, 2018, to December 31, 2021. We collected their demographic details, clinical complications, laboratory results, and medication records during their hospitalization. Subsequently, we pre-processed the data using several techniques: direct deletion, multicollinearity testing, imputation with the K-nearest neighbors (KNN) algorithm, and Min–Max normalization. Following this, we employed a stratified random sampling approach to partition the patient data into a training set (70%) and a validation set (30%). We then assessed the importance of variables using three ML algorithms and took the variables common to the top 50 of each algorithm as the machine group, while the features chosen by two cardiology experts constituted the expert group. Next, we generated receiver operating characteristic (ROC) curves using 10 ML algorithms in the validation set and computed the AUC, F1-score, and Brier score to assess model performance. Subsequently, we used the Fast and Lightweight AutoML Library (FLAML) to optimize hyperparameters for prediction models with an AUC value exceeding 0.8. The features selected by the expert and machine groups were combined and used in the hyperparameter-optimized models. Performance metrics including AUC, accuracy, sensitivity (recall), specificity, F1-score, and Brier score were calculated for each model. The entire process is depicted in Fig. 1.

Figure 1

The flow chart of this study.

Data extraction

This retrospective cohort study used data from multiple systems, including the hospital information system, laboratory information management system, picture archiving and communication system, and electronic medical record system of the Affiliated Hospital of North Sichuan Medical College. Patients aged 65 and above who were hospitalized with a primary diagnosis of HF from January 1, 2018, to December 31, 2021, were preliminarily screened. The study variables included demographic information, clinical complications, physical examination, laboratory data, echocardiography results, and inpatient medication usage. Patient follow-up for unplanned readmissions within 1 year was conducted through database searches, phone communication, and outpatient visits.

Data pre-processing

Patients who passed away or were lost to follow-up during the observation period were excluded from the study. To mitigate the impact of missing data, feature variables with a missing rate of 30% or more were removed, while those with a missing rate of less than 30% were imputed using the KNN algorithm (Fig. S1). Additionally, the correlation between variables was assessed using Pearson and Spearman tests, and any variable with a correlation coefficient of 0.8 or higher was removed to avoid multicollinearity. To address scale differences between variables and ensure result reliability, Min–Max normalization was applied to scale the data, enhancing comparability across data indicators. The dataset was then randomly divided into a training set and a validation set in a 7:3 ratio. The training set was used to develop the models, while the validation set was used to evaluate model performance.
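The missing-rate filter and Min–Max normalization described above can be illustrated with a minimal sketch. This is not the authors' code: the function names, the toy table, and the choice to represent missing values as None are assumptions made for illustration, and the KNN imputation and correlation filtering steps are not shown.

```python
from typing import Dict, List, Optional

Column = List[Optional[float]]  # None marks a missing value (assumption)


def missing_rate(column: Column) -> float:
    """Fraction of entries in a column that are missing."""
    return sum(v is None for v in column) / len(column)


def drop_sparse_features(table: Dict[str, Column],
                         threshold: float = 0.30) -> Dict[str, Column]:
    """Keep only features whose missing rate is below the threshold."""
    return {name: col for name, col in table.items()
            if missing_rate(col) < threshold}


def min_max_scale(values: List[float]) -> List[float]:
    """Rescale values linearly to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical toy table, not study data.
toy = {
    "hgb": [106.0, 121.0, None, 98.0, 130.0],     # 20% missing, kept
    "rare_marker": [None, None, 1.2, None, 0.8],  # 60% missing, dropped
}
kept = drop_sparse_features(toy)

# For brevity the missing entry is skipped here; the study imputes it with KNN.
scaled_hgb = min_max_scale([v for v in kept["hgb"] if v is not None])
```

In practice the minimum and maximum would usually be estimated on the training set only and then applied to the validation set, so that no information leaks from validation data; the text does not state which order was used.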

Feature selection and model construction

  1. Machine Group: To limit the overfitting and prediction bias of traditional statistical methods and to ensure the stability of the selected variables, XGBoost, random forest (RF), and LASSO regression analysis were employed. The variables common to the top 50 of each algorithm were selected as independent risk factors for all-cause readmission in elderly HF patients.

  2. Expert Group: Two cardiologists, drawing on clinical expertise and a review of relevant published predictive models, evaluated the clinical importance of the 73 feature variables retained after collinearity analysis and selected those they judged relevant to the risk of readmission in elderly HF patients. Importantly, this process was conducted without access to the machine feature set.

  3. Human–Machine Collaboration Group: The variables selected by the three ML algorithms and those selected by the experts were pooled as the final feature set for the human–machine collaboration group. Ten ML models were used for modeling: KNN, logistic regression (LR), support vector machine (SVM), RF, Gaussian naive Bayes (GaussianNB), decision tree (DT), gradient boosting decision tree (GBDT), light gradient boosting machine (LightGBM), extreme gradient boosting (XGBoost), and categorical boosting (CatBoost). Metrics such as AUC value, F1-score, and Brier score were calculated for each model. Subsequently, for human–machine collaboration models with an AUC value greater than 0.8 in the training dataset, hyperparameters were tuned with the “FLAML” framework: the number of random-search iterations was set to 10 and fivefold cross-validation was performed on the training data and labels to find the optimal hyperparameter combination and improve the predictive performance of the model.

After the above steps, 30 prediction models were built (three feature groups, each modeled with ten algorithms); the modeling workflow is shown in Fig. S2.
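The machine-group selection can be sketched as taking the top-ranked features from each algorithm and keeping those common to all three. The sketch below is illustrative only: the importance scores are invented, the helper names are hypothetical, and ranking LASSO coefficients by absolute value is an assumption rather than something the text specifies.

```python
from typing import Dict, List, Set


def top_k(importance: Dict[str, float], k: int) -> Set[str]:
    """Names of the k features with the largest absolute importance."""
    ranked = sorted(importance, key=lambda name: abs(importance[name]),
                    reverse=True)
    return set(ranked[:k])


def common_top_features(rankings: List[Dict[str, float]], k: int) -> Set[str]:
    """Features that appear in the top k of every ranking."""
    selected = [top_k(ranking, k) for ranking in rankings]
    return set.intersection(*selected)


# Hypothetical importance scores, not values from the study.
rf = {"HGB": 0.30, "NT-proBNP": 0.28, "LVEF": 0.10, "Age": 0.05}
xgb = {"Smoking": 0.25, "NT-proBNP": 0.20, "HGB": 0.15, "LVEF": 0.02}
lasso = {"HGB": -0.40, "LVEF": -0.12, "NT-proBNP": 0.03, "Age": 0.01}

# The study uses k = 50 over 73 candidate features; k = 3 keeps the toy small.
machine_group = common_top_features([rf, xgb, lasso], k=3)
```

Requiring agreement across three different selectors trades recall for stability: a feature that only one algorithm ranks highly is dropped.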

Model evaluation and result interpretation

In this study, conventional model evaluation metrics such as AUC value, accuracy, precision, recall (sensitivity), and specificity, along with the Brier score and F1-score, were used to provide a comprehensive assessment of model performance. The F1-score, the harmonic mean of precision and recall, offers a robust measure of model performance, with a higher score indicating greater robustness. The Brier score, the mean squared difference between predicted probabilities and observed outcomes, reflects the model's calibration; it ranges from 0 to 1, with lower values indicating better predictive accuracy. To further compare the performance of models constructed by different methods, we used the DeLong test to assess the significance of differences in AUC between models23. Furthermore, interpreting the results of ML predictive models and elucidating the importance of individual predictors remain ongoing challenges. To address this, we used SHAP, a post-hoc explanation technique, to generate SHAP values for each input feature, quantifying its impact on predictions. Consequently, our study integrated the top-performing predictive model with the SHAP framework to enhance result interpretation.
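As a worked illustration of two of these metrics, the following sketch computes the F1-score and the Brier score from their standard definitions. The labels and probabilities are invented, and the 0.5 classification threshold is an assumption; the text does not state the threshold used.

```python
from typing import List


def f1_score(y_true: List[int], y_pred: List[int]) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def brier_score(y_true: List[int], y_prob: List[float]) -> float:
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)


# Hypothetical outcomes (1 = readmitted) and predicted probabilities.
y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.8, 0.6]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]  # assumed 0.5 threshold

f1 = f1_score(y_true, y_pred)         # precision = recall = 2/3
brier = brier_score(y_true, y_prob)   # 0.81 / 5
```

Note that the F1-score depends on the chosen threshold, whereas the Brier score is computed directly from the predicted probabilities.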

Ethical approval

This study was conducted in accordance with the Declaration of Helsinki. The study protocol was approved by the Ethics Committee of the Affiliated Hospital of North Sichuan Medical College (2023ER331-1). Because this study solely involved the analysis of patients' historical clinical data, informed consent was waived by the Ethics Committee of the Affiliated Hospital of North Sichuan Medical College.

Statistical analysis

All statistical analyses and calculations were performed using Python (version 3.11) and the following packages: “pandas”, “os”, “numpy”, “warnings”, “sklearn”, “CatBoost”, “lightGBM”, “matplotlib”, “XGBoost”, “FLAML”, and “SHAP”. R (version 4.2.3) and the “CompareGroups”, “tableone”, and “glmnet” packages were also used for specific analyses. Continuous variables that followed a non-normal distribution are presented as median and interquartile range [M (P25, P75)], and groups were compared using the Mann–Whitney U non-parametric test. Categorical variables are presented as frequency and percentage (n, %), and between-group comparisons were conducted using the χ2 test. All tests were two-sided, and P values < 0.05 were considered statistically significant.

Results

Patient characteristics

After excluding 821 deceased patients and 253 patients lost to follow-up, a total of 8396 elderly patients with HF were included in this study. Among them, 3978 patients (47.3%) experienced unplanned readmission within 1-year. The average age of the patients was 74.00 years, with 4653 male patients (55.4%) and 3743 female patients (44.6%). Furthermore, the average HGB level for readmitted patients was 106.000 g/L (95% CI 87.000–124.000), which was lower than the non-readmission group at 121.000 g/L (95% CI 108.000–133.000) (P < 0.001). The average N-terminal pro-B-type natriuretic peptide (NT-proBNP) level for readmitted patients was 2169.853 pg/mL (95% CI 1525.750–3364.000), which was significantly higher than the non-readmission group at 1355.100 pg/mL (95% CI 857.775–2738.789) (P < 0.001). No differences were observed between the groups in terms of gender and history of hypertension. Table 1 presents a detailed comparison of clinical characteristics between the readmission and non-readmission groups.

Table 1 Baseline characteristics of older patients with heart failure.

Feature variable selection

This study initially included 87 variables based on database information. Subsequently, 8 feature variables with a missing rate greater than 30% were removed. After conducting Pearson and Spearman correlation tests, and based on clinical experience, 6 feature variables with correlation coefficients ≥ 0.8 were also removed. Following these steps, a total of 73 features were included in the study, covering basic personal information (n = 8), vital signs (n = 6), medical history (n = 5), laboratory indicators (n = 40), echocardiography data (n = 11), and in-hospital medication use (n = 3).

  1. Machine group: To ensure the reliability and robustness of the included features, this study selected the top 50 variables by importance score from each of the XGBoost, RF, and LASSO regression analyses as candidate prognostic features for the machine group. Figure 2 shows the top 30 features. Among the variables selected by RF (Fig. 2A), the importance of NT-proBNP and HGB was markedly higher than that of the other indicators. In the XGBoost algorithm (Fig. 2B), smoking history was ranked as the most important, followed by NT-proBNP, diuretic, HGB, and NYHA classification. LASSO regression selects relevant variables by adding a penalty function that continuously shrinks the coefficients; its results show that HGB was the most important variable, while NT-proBNP ranked poorly, in fifteenth place (Fig. 2C). A total of 32 variables common to all three algorithms were selected.

  2. Expert group: Based on their clinical experience and knowledge, the experts selected 25 variables, including age, gender, length of stay, medical history, blood indicators, renal function, blood lipids, in-hospital medication history, echocardiography data, and HF biomarkers. Notably, these features mainly focused on demographic information and laboratory data, excluding information on patients' physical examinations (Table S1).

  3. Human–machine collaboration group: The feature variables selected by the machine group and the expert group were combined to form the collaboration group. The collaboration group comprised a total of 46 features, including 11 features common to both groups, 21 variables unique to the machine group, and 14 unique to the expert group. In terms of the categories of selected variables, the expert group had slightly fewer variables than the machine group, but the patient information it covered was more balanced (Table S1).
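The group sizes reported above are consistent with the collaboration group being the set union of the machine and expert groups. A brief check, using placeholder feature names (only the counts 11, 21, and 14 come from the text):

```python
# Placeholder feature names; only the group sizes are taken from the text.
shared = {f"shared_{i}" for i in range(11)}
machine_only = {f"machine_{i}" for i in range(21)}
expert_only = {f"expert_{i}" for i in range(14)}

machine_group = shared | machine_only   # 11 + 21 = 32 features
expert_group = shared | expert_only     # 11 + 14 = 25 features

# Union of both groups: 32 + 25 - 11 = 46 features.
collaboration_group = machine_group | expert_group
```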

Figure 2

Importance ranking of features in three ML algorithms. (A) Random forest, (B) XGBoost, (C) LASSO regression.

Performance of expert group, machine group, and human–machine combination group models

Subsequently, the performance of the models was assessed using ten ML algorithms on the validation set, based on the three groups of research variables mentioned earlier. The corresponding ROC curves for each model are presented in Fig. 3. Moreover, various metrics were calculated to comprehensively evaluate the models' performance (Table S2). Notably, among the human–machine collaboration models, five ML algorithms exhibited an AUC value exceeding 0.8, namely RF (0.81575), XGBoost (0.81529), LightGBM (0.82921), GBDT (0.83213), and CatBoost (0.83447), as depicted in Fig. 3C. Upon selecting these five models and utilizing the “FLAML” framework to optimize the hyperparameters, their performance improved, resulting in higher AUC values of RF (0.82436), XGBoost (0.83272), LightGBM (0.83341), GBDT (0.83300), and CatBoost (0.83617), as shown in Table 2 and Fig. 3D. Comparatively, CatBoost demonstrated the highest AUC value, the lowest Brier score, and exceptional performance in terms of sensitivity, accuracy, specificity, and F1-score. Therefore, CatBoost emerged as the best ML algorithm in this study.

Figure 3

ROC curves were constructed by three groups using ten different machine learning algorithms. (A) Expert group, (B) machine group, (C) pre-optimization human–machine collaboration group, (D) post-optimization human–machine collaboration group.

Table 2 Five optimized ML algorithms in a human–machine collaboration group.

Performance comparison of three group models

In terms of model performance, some models developed by the expert group demonstrate higher accuracy and sensitivity compared to those of the machine group. However, overall, the machine group's performance was generally superior to that of the expert group (Table S2A,B). Specifically, the AUC value of CatBoost algorithm in the machine group surpasses that of the expert group (Z value = 3.00058, P < 0.05). Similarly, the AUC value achieved by the CatBoost algorithm in the human–machine collaboration group significantly outperforms that of the expert group (Z value = 4.10028, P < 0.05). Additionally, the CatBoost model developed by the human–machine collaboration group exhibits slightly higher AUC value than the machine group, along with notably improved accuracy, sensitivity, specificity, and F1-score, and a lower Brier score (Table 3). Consequently, it can be inferred that the performance of the CatBoost model developed by the human–machine collaboration group is the best among the three groups.

Table 3 Performance comparison of CatBoost prediction models constructed by three groups.

Model result interpretation

In this study, the SHAP framework was employed to compute importance scores (SHAP values) based on the intrinsic attributes of ML algorithms. The magnitude of SHAP values is used to evaluate the contribution of each predictive feature to patient readmission, offering an intuitive explanation for the study’s predictive outcomes. Figure 4 presents the top 20 variables sorted by SHAP values in the CatBoost model developed through human–machine collaboration. The features are listed in descending order of importance, and the top 5 variables were recognized by both the expert and machine groups. To visually illustrate the impact of variables, the study provides two typical examples that demonstrate the model's interpretability. The features indicated by the red arrows increase the risk of readmission within one year for elderly HF patients, while those indicated by the blue arrows decrease it. As depicted in Fig. 5, for a high-risk readmission patient with a base value of 0.144, the predicted probability of unplanned readmission within 1 year is 91.96%, accompanied by an output SHAP value of 2.44. Conversely, for a low-risk readmission patient, the predicted probability is 14.42%, with an output SHAP value of -1.78. Fig. S3 presents a combined force plot of the interpretations for the top 200 samples, arranged side by side according to sample clustering similarity, with the Y-axis corresponding to the X-axis of a single force plot. The trend of the broken line indicates that the readmission risk for the first 50 patients is lower than that of the last 150 patients. Furthermore, Fig. 6 illustrates the impact of the top 5 variables on the CatBoost model predictions, each shown together with the variable with which it has the strongest interaction effect. For example, Fig. 6A shows that HGB interacts most strongly with HCT and exhibits a negative trend between HGB and the predicted outcome. 
As HGB increases, SHAP values show a significant decreasing trend, indicating a lower risk of readmission. Additionally, as illustrated by the color gradient in the figure, individuals with a high HCT and an HGB of less than 100 generally have larger SHAP values than those with a low HCT, indicating a higher risk of readmission. The opposite is true for individuals with HGB greater than 100. Figure 6D shows that for NYHA class 1 and 2, NT-proBNP levels are relatively lower, corresponding to lower SHAP values. However, for NYHA class 3 and 4, higher NT-proBNP levels (more red dots) are observed, reinforcing the increased risk of readmission. Additionally, we demonstrate the output of SHAP value under the interaction of HGB and NT-proBNP (Fig. S4), the color gradient shows that individuals with higher NT-proBNP levels (purple to red dots) tend to have higher SHAP values at lower HGB levels. Conversely, at higher HGB levels, the dots are mostly blue, indicating lower NT-proBNP levels, which correspond to lower SHAP values, indicating a lower risk of readmission.
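The two individual predictions reported for Fig. 5 can be related to their output SHAP values with a short calculation. The sketch assumes that the force-plot output is expressed in log-odds units, which is a common default for tree-based classifiers but is not stated in the text; under that assumption the logistic function maps the output value to a probability.

```python
import math


def log_odds_to_probability(log_odds: float) -> float:
    """Logistic (sigmoid) transform from log-odds to probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Output values reported in the text for the two example patients.
high_risk = log_odds_to_probability(2.44)    # reported probability: 91.96%
low_risk = log_odds_to_probability(-1.78)    # reported probability: 14.42%
```

Both results agree with the reported probabilities to within the rounding of the two-decimal output values, which supports reading the SHAP contributions as additive shifts in log-odds rather than in probability.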

Figure 4

SHAP interpretation of CatBoost model constructed by human–machine collaboration group. (A) The importance ranking of the model prediction features. The horizontal coordinate represents the SHAP values, the larger SHAP value indicates that the variable is more important; (B) Each point represents a feature value, and different colors represent the final influence of the feature on the model output results, where red represents a larger value and blue represents a smaller value.

Figure 5

Individualized predictive outcomes for two patients. (A) High-risk patient, (B) Low-risk patient.

Figure 6

SHAP dependence plot for key variables. The impact of the top five variables (HGB, NT-proBNP, Smoking history, NYHA classification, LVEF) ranked by importance, along with the variables most associated with their respective interactions on the SHAP values. Each point represents an individual patient in the dataset.

Discussion

Heart failure is a disease that poses a threat to global public health security and exhibits epidemiological variances and risk factors across different countries and regions. The characteristics of HF in developing countries differ from those in the western world, leading to differences in incidence and readmission rates24. In China, both the prevalence and incidence of HF increase with age, with an average per capita hospitalization cost of $4406.8. Furthermore, 40.5% of hospitalized patients have been admitted more than three times, indicating the persistent high mortality and readmission rates among heart failure patients25. Consequently, an interpretable human–machine collaboration predictive model has been developed, showing promise in predicting the risk of readmission for elderly HF patients within 1-year.

Machine Learning (ML), as a branch of artificial intelligence, encompasses a multidisciplinary field involving statistics, probability, and computer science. It plays a significant role in data preprocessing, statistical analysis, and predictive performance by learning from and making predictions and judgments based on large amounts of computer data26. Common ML algorithms are primarily divided into two categories: supervised learning and unsupervised learning. This includes methods like KNN, Artificial Neural Networks (ANN), SVM, DT, and RF, among others27. ML has been widely applied in the medical field, with multiple studies indicating that it can assist healthcare professionals in disease diagnosis, prognosis analysis, and risk stratification from complex medical data, thereby identifying high-risk factors in patients and establishing preventive measures28,29,30. However, it is important to note that the purpose of ML is to assist physicians in prognostic assessment and efficacy analysis of diseases, rather than to replace medical professionals as independent predictive tools31. The results obtained still require evaluation and optimization by qualified physicians, rather than blindly relying on algorithmic outcomes.

Nevertheless, current research primarily focuses on exploring predictive factors for various diseases through the combination of ML algorithms and clinical data, with few studies comparing or combining features selected by human experts with those chosen by ML32,33. To address this gap, our study aims to explore the performance of ML methods in predicting readmission of HF patients in collaboration with human experts, constructing a human–machine collaboration model to predict the likelihood of readmission within 1-year for elderly HF patients.

In this study, we conducted a comprehensive technical review of the data, using the KNN algorithm to address missing data and Pearson and Spearman tests to address highly collinear variables. The processed data encompassed 73 features, covering a wide range of information such as patients’ demographic information, vital signs, laboratory tests, echocardiographic data, and comorbidities. However, due to the varying scales of these features, achieving comparability among variables posed a challenge. Direct analysis of the original data might disproportionately emphasize larger values while overlooking smaller ones, significantly impacting subsequent modeling performance. Therefore, we employed Min–Max normalization to standardize the raw data34. For models with an AUC value exceeding 0.8 in the human–machine collaboration group, we further enhanced their performance using the “FLAML” library. This library operates on the principle of automated machine learning, automatically selecting and adjusting machine learning algorithm hyperparameters on a given dataset to identify the best model, thereby enhancing the efficiency and accuracy of machine learning. Within the FLAML framework, the optimal model search not only takes into account model performance but also considers the time cost and computational complexity of model training, thereby optimizing computing resources35,36. Additionally, in the comparison of ten ML algorithms, the CatBoost algorithm demonstrated exceptional performance. CatBoost is an ensemble learning algorithm that improves on GBDT, leveraging symmetric decision trees and ordered boosting to mitigate model overfitting and effectively handle categorical features, thereby enhancing prediction accuracy and generalization37. Our findings align with these advantages, indicating that CatBoost can effectively enhance model performance and optimize risk assessment decisions for HF patient readmissions compared to several other ML algorithms.

This study used three ML algorithms to select the machine group variables (n = 32), which covered a wider range than the variables selected by the human expert group (n = 25), and the machine models performed better on the validation set. However, the overall inclusion of risk factor categories remained relatively narrow. Through the human–machine collaboration model, the precision of human expert selection was combined with the depth of machine selection, thereby expanding the range and accuracy of prognostic indicators for HF from multiple perspectives. Additionally, the study not only focused on AUC, the most important index of discrimination, but also calculated other indicators of model prediction ability, such as accuracy, sensitivity, specificity, F1-score, and Brier score. Compared with 1-year HF readmission risk models constructed with other ML approaches, the human–machine collaboration model performed better38,39. These results indicate that clinical practitioners should familiarize themselves with the basic concepts and metrics of ML and embrace the idea of constructing predictive models through a combination of human expert and machine selection.

In previous studies, the interpretability of ML models has been a significant obstacle. This study tackled this issue by employing SHAP for model interpretability. SHAP, introduced by Lundberg et al.40, is an algorithm designed to explain “black box” models. It calculates Shapley values for each feature in the predictive model, averaging the feature's marginal contribution over all feature combinations to give consistent and accurate attributions that reflect its contribution to the predictions of the entire model. The SHAP force plot explains individual instances in the dataset, revealing how different features influence the predicted values, while the SHAP summary plot aggregates multiple features to provide a global view. By outputting the SHAP values of each feature for every sample in the dataset, the importance of different features for the model can be visualized. In this research, the SHAP framework was applied to the CatBoost model, a complex ensemble technique, revealing that HGB levels and NT-proBNP concentrations are the most important biomarkers for predicting readmission of elderly HF patients within 1 year. Smoking history, LVEF, and NYHA classification also emerged as key factors, with their feature values impacting patient readmission risk. Additionally, certain measurements such as PAD, LVDD, DBiL, FT4, and FBG, which have been infrequently reported in previous related prediction model studies, were found to make significant contributions to the prediction in this research. From a clinical prognosis perspective, these variables may be potential factors in HF patients’ readmission, warranting further investigation through clinical trials to validate their relevance.
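To make the idea of a Shapley value concrete, the sketch below computes exact Shapley values for a two-feature toy model by averaging each feature's marginal contribution over all feature orderings. It is an illustration of the underlying definition only, not the tree-based algorithm used by the SHAP library; the toy model, the feature names, and the use of a single baseline to stand in for "absent" features are all assumptions.

```python
from itertools import permutations
from typing import Callable, Dict


def shapley_values(model: Callable[[Dict[str, float]], float],
                   x: Dict[str, float],
                   baseline: Dict[str, float]) -> Dict[str, float]:
    """Exact Shapley values by enumerating every feature ordering.

    A feature counts as "absent" when it takes its baseline value.
    The cost grows factorially, so this only suits very small models.
    """
    names = list(x)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        previous = model(current)
        for name in order:
            current[name] = x[name]           # switch this feature "on"
            value = model(current)
            totals[name] += value - previous  # marginal contribution
            previous = value
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_model(f: Dict[str, float]) -> float:
    """Hypothetical risk score with an interaction between two features."""
    return 2.0 * f["a"] + 1.0 * f["b"] + 4.0 * f["a"] * f["b"]


x = {"a": 1.0, "b": 1.0}
baseline = {"a": 0.0, "b": 0.0}
phi = shapley_values(toy_model, x, baseline)
```

The attributions sum to the difference between the prediction for x and the prediction for the baseline, and the interaction term is split equally between the two features; this additivity is the property that force plots rely on.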

Limitations

This study, despite offering significant contributions, has certain limitations that merit attention. Firstly, it is a single-center retrospective study in a specific region of China, which underscores the need for external validation in diverse countries and regions to establish the model’s generalizability on a global scale. Moreover, different subtypes of HF vary in pathogenesis, clinical presentation, and treatment approaches. This investigation did not separately assess the prognosis of different types of HF, suggesting an area for refinement in future studies. Accounting for the distinct phenotypes of HF patients and incorporating these variations into subsequent model enhancements could improve predictive accuracy. Additionally, deep learning approaches for constructing medical risk models have been reported to yield improved predictive results, representing a promising avenue for future research.

Conclusion

In this study, we conducted a comparative analysis of models constructed using both human expert and machine selection methods to predict readmissions among elderly HF patients. Our findings underscore the effectiveness of combining variables selected by both methods to create a human–machine collaboration model, which encompassed a broader range of variable features and demonstrated superior predictive performance. Furthermore, using SHAP for model interpretation and visualization enabled a detailed exploration of risk factors for readmission among elderly HF patients. This approach not only enhances the model's credibility in identifying high-risk elderly HF patients but also supports the development of personalized treatment plans to improve HF prognosis management.