Abstract
Forecasting of severe acute graft-versus-host disease (aGVHD) after transplantation is a challenging ‘large p, small n’ problem that suffers from nonuniform data sampling. We propose a dynamic probabilistic algorithm, daGOAT, that accommodates sampling heterogeneity, integrates multidimensional clinical data and continuously updates the daily risk score for severe aGVHD onset within a two-week moving window. In the studied cohorts, the cross-validated area under the receiver operator characteristic curve (AUROC) of daGOAT rose steadily after transplantation and peaked at ≥0.78 in both the adult and pediatric cohorts, outperforming the two-biomarker MAGIC score, three-biomarker Ann Arbor score, peri-transplantation features-based models and XGBoost. Simulation experiments indicated that the daGOAT algorithm is well suited for short time-series scenarios where the underlying process for event generation is smooth, multidimensional and where there are frequent and irregular data missing. daGOAT’s broader utility was demonstrated by performance testing on a remotely different task, that is, prediction of imminent human postural change based on smartphone inertial sensor time-series data.
Similar content being viewed by others
Main
To this day, severe acute graft-versus-host disease (grade III–IV aGVHD) remains a leading cause of death after allogeneic hematopoietic stem cell transplantation (allo-HSCT)—a last-resort treatment for many blood diseases—with a transplant-related mortality rate as high as ~30% within 100 days1.
Previous algorithms for forecasting severe aGVHD were usually based on peri-transplantation features (including recipient, donor and transplantation procedural parameters) or ‘landmark’ biomarker analysis (designating a specific time point, post-transplant, for plasma biomarker analysis). Area under the receiver operator characteristic curve (AUROC) scores of models using only peri-transplantation features were reported to be ~0.62, even when data from more than 20,000 patients were available2,3. For landmark analysis, it has been reported in at least some study cohorts that suppression of tumorigenicity 2 (ST2)—by far the most promising biomarker for predicting treatment response ‘after’ aGVHD onset4,5—had either no substantial association with aGVHD6 or a low AUROC (0.56) for forecasting severe aGVHD5 if measured at days 11 to 17 post-transplant (that is, ‘before’ aGVHD onset). One study of a Japanese cohort, however, did show association (AUROC 0.66) between grade II–IV aGVHD and ST2 measured on day 147. A two-biomarker model using ST2 and regenerating islet-derived 3-alpha (Reg3α) measured on day 7 post-transplant—the MAGIC score—was shown to predict six-month non-relapse mortality (NRM; AUROC 0.68)8. Forecasting NRM, however, was not equivalent to forecasting severe aGVHD. Using data reported by Hartwell et al.8 (their table S6), one could calculate that the proportion of NRM cases attributed to aGVHD was statistically indistinguishable between MAGIC score-stratified ‘high-risk’ and ‘low-risk’ groups (64% (41/64) versus 57% (47/83); P = 0.458, χ2 test). In contrast, if computed ‘at’ or ‘after’ aGVHD onset, multi-biomarker scores were shown to be efficacious in predicting treatment response and long-term outcome8,9,10.
Few studies have used time-series of all available patient information for forecasting severe aGVHD. One recent study applied penalized logistic regression to vital signs (body temperature, heart rate and so on) that were consistently recorded within the first 10 days after HSCT and achieved an AUROC value of 0.66 for forecasting grade II–IV aGVHD11. Modeling in HSCT—unlike in more common medical situations—is challenged by the small sample size (n), high feature number (p) and nonuniform data sampling12. This study thus aims to utilize evidence from all available dynamic variables to forecast severe aGVHD better.
We compiled and curated the post-transplant multidimensional time-series data of patients treated with human leukocyte antigen (HLA)-mismatched allo-HSCT using stem cells derived from peripheral blood, bone marrow or both at the Institute of Hematology, Chinese Academy of Medical Sciences (IHCAMS) between 2012 and 2021 (hereafter referred to as the ‘aGOAT’ (aGVHD Onset Anticipation Tianjin) dataset).
aGOAT contained data from 584 adult and 45 pediatric cases, and 16% of the adult cohort and 24% of the pediatric cohort suffered from severe aGVHD (Supplementary Table 1). There was a substantial difference in overall survival between the severe aGVHD cases and the other patients in the adult cohort (Fig. 1a). aGOAT encompassed a total of 194 dynamic variables for the adult cohort and 159 dynamic variables for the pediatric cohort (Supplementary Table 2). The dynamic variables were not measured uniformly across all the patients (Fig. 1b and Supplementary Table 3). Fifteen peri-transplantation variables were also included in aGOAT (Supplementary Table 4).
We have devised a dynamic probabilistic model—‘daGOAT’ (dynamic aGVHD Onset Anticipation Tianjin)—that integrates multidimensional time-series data to calculate the risk for severe aGVHD. Our model updates the risk score φi(t) for developing severe aGVHD between t + 1 and t + δ according to
where ρ(zi) and θk(xik(τ), τ) define the contribution of all peri-transplantation features zi and the contribution of the individual dynamic variable xik(τ), respectively, to the relative risk of the ith patient developing severe aGVHD between t + 1 and t + δ.
The motivation behind equation (1) was to first highlight short-stretch temporal patterns that were suggestive of prodromes for severe aGVHD, with δ being the length of the moving time window (we set δ = 14). The higher the φi(t), the likelier there was a prodrome between τ = t − δ + 1 and τ = t. Considering the evolving nature of immune reconstitution after HSCT, we assumed the development of severe aGVHD was not a stationary process and defined \({\theta _k}{\left( \cdot \right)}\) to be a function of time. In other words, the same value for the same clinical feature might have different implications at different times after transplantation. The final risk score on each day was highly dependent on all available clinical information and computed scores in previous days, so, in this sense, daGOAT is an adaptive algorithm.
We compared daGOAT to two landmark-specific plasma biomarker-based models (the two-biomarker MAGIC score8 and the three-biomarker Ann Arbor score (based on ST2, Reg3α and tumor necrosis factor receptor 1, TNFR1)9), two peri-transplantation features-based models (fitted using Naïve Bayes (‘PeriHSCT-NB’) or Random Forest (‘PeriHSCT-RF’)) and XGBoost (a gradient-boosting tree algorithm that permits data missing)13.
In the adult cohort, plasma biomarker data at days 6–8 were available for 67 patients. The MAGIC score achieved AUROCs of 0.86 and 0.49 for six-month NRM and severe aGVHD, respectively, and the Ann Arbor score achieved AUROCs of 0.59 and 0.50 for six-month NRM and severe aGVHD, respectively (Fig. 1c). The pediatric cohort’s biomarker data were too small in size to test the biomarker models.
To evaluate daGOAT, PeriHSCT-NB, PeriHSCT-RF and XGBoost, the patients who received HSCT before (excluding) 1 December 2020 were designated as training sets and the rest of the patients as test sets. Internal validation within the training sets (a series of temporal splitting within the training sets) and holdout validation on the test sets were then conducted.
In the adult cohort, the Q1 (25th percentile), Q2 (median) and Q3 (75th percentile) time points for disease onset were days 24, 29 and 39, respectively. In internal validation (n = 519), daGOAT’s area under the precision-recall curve (AUPRC) peaked at 0.42 (mean; s.d., 0.39 (range, 0.06–1.00)) on day 23, surpassing all the other models (Extended Data Fig. 1a,b). It is worthwhile noting, however, that the PeriHSCT models outcompeted daGOAT during the first two weeks after transplantation. AUROC values showed a trend similar to AUPRC values. daGOAT’s AUROC peaked at 0.78 (mean; s.d., 0.17 (range, 0.58–1.00)) on day 23 (Extended Data Fig. 1c and Supplementary Fig. 3). In holdout validation (n = 65), daGOAT’s AUPRC and AUROC peaked at 0.82 and 0.94, respectively, on day 23, outcompeting all the other models (Extended Data Fig. 2a–c and Supplementary Fig. 3). Again, the PeriHSCT models surpassed daGOAT during the first two weeks (Extended Data Fig. 2a and Supplementary Fig. 3). For the adult cohort, on day 23, hazard ratios (HRs) between high-risk (risk score top 1/6) and low-risk (risk score bottom 5/6) patients according to daGOAT were 2.07 (95% confidence interval (CI), 1.03–4.15) and 18.40 (95% CI, 3.95–85.60) in internal and holdout validations, respectively, in the ensuing two-week window (Extended Data Figs. 1d and 2d).
In the pediatric cohort, the Q1, Q2 and Q3 time points for disease onset were days 15, 17 and 21, respectively. In internal validation (n = 39), daGOAT’s AUPRC peaked at 0.75 (mean; s.d. 0.28 (range, 0.45–1.00)) on day 10, surpassing all the other models (Extended Data Fig. 3a,b). AUROC values showed a similar trend to AUPRCs. daGOAT’s AUROC peaked at 0.87 (mean; s.d., 0.13 (range, 0.75–1.00)) on day 10 (Extended Data Fig. 3c and Supplementary Fig. 3). In holdout validation (n = 6), daGOAT’s AUPRC and AUROC values were both 1.00 on day 10—on par with XGBoost and outcompeting the PeriHSCT models (Extended Data Fig. 4a–c and Supplementary Fig. 3). Note, however, that XGBoost’s AUPRC and AUROC values were more stable in time than daGOAT’s in holdout validation for the pediatric cases (Extended Data Fig. 4a and Supplementary Fig. 3). For the pediatric cohort, on day 10, HRs between patients classified as high-risk (risk score top 1/6) and low-risk (risk score bottom 5/6) identified by daGOAT were 4.47 (95% CI, 1.14–17.50) and incalculable (due to the small sample size) in internal and holdout validations, respectively, in the ensuing two-week window (Extended Data Figs. 3d and 4d).
We also investigated the differential contributions of individual dynamic features to aGVHD prediction. The importance score of a feature at a given time point was calculated as the decremental change of AUPRC (based on internal validation within the training set) at that time point if the feature was entirely ignored from day 1 through day 100. Interestingly, the importance scores of many features varied smoothly with time (that is, scores at neighboring time points were similar to one another; Fig. 2a,b). We conducted an ablation experiment with daGOAT by removing its smoothing component and then testing the truncated version of the model. As expected, daGOAT without smoothing performed worse in both the adult and pediatric cohorts (Fig. 2c).
We ranked all the dynamic features according to their maximum importance scores during days 8–30 (Fig. 2a,b and Supplementary Table 5). Spearman’s rank correlation coefficients between feature importance and data density were 0.51 and 0.72 in the adult and pediatric cohorts, respectively. Although data densities were clearly influential in the rankings of the dynamic features, some top-ranked features nonetheless had lower data densities (especially in the adult cohort).
The ability of daGOAT to predict severe aGVHD depended on leveraging the multidimensionality of data (Fig. 2d,e). In the training sets, performance metrics peaked when ~20% top-ranked variables were used in the model. In holdout validations, however, it took at least 80% and 50% of the dynamic features for model performance to approach saturation in the adult and pediatric cohorts, respectively.
To explore plausible explanations for daGOAT’s advantage over benchmarks in predicting severe aGVHD, we conducted simulation experiments in which three data characteristic parameters were manipulated systematically. The first was the complexity of the underlying process. A more complex process had a higher number of effector features that each had independent association with event onset. On the other hand, when the underlying process was more simple, most observed features were dummies and made no contribution to relative risk. The second parameter was the smoothness of the underlying process. When the underlying process was ‘smooth’, each feature’s contribution to relative risk could change over time, but there were no rapid up-and-down swings. The third parameter was the data missing rate. Data missing was expected to make model fitting more difficult.
Examining the simulation results, we found that daGOAT outperformed XGBoost when most of the observed variables were associated with event onsets, when the underlying event-generating process was smooth and when there was much data missing (Extended Data Fig. 5a).
Although this study focused on HSCT, we also tested whether our proposed approach could be generalized to a remotely unrelated scenario of dynamic event forecasting using multivariate time-series. More specifically, we asked whether waist-mounted smartphone inertial sensor data could be utilized by daGOAT to anticipate a person’s postural change, that is, to predict if a sitting person was about to stand up. A publicly available smartphone inertial sensor dataset14 was downloaded from the UCI Machine Learning Repository. The dataset contained time-series data (Δt = 1.28 s) of 30 human subjects that covered 561 features. Each discrete time point was associated with a label that indicated whether the person was sitting, standing up (transitioning from sitting to standing), standing or performing another activity at that moment. There was no missing value. Although we did not have direct insights into the underlying neurobehavioral process, we postulated that the relationship between prodromic subtle motions and human postural change was probably smooth.
Four models—daGOAT, Naïve Bayes, Random Forest and XGBoost—were tested on the smartphone inertial sensor dataset. For daGOAT, a ‘+’ data segment would be akin to a ‘severe aGVHD case’, and its associated 561-feature time-series would be its ‘presymptomatic clinical data’. daGOAT’s AUPRC and AUROC peaked at 0.29 and 0.73 (Extended Data Fig. 5b–e), respectively, ≥2.56 s before the observed postural transition, outperforming Naïve Bayes, Random Forest and XGBoost.
Although we caution that our simulation experiments and smartphone data analysis were far from encompassing all possible real-world scenarios, they nonetheless served as a conduit to understanding the possible mechanisms behind daGOAT’s comparative advantage in severe aGVHD prediction.
In contrast to modeling in HSCT, machine learning research on dynamic risk monitoring based on high-density multidimensional time-series data has been particularly active in intensive care in recent years. Instead of banking on a small set of biomarkers, researchers have taken a holistic approach that considers time-series of a high number of features to forecast shock15,16 and to make artificial intelligence-based recommendations for sepsis treatment17. Homogeneous ultrahigh data density in intensive care units is nevertheless an outlier situation. As of today, ‘spotty data’ (inadequate data densities) remain the norm in most real-world healthcare settings (outside of clinical trials), and this includes HSCT. Machine learning research in HSCT is furthermore hampered by smaller sample sizes.
When most of the observed features independently contribute to relative risk, there is little benefit for a model to distinguish between true effectors and dummies. Accordingly, daGOAT does not conduct any variable selection, whereas—despite the small sample size and the time-varying nature of feature contributions to relative risk—on each day XGBoost would have to pick a new set of key variables to grow trees. The comparatively good performance of our modeling approach suggests that it is feasible to predict severe aGVHD cost-effectively when taking a panoramic and dynamic view of a patient’s clinical profile. The average total daily cost (charged to the patient) for data collection from day 1 through day 30 post-transplant to support daGOAT was 261 renminbi per day per pediatric patient and 428 renminbi per day per adult patient at the IHCAMS. Despite the large number of dynamic variables included in our model, most of the data utilized in the daGOAT algorithm are collected in routine clinical care after transplantation and thus do not incur additional cost.
For deployment in clinical settings, the daGOAT model must be integrated into the hospital information system. On any given day we have approximately 100 patients who have recently undergone HSCT and are still hospitalized at the IHCAMS; these are the patients whose dynamic clinical data need to be updated daily. Our semi-automatic data process takes less than 30 min to extract the 100 patients’ newly collected data on the latest day from the electronic health records and subsequently append the new incoming data to the data accumulated in previous days. daGOAT is fast to compute. On average, computing φi(t) for 100 consecutive days for one patient takes ~0.5 s. Model fitting is also reasonably fast. Fitting daGOAT on our adult training set, for example, took less than 1 min using a typical desktop computer. In summary, daGOAT is easy to implement, provided that the hospital information system is sufficiently ‘modern’.
Regrettably, this study was limited to data from one hematological center in China, and additional validation at other hospitals will be needed. The ultimate litmus test of our model would be testing whether we can reduce early mortality after transplantation by applying the model prospectively to administer intensified prophylactic immunosuppression to a targeted subset of allo-HSCT patients who are predicted to have high risk for developing severe aGVHD.
Methods
The aGOAT dataset
We focused on modeling severe aGVHD in HLA-mismatched allo-HSCT, because HLA mismatch is the most important factor associated with aGVHD18. It should be noted, however, that 97% of the HLA-mismatched adult and 100% of the HLA-mismatched pediatric instances were haploidentical in the final dataset (Supplementary Table 1). In the following, we describe, in detail, the process of compiling the aGOAT dataset (Supplementary Fig. 1).
Post-transplant multidimensional time-series clinical data of 598 adult patients (age >16 years) who received HLA-mismatched allo-HSCT with stem cells sourced from peripheral blood, bone marrow or both between 1 April 2012 and 30 April 2021 and 54 pediatric patients (age ≤16 years) who received HLA-mismatched allo-HSCT with stem cells sourced from peripheral blood, bone marrow or both between 1 April 2018 and 31 March 2021 at the IHCAMS were able to be electronically retrieved and curated.
The medical records for each case were reviewed by two or three physicians to confirm the aGVHD diagnosis and grading (according to the MAGIC criteria19). To avoid ambiguity, onset of aGVHD was uniformly defined as the day of initiating aGVHD treatment. After the physicians’ review, 16 cases (ten adults and six children) were eliminated due to failure of neutrophil engraftment within 30 days of transplantation. An additional seven cases (four adults and three children) were eliminated because the recorded date of neutrophil engraftment (defined as the date of the first of three consecutive measurements spanning ≥3 days of achieving a sustained peripheral blood neutrophil count of >500 × 106 l−1) did not precede the recorded onset of aGVHD.
The final dataset contained 584 adult cases and 45 pediatric cases. The adult and pediatric cohorts had substantially different baseline distributions in age, primary diseases, stem cell sources, conditioning regimens and aGVHD prophylaxis regimens (Supplementary Table 1). Because the adult cohort and the pediatric cohort were treated at different divisions of the IHCAMS, our modeling efforts on the two cohorts were two independent validations of the daGOAT algorithm.
Sixteen percent of the adult cohort and 24% of the pediatric cohort suffered from severe aGVHD within 100 days. Moreover, the severe aGVHD instances in the pediatric cohort tended to experience onset much earlier than those in the adult cohort (Fig. 1b and Supplementary Table 1). There was a substantial difference in three-year all-cause mortality between the patients with severe aGVHD and the other patients in the adult cohort (HR 3.30 (95% CI, 2.28–4.79); P < 0.001, log-rank test) (Fig. 1a), and a similar trend also appeared to exist in the pediatric cohort, although it did not reach statistical significance (HR 4.40 (95% CI, 0.58–33.47); P = 0.120, log-rank test) (Supplementary Fig. 2).
aGOAT encompassed a total of 194 dynamic variables for the adult cohort and 159 dynamic variables for the pediatric cohort, collected during the first 100 days after transplantation (Supplementary Table 2), including vital signs, daily fluid loss (due to diarrhea, vomiting and so on), complete blood counts, blood chemistry and electrolytes, peripheral blood/bone marrow immune cell profiles (measured by flow cytometry), plasma inflammatory factor levels and so on. The dynamic variables were not measured uniformly across all patients. Some dynamic variables such as vital signs were available nearly daily, whereas others such as blood immune cell profiles and plasma inflammatory factor levels were measured less frequently and not in all patients. In addition, 15 peri-transplantation variables were also included in aGOAT (Supplementary Table 4), including information related to primary disease, blood type, stem cell source, HLA mismatch, conditioning regimen before transplantation, use of antithymocyte globulin in conditioning, aGVHD prophylaxis regimen, transplantation year and so on.
Outlier values in vital signs (for example, exorbitant values for body temperature) were made blank. Whenever a dynamic variable was measured more than once (distinct samples) on one particular day for one patient, the average measurement value of that day was used for that day for that patient. We applied the ‘time-limited sample-and-hold’ approach commonly used in intensive care unit data analysis17 to augment the aGOAT dataset (holding time set to three days after sampling), based on the hypothesis that most measurements were valid for three additional days. This augmented dataset was still very sparse in multiple categories of dynamic variables (Fig. 1b and Supplementary Table 3). No other missing-data imputation procedure was conducted to address the problem of nonuniform data measurement.
Validation of the MAGIC score and the Ann Arbor score
The MAGIC score was calculated as −11.263 + 1.844(log(ST2)) + 0.577(log(REG3α)). The Ann Arbor score was calculated as −9.169 + 0.598(log(TNFR1)) − 0.028(log(REG3α)) + 0.189(log(ST2)). All the used coefficient values were identical to those reported in the original reports8,9. The original reports did not specify the units for plasma biomarker measurements, and the two scores used different bases for the logarithm (MAGIC, 10; Ann Arbor, 2). None of these, however, affected the AUROC and AUPRC calculations.
daGOAT model
We designed the daGOAT algorithm with the motivation to leverage one presumed nature of post-HSCT time-series data: the underlying biological process for aGVHD onset is multidimensional and smooth with respect to time. By explicitly taking the temporal order of data into account, daGOAT borrows strengths from neighboring time points. Even if a feature is missing a value on one particular day, the model might still learn its contribution to relative risk on that day by interpolating between neighboring time points.
Our model integrates multidimensional time-series data to calculate risk for severe aGVHD onset between t + 1 and t + δ according to
where ρ(zi) and \({\theta _k}{\left( {x_{ik}{\left( \tau \right)},\,{\tau} } \right)}\) define the contribution of all peri-transplantation features zi and the contribution of the individual dynamic variable xik(τ), respectively, to the relative risk of patient i developing severe aGVHD between t + 1 and t + δ. Iikτ = 0 when xik(τ) lacks value for the ith patient, and Iikτ = 1 otherwise. Although equation (1) does not presume the size of the time step, updating φi(t) daily was deemed sufficient for severe aGVHD prediction.
We fitted daGOAT as follows. First, ρ(zi) was set to be the log-odds ratio computed by the standard Naïve Bayes algorithm. Second, for every k and t, we computed the cutoff value ckt that maximized Shannon’s mutual information between the kth dynamic variable at time t and severe aGVHD occurrence, then we set lk and uk to be the 25th and 75th percentile values among ck1, …, ckT, respectively. This step computed the optimal cutoff values {lk, uk} to discretize the kth dynamic variable. Third, for every k and t we computed
then we computed \({\hat \rho _{1k}^{\left( {\rm{L}} \right)}{\left( t \right)}}\), \({\hat \rho _{1k}^{\left( {\rm{H}} \right)}{\left( t \right)}}\), \({\hat \rho _{0k}^{\left( {\rm{L}} \right)}{\left( t \right)}}\) and \({\hat \rho _{0k}^{\left( {\rm{H}} \right)}{\left( t \right)}}\) as ‘smoothed’ versions of \({\rho _{1kt}^{\left( {\rm{L}} \right)}}\), \({\rho _{1kt}^{\left( {\rm{H}} \right)}}\), \({\rho _{0kt}^{\left( {\rm{L}} \right)}}\) and \({\rho _{0kt}^{\left( {\rm{H}} \right)}}\), respectively, by smoothing-spline fitting (smooth with respect to t). This step computed the discretized probability distribution of the kth dynamic variable that was smooth along the time axis. Note that we did not conduct interpolations on the raw data. True distributions of feature values in normal, prodromic and diseased states are unknown, and it is unclear how to best perform interpolations on the raw data for a wide range of variables expressed in various units (for example, should we interpolate values in the original scale or in the log scale?). Instead, our model was distribution-agnostic and performed interpolation on the estimated relative risk contribution terms, that is, in the probability space.
Finally, we defined
where \({{\gamma} \ge {0}}\) is a hyperparameter that we set to be 0.1. When the model was run in the ‘no smoothing’ mode, \({\hat \rho _{1k}^{\left( {\rm{L}} \right)}{\left( t \right)}}\), \({\hat \rho _{1k}^{\left( {\rm{H}} \right)}{\left( t \right)}}\), \({\hat \rho _{0k}^{\left( {\rm{L}} \right)}{\left( t \right)}}\) and \({\hat \rho _{0k}^{\left( {\rm{H}} \right)}{\left( t \right)}}\) were not calculated, and \({\rho _{1kt}^{\left( {\rm{L}} \right)}}\), \({\rho _{1kt}^{\left( {\rm{H}} \right)}}\), \({\rho _{0kt}^{\left( {\rm{L}} \right)}}\) and \({\rho _{0kt}^{\left( {\rm{H}} \right)}}\) were used instead for calculating θk(x, t).
Data generation for simulation experiments
We generated multidimensional time-series data using a simplified version of equation (1):
where φi is the relative risk of the event of interest happening to individual i, p = 50 and T = 14. To further simplify the simulations (without hurting generalizability), we assumed each xik(t) term had only two possible values: low and high.
First, for each individual i and feature k, a smooth time-series xik(t) was generated by a random walk that meandered between low and high (transition rates: ‘low→low’ or ‘high→high’, 0.7; ‘low→high’ or ‘high→low’, 0.3).
Second, we simulated values for θk(x, t), the underlying process for event generation. When the underlying process was designated to be more complex (that is, a high percentage of observed variables were indeed correlated with event onset), we set θk(x, t) = 0 for \({{k} \in \left[ {46,\,50} \right]}\). On the other hand, when the underlying process was designated to be simple (that is, most observed variables were dummies), we set θk(x, t) = 0 for \({{k} \in \left[ {6,\,50} \right]}\). For all the other k values, when the underlying process was designated to be smooth, we first generated a smooth time-series αk(t) through a random walk with multiplicative Gaussian steps (mean = 1; s.d. = 0.05) starting from αk(1) = 1. On the other hand, when the underlying process was designated to be not smooth, αk(t) at each t was independently drawn from a uniform distribution between 0 and 1. The values of αk(t) were then linearly rescaled so that \({\mathop {{\min }}\limits_{{t} \in \left[ {1,\,T} \right]} {\alpha _k}{\left( t \right)}} = {0.3}\) and \({\mathop {{\max }}\limits_{{t} \in \left[ {1,\,T} \right]} {\alpha _k}{\left( t \right)}} = {0.7}\); then we defined \({\theta _k}{\left( {{{{\mathrm{high}}}},\,t} \right)} = {\rm{log}}{\left( {\frac{{{\alpha _k}{\left( t \right)}}}{{{1} - {\alpha _k}{\left( t \right)}}}} \right)}\) and θk(low, t) = −θk(high, t).
Third, the generated values xik(t) and θk(x, t) were then plugged into equation (2) to calculate φi. For each i, draw a random number γ from a uniform distribution between 0 and 1. If γ < φi, the event of interest happened to individual i; otherwise, the event did not happen to individual i.
Finally, when the data missing rate was designated to be high, 60% of the xik(t) terms were randomly marked as missing; otherwise, there was no data missing.
For each combination of data characteristic parameters, the simulation was repeated 150 times. Each run was conducted with n = 1,000 (80:20 random split for holdout validation) and a constant 15% percentage of positive cases within the simulated sample (a larger cohort of individuals was first generated and then randomly downsampled to the desired size).
The smartphone-based recognition of human activities and postural transitions dataset
The waist-mounted smartphone inertial sensor dataset14 was downloaded from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Smartphone-Based+Recognition+of+Human+Activities+and+Postural+Transitions). Before download, the smartphone dataset had already been randomly partitioned into two subsets, with 70% of the subjects designated as the training set and the rest as the test set. From this dataset we extracted all 8.96-s time-series segments in which a person was continuously sitting and then classified each segment according to whether the person stood up within the next 5.12 s (‘+’) or not (‘−’). The training set contained 789 data segments (‘+’, n = 92; ‘−’, n = 697), and the test set contained 292 segments (‘+’, n = 40; ‘−’, n = 252).
Reporting Summary
Further information on research design is available in the Nature Research Reporting Summary linked to this Article.
Data availability
Although there was no genetic polymorphism, gene expression or protein sequence data involved in this study, sharing of substantial clinical data generated from China’s human genetic resources needs to abide by the Regulations of the People’s Republic of China (PRC) on the Administration of Human Genetic Resources. A ‘minimum dataset’ for severe aGVHD that would be necessary to verify the research in this Article would include all dynamic features (from day 1 through day 100 post-HSCT) listed in Supplementary Table 2, all peri-transplantation features listed in Supplementary Table 4, presence/absence of severe aGVHD within 100 days, and onset dates of severe aGVHD. On 21 February 2022, the PRC Human Genetic Resources Administration Office approved the compilation of a desensitized version of the ‘minimum aGOAT dataset’ for facilitating international research collaborations (Reference No. CJ0272 (2022)). At the time of the publication of the manuscript, the authors’ application to deposit this desensitized dataset at the PRC National Genomics Data Center (NGDC) database was still under review. A mock-up dataset that can be used for demo runs of the daGOAT algorithm is available in a public Zenodo repository (https://doi.org/10.5281/zenodo.6050675)20. After publication of this work, the Chinese government has approved the archiving of the aGOAT dataset at the PRC National Genomics Data Center (NGDC) in April, 2022 (ref. no. 2022BAT1224). Accordingly, the authors have made the dataset publicly accessible at the NGDC website (https://ngdc.cncb.ac.cn/omix/release/OMIX001095/). The waist-mounted smartphone inertial sensor dataset is available from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Smartphone-Based+Recognition+of+Human+Activities+and+Postural+Transitions). Source data for Figs. 1 and 2 and Extended Data Figs. 1–5 are provided with this paper.
Code availability
The R code used in this study is available in a public GitHub repository at https://github.com/chenjunren-ihcams/daGOAT (https://doi.org/10.5281/zenodo.6041841)21.
Change history
03 May 2022
A Correction to this paper has been published: https://doi.org/10.1038/s43588-022-00256-7
References
Khoury, H. J. et al. Improved survival after acute graft-versus-host disease diagnosis in the modern era. Haematologica 102, 958–966 (2017).
Arai, Y. et al. Using a machine learning algorithm to predict acute graft-versus-host disease following allogeneic transplantation. Blood Adv. 3, 3626–3634 (2019).
Lee, C. et al. Prediction of absolute risk of acute graft-versus-host disease following hematopoietic cell transplantation. PLoS ONE 13, e0190610 (2018).
Vander Lugt, M. T. et al. ST2 as a marker for risk of therapy-resistant graft-versus-host disease and death. N. Engl. J. Med. 369, 529–539 (2013).
McDonald, G. B. et al. Plasma biomarkers of acute GVHD and nonrelapse mortality: predictive value of measurements before GVHD onset and treatment. Blood 126, 113–120 (2015).
Solan, L. et al. ST2 and REG3α as predictive biomarkers after haploidentical stem cell transplantation using post-transplantation high-dose cyclophosphamide. Front. Immunol. 10, 2338 (2019).
Matsumura, A. et al. Predictive values of early suppression of tumorigenicity 2 for acute GVHD and transplant-related complications after allogeneic stem cell transplantation: prospective observational study. Turk. J. Haematol. 37, 20–29 (2020).
Hartwell, M. J. et al. An early-biomarker algorithm predicts lethal graft-versus-host disease and survival. JCI Insight 2, e89798 (2017).
Levine, J. E. et al. A prognostic score for acute graft-versus-host disease based on biomarkers: a multicentre study. Lancet Haematol. 2, e21–e29 (2015).
Major-Monfried, H. et al. MAGIC biomarkers predict long-term outcomes for steroid-resistant acute GVHD. Blood 131, 2846–2855 (2018).
Tang, S. et al. Predicting acute graft-versus-host disease using machine learning and longitudinal vital sign data from electronic health records. JCO Clin. Cancer Inform. 4, 128–135 (2020).
Gupta, V., Braun, T. M., Chowdhury, M., Tewari, M. & Choi, S. W. A systematic review of machine learning techniques in hematopoietic stem cell transplantation (HSCT). Sensors (Basel) 20, 6100 (2020).
Chen, T. & Guestrin, C. XGBoost: a scalable tree boosting system. In Proc. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 785–794 (ACM, 2016).
Reyes-Ortiz, J.-L., Oneto, L., Samà, A., Parra, X. & Anguita, D. Transition-aware human activity recognition using smartphones. Neurocomputing 171, 754–767 (2014).
Henry, K. E., Hager, D. N., Pronovost, P. J. & Saria, S. A targeted real-time early warning score (TREWScore) for septic shock. Sci. Transl. Med. 7, 299ra122 (2015).
Hyland, S. L. et al. Early prediction of circulatory failure in the intensive care unit using machine learning. Nat. Med. 26, 364–373 (2020).
Komorowski, M., Celi, L. A., Badawi, O., Gordon, A. C. & Faisal, A. A. The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care. Nat. Med. 24, 1716–1720 (2018).
Kanda, J. Effect of HLA mismatch on acute graft-versus-host disease. Int. J. Hematol. 98, 300–308 (2013).
Schoemans, H. M. et al. EBMT-NIH-CIBMTR Task Force position statement on standardized terminology and guidance for graft-versus-host disease assessment. Bone Marrow Transplant 53, 1401–1415 (2018).
Chen, J. R. Mock up dataset for daGOAT project [Data set]. Zenodo, https://doi.org/10.5281/zenodo.6050675 (2022).
Chen, J. R., Wang, Y., Cui, M. X. & Li, L. F. daGOAT (v1.1). Zenodo, https://doi.org/10.5281/zenodo.6041841 (2022).
Acknowledgements
This work was supported in part by the State Key Laboratory of Experimental Hematology research grant no. Z20-01 (to J.R.C.) and the CAMS Innovation Fund for Medical Sciences grant no. 2020-I2M-C&T-B-089 (to Y.G.). We thank H. Zhang for assistance in determining neutrophil engraftment dates. The Article processing charge of this manuscript is paid by the Tianjin Science and Technology Plan grant no. 20ZYJDSY00010.
Author information
Authors and Affiliations
Contributions
J.C., E.J. and X. Zhu supervised the study. J.C., E.J. and X. Zhu designed the study, with contributions from Y.G. and Y.C. X.L. coordinated the study, with contributions from S.Z. and Z.S. Y.C., Y.G., M.W., L.Z., X.C., S.F., M.H., E.J. and X. Zhu contributed to data collection. X.L., Y.C., Y.G., X.G., Y.F., M.W. and W.G. compiled, reviewed and curated the dataset, with contributions from N.Z., X.S., X. Zheng and X.C. J.C. designed the algorithm, with contributions from Y.W., M.C. and L.L. J.C., Y.W. and M.C. performed the computation, with contributions from Y.F., X.G., Q.S. and X.L. J.C. wrote the first draft of the manuscript, with contributions from Y.W., X.L., M.C., X.G., Y.C., E.J. and X. Zhu. J.C., Y.W., M.C., X.G., Y.F. and X.L. revised the manuscript, with contributions from M.W., Y.C., Y.G., E.J. and X. Zhu.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Ethics declarations
This retrospective study was initiated in October 2020 and became part of a larger-scope research program (NICHE-GOAT), which was approved by the IHCAMS Clinical Research Academic Committee on 11 January 2021 (IIT2021006) and by the IHCAMS Ethics Committee on 7 February 2021 (IIT2021006-EC-1). To avoid biased healthcare or research decisions, patients who received HSCT later than 1 December 2020 were not included in this study until after 7 February 2021. All the patients included in this study signed an informed consent form that permitted their biological samples or data to be utilized for research.
Peer review
Peer review information
Nature Computational Science thanks Yasuyuki Arai, Vibhuti Gupta and Amin Turki for their contribution to the peer review of this work. Primary Handling Editor: Kaitlin McCardle, in collaboration with the Nature Computational Science team. Peer reviewer reports are available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Internal validation of daGOAT, PeriHSCT-NB, PeriHSCT-RF, and XGBoost in the adult cohort.
a, Temporal profiles of AUPRC from day 1 through day 100. (Blue, daGOAT; orange, PeriHSCT-NB; red, PeriHSCT-RF; green, XGBoost.) The bottom panel shows how the number of remaining positive cases (severe aGVHD cases that had not had onset) decreased over time in the training set. b, Precision-recall curves for the four models on day 23 (daGOAT’s peak performance day) in the training set. c, Receiver operating characteristic curves for the four models on day 23 in the training set. (The dotted line represents the null model.) d, Cumulative incidence curves for severe aGVHD within the ensuing two-week window (days 24–37) after the training-set patients were stratified on day 23. All-cause death was treated as a competing event that precluded severe aGVHD when calculating the cumulative incidence curves for severe aGVHD. ‘High-risk’ (purple), daGOAT score among top 1/6; ‘low-risk’ (cyan), daGOAT score among bottom 5/6.
Extended Data Fig. 2 Holdout validation of daGOAT, PeriHSCT-NB, PeriHSCT-RF, and XGBoost in the adult cohort.
a, Temporal profiles of AUPRC from day 1 through day 100. The bottom panel shows how the number of remaining positive cases (severe aGVHD cases that had not had onset) decreased over time in the test set. (Blue, daGOAT; orange, PeriHSCT-NB; red, PeriHSCT-RF; green, XGBoost.) b, Precision-recall curves for the four models on day 23 (daGOAT’s peak performance day for the adults according to internal validation) in the test set. c, Receiver operating characteristic curves for the four models on day 23 in the test set. (The dotted line represents the null model.) d, Cumulative incidence curves for severe aGVHD within the ensuing two-week window (days 24–37) after the test-set patients were stratified on day 23. All-cause death was treated as a competing event that precluded severe aGVHD when calculating the cumulative incidence curves for severe aGVHD. ‘High-risk’ (purple), daGOAT score among top 1/6; ‘low-risk’ (cyan), daGOAT score among bottom 5/6.
Extended Data Fig. 3 Internal validation of daGOAT, PeriHSCT-NB, PeriHSCT-RF, and XGBoost in the pediatric cohort.
a, Temporal profiles of AUPRC from day 1 through day 100. The bottom panel shows how the number of remaining positive cases (severe aGVHD cases that had not had onset) decreased over time in the training set. (Blue, daGOAT; orange, PeriHSCT-NB; red, PeriHSCT-RF; green, XGBoost.) b, Precision-recall curves for the four models on day 10 (daGOAT’s peak performance day) in the training set. c, Receiver operating characteristic curves for the four models on day 10 in the training set. (The dotted line represents the null model.) d, Cumulative incidence curves for severe aGVHD within the ensuing two-week window (days 11–24) after the training-set patients were stratified on day 10. All-cause death was treated as a competing event that precluded severe aGVHD when calculating the cumulative incidence curves for severe aGVHD. ‘High-risk’ (purple), daGOAT score among top 1/6; ‘low-risk’ (cyan), daGOAT score among bottom 5/6.
Extended Data Fig. 4 Holdout validation of daGOAT, PeriHSCT-NB, PeriHSCT-RF, and XGBoost in the pediatric cohort.
a, Temporal profiles of AUPRC from day 1 through day 100. The bottom panel shows how the number of remaining positive cases (severe aGVHD cases that had not had onset) decreased over time in the test set. (Blue, daGOAT; orange, PeriHSCT-NB; red, PeriHSCT-RF; green, XGBoost.) b, Precision-recall curves for the four models on day 10 (daGOAT’s peak performance day for the pediatric cases according to internal validation) in the test set. c, Receiver operating characteristic curves for the four models on day 10 in the test set. (The dotted line represents the null model.) d, Cumulative incidence curves for severe aGVHD within the ensuing two-week window (days 11–24) after the test-set patients were stratified on day 10. All-cause death was treated as a competing event that precluded severe aGVHD when calculating the cumulative incidence curves for severe aGVHD. ‘High-risk’ (purple), daGOAT score among top 1/6; ‘low-risk’ (cyan), daGOAT score among bottom 5/6.
Extended Data Fig. 5 Extension experiments on the daGOAT algorithm.
a, Performance of daGOAT and XGBoost in short time-series data simulation experiments under various scenarios of data characteristics. The length of the simulated time-series was uniformly T = 14. Mean values and standard errors of ΔAUROCs and ΔAUPRCs across the 14 time points are shown here, along with the raw values at the 14 time points overlaid as dots. (Purple, ΔAUPRC; cyan, ΔAUROC.) b − e, Performance of daGOAT, XGBoost, Naïve Bayes, and Random Forest on the UCI Machine Learning Repository smartphone inertial sensor data. Temporal profiles for AUROC (b) and AUPRC (d) show that daGOAT outperformed the other models from −8.0 to −2.5 s before postural transition. Receiver operating characteristic curves and precision-recall curves at −2.56 s (c,e) are also shown here to compare the models in better detail; dotted lines represent null models. (Blue, daGOAT; green, XGBoost; orange, Naïve Bayes; red, Random Forest.)
Supplementary information
Source data
Source Data Fig. 1
Statistical source data.
Source Data Fig. 2
Statistical source data.
Source Data Extended Data Fig. 1
Statistical source data.
Source Data Extended Data Fig. 2
Statistical source data.
Source Data Extended Data Fig. 3
Statistical source data.
Source Data Extended Data Fig. 4
Statistical source data.
Source Data Extended Data Fig. 5
Statistical source data.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Liu, X., Cao, Y., Guo, Y. et al. Dynamic forecasting of severe acute graft-versus-host disease after transplantation. Nat Comput Sci 2, 153–159 (2022). https://doi.org/10.1038/s43588-022-00213-4
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1038/s43588-022-00213-4
This article is cited by
-
Data-driven grading of acute graft-versus-host disease
Nature Communications (2023)
-
Post-transplant dynamic risk prediction
Nature Computational Science (2022)