Main

To this day, severe acute graft-versus-host disease (grade III–IV aGVHD) remains a leading cause of death after allogeneic hematopoietic stem cell transplantation (allo-HSCT)—a last-resort treatment for many blood diseases—with a transplant-related mortality rate as high as ~30% within 100 days1.

Previous algorithms for forecasting severe aGVHD were usually based on peri-transplantation features (including recipient, donor and transplantation procedural parameters) or ‘landmark’ biomarker analysis (designating a specific time point, post-transplant, for plasma biomarker analysis). Area under the receiver operator characteristic curve (AUROC) scores of models using only peri-transplantation features were reported to be ~0.62, even when data from more than 20,000 patients were available2,3. For landmark analysis, it has been reported in at least some study cohorts that suppression of tumorigenicity 2 (ST2)—by far the most promising biomarker for predicting treatment response ‘after’ aGVHD onset4,5—had either no substantial association with aGVHD6 or a low AUROC (0.56) for forecasting severe aGVHD5 if measured at days 11 to 17 post-transplant (that is, ‘before’ aGVHD onset). One study of a Japanese cohort, however, did show association (AUROC 0.66) between grade II–IV aGVHD and ST2 measured on day 147. A two-biomarker model using ST2 and regenerating islet-derived 3-alpha (Reg3α) measured on day 7 post-transplant—the MAGIC score—was shown to predict six-month non-relapse mortality (NRM; AUROC 0.68)8. Forecasting NRM, however, was not equivalent to forecasting severe aGVHD. Using data reported by Hartwell et al.8 (their table S6), one could calculate that the proportion of NRM cases attributed to aGVHD was statistically indistinguishable between MAGIC score-stratified ‘high-risk’ and ‘low-risk’ groups (64% (41/64) versus 57% (47/83); P = 0.458, χ2 test). In contrast, if computed ‘at’ or ‘after’ aGVHD onset, multi-biomarker scores were shown to be efficacious in predicting treatment response and long-term outcome8,9,10.

Few studies have used time-series of all available patient information for forecasting severe aGVHD. One recent study applied penalized logistic regression to vital signs (body temperature, heart rate and so on) that were consistently recorded within the first 10 days after HSCT and achieved an AUROC value of 0.66 for forecasting grade II–IV aGVHD11. Modeling in HSCT—unlike in more common medical situations—is challenged by the small sample size (n), high feature number (p) and nonuniform data sampling12. This study thus aims to utilize evidence from all available dynamic variables to forecast severe aGVHD better.

We compiled and curated the post-transplant multidimensional time-series data of patients treated with human leukocyte antigen (HLA)-mismatched allo-HSCT using stem cells derived from peripheral blood, bone marrow or both at the Institute of Hematology, Chinese Academy of Medical Sciences (IHCAMS) between 2012 and 2021 (hereafter referred to as the ‘aGOAT’ (aGVHD Onset Anticipation Tianjin) dataset).

aGOAT contained data from 584 adult and 45 pediatric cases, and 16% of the adult cohort and 24% of the pediatric cohort suffered from severe aGVHD (Supplementary Table 1). There was a substantial difference in overall survival between the severe aGVHD cases and the other patients in the adult cohort (Fig. 1a). aGOAT encompassed a total of 194 dynamic variables for the adult cohort and 159 dynamic variables for the pediatric cohort (Supplementary Table 2). The dynamic variables were not measured uniformly across all the patients (Fig. 1b and Supplementary Table 3). Fifteen peri-transplantation variables were also included in aGOAT (Supplementary Table 4).

Fig. 1: Characteristics of the aGOAT dataset.
figure 1

a, Overall survival of the adult cohort in the aGOAT dataset. The shown curves are Kaplan–Meier estimators. Red, severe aGVHD cases; black, the other cases. The survival curves for the pediatric cohort are provided in Supplementary Fig. 2. b, Data densities and severe aGVHD onset distributions in the aGOAT dataset. Top: data densities of dynamic variables after transplantation in the adult and pediatric cohorts in the aGOAT dataset (after ‘time-limited sample-and-hold’ data imputations). CBC, complete blood count. Bottom: temporal distributions of severe aGVHD onset. Brighter colors in the heat maps indicate higher densities. The numerical values of missing-data rates for individual dynamic variables are provided in Supplementary Table 3. c, Validation of the two-biomarker MAGIC score and three-biomarker Ann Arbor score. Shown here are AUROCs and AUPRCs that were calculated based on the 67 adult patients in the aGOAT dataset who had plasma biomarker values collected at days 6–8. Error bars are 95% CIs based on bootstrapping (1,000 bootstrap samples of the original 67 patients). Solid bars, AUROCs; striped bars, AUPRCs; gray, predicting six-month NRM; pink, predicting severe aGVHD within 100 days; light blue, predicting grade II–IV aGVHD within 100 days.

Source data

We have devised a dynamic probabilistic model—‘daGOAT’ (dynamic aGVHD Onset Anticipation Tianjin)—that integrates multidimensional time-series data to calculate the risk for severe aGVHD. Our model updates the risk score φi(t) for developing severe aGVHD between t + 1 and t + δ according to

$${{\varphi _i}{\left( t \right)}} = {\rho {\left( \bf{{z}_{i}} \right)}} + {\mathop {\sum }\limits_{\tau \in \left[ {{t} - {\delta} + {1},\,{t}} \right]} {\mathop {\sum }\limits_k} {\left( {{I_{ik\tau }} {\theta _k}\left( {{x_{ik}}{\left( \tau \right)},\,{\tau }} \right)} \right)}},$$
(1)

where ρ(zi) and θk(xik(τ), τ) define the contribution of all peri-transplantation features zi and the contribution of the individual dynamic variable xik(τ), respectively, to the relative risk of the ith patient developing severe aGVHD between t + 1 and t + δ.

The motivation behind equation (1) was to first highlight short-stretch temporal patterns that were suggestive of prodromes for severe aGVHD, with δ being the length of the moving time window (we set δ = 14). The higher the φi(t), the likelier there was a prodrome between τ=tδ + 1 and τ=t. Considering the evolving nature of immune reconstitution after HSCT, we assumed the development of severe aGVHD was not a stationary process and defined \({\theta _k}{\left( \cdot \right)}\) to be a function of time. In other words, the same value for the same clinical feature might have different implications at different times after transplantation. The final risk score on each day was highly dependent on all available clinical information and computed scores in previous days, so, in this sense, daGOAT is an adaptive algorithm.

We compared daGOAT to two landmark-specific plasma biomarker-based models (the two-biomarker MAGIC score8 and the three-biomarker Ann Arbor score (based on ST2, Reg3α and tumor necrosis factor receptor 1, TNFR1)9), two peri-transplantation features-based models (fitted using Naïve Bayes (‘PeriHSCT-NB’) or Random Forest (‘PeriHSCT-RF’)) and XGBoost (a gradient-boosting tree algorithm that permits data missing)13.

In the adult cohort, plasma biomarker data at days 6–8 were available for 67 patients. The MAGIC score achieved AUROCs of 0.86 and 0.49 for six-month NRM and severe aGVHD, respectively, and the Ann Arbor score achieved AUROCs of 0.59 and 0.50 for six-month NRM and severe aGVHD, respectively (Fig. 1c). The pediatric cohort’s biomarker data were too small in size to test the biomarker models.

To evaluate daGOAT, PeriHSCT-NB, PeriHSCT-RF and XGBoost, the patients who received HSCT before (excluding) 1 December 2020 were designated as training sets and the rest of the patients as test sets. Internal validation within the training sets (a series of temporal splitting within the training sets) and holdout validation on the test sets were then conducted.

In the adult cohort, the Q1 (25th percentile), Q2 (median) and Q3 (75th percentile) time points for disease onset were days 24, 29 and 39, respectively. In internal validation (n = 519), daGOAT’s area under the precision-recall curve (AUPRC) peaked at 0.42 (mean; s.d., 0.39 (range, 0.06–1.00)) on day 23, surpassing all the other models (Extended Data Fig. 1a,b). It is worthwhile noting, however, that the PeriHSCT models outcompeted daGOAT during the first two weeks after transplantation. AUROC values showed a trend similar to AUPRC values. daGOAT’s AUROC peaked at 0.78 (mean; s.d., 0.17 (range, 0.58–1.00)) on day 23 (Extended Data Fig. 1c and Supplementary Fig. 3). In holdout validation (n = 65), daGOAT’s AUPRC and AUROC peaked at 0.82 and 0.94, respectively, on day 23, outcompeting all the other models (Extended Data Fig. 2a–c and Supplementary Fig. 3). Again, the PeriHSCT models surpassed daGOAT during the first two weeks (Extended Data Fig. 2a and Supplementary Fig. 3). For the adult cohort, on day 23, hazard ratios (HRs) between high-risk (risk score top 1/6) and low-risk (risk score bottom 5/6) patients according to daGOAT were 2.07 (95% confidence interval (CI), 1.03–4.15) and 18.40 (95% CI, 3.95–85.60) in internal and holdout validations, respectively, in the ensuing two-week window (Extended Data Figs. 1d and 2d).

In the pediatric cohort, the Q1, Q2 and Q3 time points for disease onset were days 15, 17 and 21, respectively. In internal validation (n = 39), daGOAT’s AUPRC peaked at 0.75 (mean; s.d. 0.28 (range, 0.45–1.00)) on day 10, surpassing all the other models (Extended Data Fig. 3a,b). AUROC values showed a similar trend to AUPRCs. daGOAT’s AUROC peaked at 0.87 (mean; s.d., 0.13 (range, 0.75–1.00)) on day 10 (Extended Data Fig. 3c and Supplementary Fig. 3). In holdout validation (n = 6), daGOAT’s AUPRC and AUROC values were both 1.00 on day 10—on par with XGBoost and outcompeting the PeriHSCT models (Extended Data Fig. 4a–c and Supplementary Fig. 3). Note, however, that XGBoost’s AUPRC and AUROC values were more stable in time than daGOAT’s in holdout validation for the pediatric cases (Extended Data Fig. 4a and Supplementary Fig. 3). For the pediatric cohort, on day 10, HRs between patients classified as high-risk (risk score top 1/6) and low-risk (risk score bottom 5/6) identified by daGOAT were 4.47 (95% CI, 1.14–17.50) and incalculable (due to the small sample size) in internal and holdout validations, respectively, in the ensuing two-week window (Extended Data Figs. 3d and 4d).

We also investigated the differential contributions of individual dynamic features to aGVHD prediction. The importance score of a feature at a given time point was calculated as the decremental change of AUPRC (based on internal validation within the training set) at that time point if the feature was entirely ignored from day 1 through day 100. Interestingly, the importance scores of many features varied smoothly with time (that is, scores at neighboring time points were similar to one another; Fig. 2a,b). We conducted an ablation experiment with daGOAT by removing its smoothing component and then testing the truncated version of the model. As expected, daGOAT without smoothing performed worse in both the adult and pediatric cohorts (Fig. 2c).

Fig. 2: Characteristics of the daGOAT model.
figure 2

a,b, Temporal profiles of the importance scores for all dynamic features in the adult (a) and pediatric (b) cohorts. CV, coefficient of variation; DC, dendritic cell; RBC, red blood cell. The importance scores (theoretical range: −1 to 1) were calculated based on internal validation within the training sets. Red, positive importance score; blue, negative importance score. The features were ranked according to their maximum importance scores during days 8–30. The full rankings of all the features are available in Supplementary Table 5. The average data density of each dynamic feature was calculated by dividing its total data volume (that is, the total number of available values after ‘time-limited sample-and-hold’ data imputations) by the total number of patients and by the total number of days (30 days). Right panels: orange bars, maximum importance score during days 8–30; black lines, average data density during days 1–30. c, Ablation experiment with daGOAT. Removing the smoothing component of the algorithm hurt daGOAT’s performance at the peak performance days (adults, day 23; children, day 10). There was only one run (comparing ‘without smoothing’ and ‘with smoothing’) in each scenario, and the standard error was incalculable. Striped bars, AUPRCs; solid bars, AUROCs; brown, internal validation; green, holdout validation. d,e, Relationships between the number of top-ranked features included in the model and the model’s performance metrics (AUPRC and AUROC) on the peak performance days (adults, day 23; children, day 10) in internal and holdout validations for the adult (d) and pediatric (e) cohorts. Crosses, AUPRCs; circles, AUROCs; brown, internal validation; green, holdout validation.

Source data

We ranked all the dynamic features according to their maximum importance scores during days 8–30 (Fig. 2a,b and Supplementary Table 5). Spearman’s rank correlation coefficients between feature importance and data density were 0.51 and 0.72 in the adult and pediatric cohorts, respectively. Although data densities were clearly influential in the rankings of the dynamic features, some top-ranked features nonetheless had lower data densities (especially in the adult cohort).

The ability of daGOAT to predict severe aGVHD depended on leveraging the multidimensionality of data (Fig. 2d,e). In the training sets, performance metrics peaked when ~20% top-ranked variables were used in the model. In holdout validations, however, it took at least 80% and 50% of the dynamic features for model performance to approach saturation in the adult and pediatric cohorts, respectively.

To explore plausible explanations for daGOAT’s advantage over benchmarks in predicting severe aGVHD, we conducted simulation experiments in which three data characteristic parameters were manipulated systematically. The first was the complexity of the underlying process. A more complex process had a higher number of effector features that each had independent association with event onset. On the other hand, when the underlying process was more simple, most observed features were dummies and made no contribution to relative risk. The second parameter was the smoothness of the underlying process. When the underlying process was ‘smooth’, each feature’s contribution to relative risk could change over time, but there were no rapid up-and-down swings. The third parameter was the data missing rate. Data missing was expected to make model fitting more difficult.

Examining the simulation results, we found that daGOAT outperformed XGBoost when most of the observed variables were associated with event onsets, when the underlying event-generating process was smooth and when there was much data missing (Extended Data Fig. 5a).

Although this study focused on HSCT, we also tested whether our proposed approach could be generalized to a remotely unrelated scenario of dynamic event forecasting using multivariate time-series. More specifically, we asked whether waist-mounted smartphone inertial sensor data could be utilized by daGOAT to anticipate a person’s postural change, that is, to predict if a sitting person was about to stand up. A publicly available smartphone inertial sensor dataset14 was downloaded from the UCI Machine Learning Repository. The dataset contained time-series data (Δt = 1.28 s) of 30 human subjects that covered 561 features. Each discrete time point was associated with a label that indicated whether the person was sitting, standing up (transitioning from sitting to standing), standing or performing another activity at that moment. There was no missing value. Although we did not have direct insights into the underlying neurobehavioral process, we postulated that the relationship between prodromic subtle motions and human postural change was probably smooth.

Four models—daGOAT, Naïve Bayes, Random Forest and XGBoost—were tested on the smartphone inertial sensor dataset. For daGOAT, a ‘+’ data segment would be akin to a ‘severe aGVHD case’, and its associated 561-feature time-series would be its ‘presymptomatic clinical data’. daGOAT’s AUPRC and AUROC peaked at 0.29 and 0.73 (Extended Data Fig. 5b–e), respectively, ≥2.56 s before the observed postural transition, outperforming Naïve Bayes, Random Forest and XGBoost.

Although we caution that our simulation experiments and smartphone data analysis were far from encompassing all possible real-world scenarios, they nonetheless served as a conduit to understanding the possible mechanisms behind daGOAT’s comparative advantage in severe aGVHD prediction.

In contrast to modeling in HSCT, machine learning research on dynamic risk monitoring based on high-density multidimensional time-series data has been particularly active in intensive care in recent years. Instead of banking on a small set of biomarkers, researchers have taken a holistic approach that considers time-series of a high number of features to forecast shock15,16 and to make artificial intelligence-based recommendations for sepsis treatment17. Homogeneous ultrahigh data density in intensive care units is nevertheless an outlier situation. As of today, ‘spotty data’ (inadequate data densities) remain the norm in most real-world healthcare settings (outside of clinical trials), and this includes HSCT. Machine learning research in HSCT is furthermore hampered by smaller sample sizes.

When most of the observed features independently contribute to relative risk, there is little benefit for a model to distinguish between true effectors and dummies. Accordingly, daGOAT does not conduct any variable selection, whereas—despite the small sample size and the time-varying nature of feature contributions to relative risk—on each day XGBoost would have to pick a new set of key variables to grow trees. The comparatively good performance of our modeling approach suggests that it is feasible to predict severe aGVHD cost-effectively when taking a panoramic and dynamic view of a patient’s clinical profile. The average total daily cost (charged to the patient) for data collection from day 1 through day 30 post-transplant to support daGOAT was 261 renminbi per day per pediatric patient and 428 renminbi per day per adult patient at the IHCAMS. Despite the large number of dynamic variables included in our model, most of the data utilized in the daGOAT algorithm are collected in routine clinical care after transplantation and thus do not incur additional cost.

For deployment in clinical settings, the daGOAT model must be integrated into the hospital information system. On any given day we have approximately 100 patients who have recently undergone HSCT and are still hospitalized at the IHCAMS; these are the patients whose dynamic clinical data need to be updated daily. Our semi-automatic data process takes less than 30 min to extract the 100 patients’ newly collected data on the latest day from the electronic health records and subsequently append the new incoming data to the data accumulated in previous days. daGOAT is fast to compute. On average, computing φi(t) for 100 consecutive days for one patient takes ~0.5 s. Model fitting is also reasonably fast. Fitting daGOAT on our adult training set, for example, took less than 1 min using a typical desktop computer. In summary, daGOAT is easy to implement, provided that the hospital information system is sufficiently ‘modern’.

Regrettably, this study was limited to data from one hematological center in China, and additional validation at other hospitals will be needed. The ultimate litmus test of our model would be testing whether we can reduce early mortality after transplantation by applying the model prospectively to administer intensified prophylactic immunosuppression to a targeted subset of allo-HSCT patients who are predicted to have high risk for developing severe aGVHD.

Methods

The aGOAT dataset

We focused on modeling severe aGVHD in HLA-mismatched allo-HSCT, because HLA mismatch is the most important factor associated with aGVHD18. It should be noted, however, that 97% of the HLA-mismatched adult and 100% of the HLA-mismatched pediatric instances were haploidentical in the final dataset (Supplementary Table 1). In the following, we describe, in detail, the process of compiling the aGOAT dataset (Supplementary Fig. 1).

Post-transplant multidimensional time-series clinical data of 598 adult patients (age >16 years) who received HLA-mismatched allo-HSCT with stem cells sourced from peripheral blood, bone marrow or both between 1 April 2012 and 30 April 2021 and 54 pediatric patients (age ≤16 years) who received HLA-mismatched allo-HSCT with stem cells sourced from peripheral blood, bone marrow or both between 1 April 2018 and 31 March 2021 at the IHCAMS were able to be electronically retrieved and curated.

The medical records for each case were reviewed by two or three physicians to confirm the aGVHD diagnosis and grading (according to the MAGIC criteria19). To avoid ambiguity, onset of aGVHD was uniformly defined as the day of initiating aGVHD treatment. After the physicians’ review, 16 cases (ten adults and six children) were eliminated due to failure of neutrophil engraftment within 30 days of transplantation. An additional seven cases (four adults and three children) were eliminated because the recorded date of neutrophil engraftment (defined as the date of the first of three consecutive measurements spanning ≥3 days of achieving a sustained peripheral blood neutrophil count of >500 × 106 l−1) did not precede the recorded onset of aGVHD.

The final dataset contained 584 adult cases and 45 pediatric cases. The adult and pediatric cohorts had substantially different baseline distributions in age, primary diseases, stem cell sources, conditioning regimens and aGVHD prophylaxis regimens (Supplementary Table 1). Because the adult cohort and the pediatric cohort were treated at different divisions of the IHCAMS, our modeling efforts on the two cohorts were two independent validations of the daGOAT algorithm.

Sixteen percent of the adult cohort and 24% of the pediatric cohort suffered from severe aGVHD within 100 days. Moreover, the severe aGVHD instances in the pediatric cohort tended to experience onset much earlier than those in the adult cohort (Fig. 1b and Supplementary Table 1). There was a substantial difference in three-year all-cause mortality between the patients with severe aGVHD and the other patients in the adult cohort (HR 3.30 (95% CI, 2.28–4.79); P < 0.001, log-rank test) (Fig. 1a), and a similar trend also appeared to exist in the pediatric cohort, although it did not reach statistical significance (HR 4.40 (95% CI, 0.58–33.47); P = 0.120, log-rank test) (Supplementary Fig. 2).

aGOAT encompassed a total of 194 dynamic variables for the adult cohort and 159 dynamic variables for the pediatric cohort, collected during the first 100 days after transplantation (Supplementary Table 2), including vital signs, daily fluid loss (due to diarrhea, vomiting and so on), complete blood counts, blood chemistry and electrolytes, peripheral blood/bone marrow immune cell profiles (measured by flow cytometry), plasma inflammatory factor levels and so on. The dynamic variables were not measured uniformly across all patients. Some dynamic variables such as vital signs were available nearly daily, whereas others such as blood immune cell profiles and plasma inflammatory factor levels were measured less frequently and not in all patients. In addition, 15 peri-transplantation variables were also included in aGOAT (Supplementary Table 4), including information related to primary disease, blood type, stem cell source, HLA mismatch, conditioning regimen before transplantation, use of antithymocyte globulin in conditioning, aGVHD prophylaxis regimen, transplantation year and so on.

Outlier values in vital signs (for example, exorbitant values for body temperature) were made blank. Whenever a dynamic variable was measured more than once (distinct samples) on one particular day for one patient, the average measurement value of that day was used for that day for that patient. We applied the ‘time-limited sample-and-hold’ approach commonly used in intensive care unit data analysis17 to augment the aGOAT dataset (holding time set to three days after sampling), based on the hypothesis that most measurements were valid for three additional days. This augmented dataset was still very sparse in multiple categories of dynamic variables (Fig. 1b and Supplementary Table 3). No other missing-data imputation procedure was conducted to address the problem of nonuniform data measurement.

Validation of the MAGIC score and the Ann Arbor score

The MAGIC score was calculated as −11.263 + 1.844(log(ST2)) + 0.577(log(REG3α)). The Ann Arbor score was calculated as −9.169 + 0.598(log(TNFR1)) − 0.028(log(REG3α)) + 0.189(log(ST2)). All the used coefficient values were identical to those reported in the original reports8,9. The original reports did not specify the units for plasma biomarker measurements, and the two scores used different bases for the logarithm (MAGIC, 10; Ann Arbor, 2). None of these, however, affected the AUROC and AUPRC calculations.

daGOAT model

We designed the daGOAT algorithm with the motivation to leverage one presumed nature of post-HSCT time-series data: the underlying biological process for aGVHD onset is multidimensional and smooth with respect to time. By explicitly taking the temporal order of data into account, daGOAT borrows strengths from neighboring time points. Even if a feature is missing a value on one particular day, the model might still learn its contribution to relative risk on that day by interpolating between neighboring time points.

Our model integrates multidimensional time-series data to calculate risk for severe aGVHD onset between t + 1 and t + δ according to

$${\varphi _i}{\left( t \right)} = {\rho }{\left( \bf{{z}_{i}} \right)} + {\mathop {\sum }\limits_{\tau \in \left[ {{t} - {\delta} + {1},\,{t}} \right]}} {{\mathop {\sum }\limits_k} \left( {{I_{ik\tau }} {\theta _k}{\left( {x_{ik}{\left( \tau \right)},\,{\tau} } \right)}} \right)},$$

where ρ(zi) and \({\theta _k}{\left( {x_{ik}{\left( \tau \right)},\,{\tau} } \right)}\) define the contribution of all peri-transplantation features zi and the contribution of the individual dynamic variable xik(τ), respectively, to the relative risk of patient i developing severe aGVHD between t + 1 and t + δ. Iikτ = 0 when xik(τ) lacks value for the ith patient, and Iikτ = 1 otherwise. Although equation (1) does not presume the size of the time step, updating φi(t) daily was deemed sufficient for severe aGVHD prediction.

We fitted daGOAT as follows. First, ρ(zi) was set to be the log-odds ratio computed by the standard Naïve Bayes algorithm. Second, for every k and t, we computed the cutoff value ckt that maximized Shannon’s mutual information between the kth dynamic variable at time t and severe aGVHD occurrence, then we set lk and uk to be the 25th and 75th percentile values among ck1, …, ckT, respectively. This step computed the optimal cutoff values {lk, uk} to discretize the kth dynamic variable. Third, for every k and t we computed

$${\rho }_{1kt}^{\left( {\rm{L}} \right)} = {P}{\left( {{x_k}{\left( t \right)} < {l_k}|{{{\mathrm{Severe}}}}\,{{{\mathrm{aGVHD}}}}\,{{{\mathrm{onset}}}}\,{{{\mathrm{between}}}}\,{t + 1}\,{{{\mathrm{and}}}}\,{t + \delta }} \right)},$$
$${\rho }_{1kt}^{\left( {\rm{H}} \right)} = {P}{\left( {{x_k}{\left( t \right)} > {u_k}|{{{\mathrm{Severe}}}}\,{{{\mathrm{aGVHD}}}}\,{{{\mathrm{onset}}}}\,{{{\mathrm{between}}}}\,{t + 1}\,{{{\mathrm{and}}}}\,{t + \delta }} \right)},$$
$${\rho }_{0kt}^{\left( {\rm{L}} \right)} = {P}{\left( {{x_k}{\left( t \right)} < {l_k}|{{{\mathrm{No}}}}\,{{{\mathrm{severe}}}}\,{{{\mathrm{aGVHD}}}}\,{{{\mathrm{onset}}}}\,{{{\mathrm{between}}}}\,{t + 1}\,{{{\mathrm{and}}}}\,{t + \delta }} \right)},\,{{{\mathrm{and}}}}$$
$${\rho }_{0kt}^{\left( {\rm{H}} \right)} = {P}{\left( {{x_k}{\left( t \right)} > {u_k}|{{{\mathrm{No}}}}\,{{{\mathrm{severe}}}}\,{{{\mathrm{aGVHD}}}}\,{{{\mathrm{onset}}}}\,{{{\mathrm{between}}}}\,{t + 1}\,{{{\mathrm{and}}}}\,{t + \delta} } \right)},$$

then we computed \({\hat \rho _{1k}^{\left( {\rm{L}} \right)}{\left( t \right)}}\), \({\hat \rho _{1k}^{\left( {\rm{H}} \right)}{\left( t \right)}}\), \({\hat \rho _{0k}^{\left( {\rm{L}} \right)}{\left( t \right)}}\) and \({\hat \rho _{0k}^{\left( {\rm{H}} \right)}{\left( t \right)}}\) as ‘smoothed’ versions of \({\rho _{1kt}^{\left( {\rm{L}} \right)}}\), \({\rho _{1kt}^{\left( {\rm{H}} \right)}}\), \({\rho _{0kt}^{\left( {\rm{L}} \right)}}\) and \({\rho _{0kt}^{\left( {\rm{H}} \right)}}\), respectively, by smoothing-spline fitting (smooth with respect to t). This step computed the discretized probability distribution of the kth dynamic variable that was smooth along the time axis. Note that we did not conduct interpolations on the raw data. True distributions of feature values in normal, prodromic and diseased states are unknown, and it is unclear how to best perform interpolations on the raw data for a wide range of variables expressed in various units (for example, should we interpolate values in the original scale or in the log scale?). Instead, our model was distribution-agnostic and performed interpolation on the estimated relative risk contribution terms, that is, in the probability space.

Finally, we defined

$${\theta _k}{\left( {x,\,t} \right)} = \left\{ {\begin{array}{*{20}{l}} {\log \left( {\frac{{{{{\mathrm{max}}}}\left\{ {{{{\mathrm{0,}}}}\,\hat \rho _{1k}^{\left( {\rm{L}} \right)}\left( t \right)} \right\} + \gamma }}{{{{{\mathrm{max}}}}\left\{ {{{{\mathrm{0,}}}}\,\hat \rho _{0k}^{\left( {\rm{L}} \right)}{\left( t \right)}} \right\} + \gamma }}} \right)} \hfill & {{{{\mathrm{if}}}}\,{x} < {l_k}} \hfill \\ {\log \left( {\frac{{{{{\mathrm{max}}}}\left\{ {{{{\mathrm{0,}}}}\,{1} - \hat \rho _{1k}^{\left( {\rm{L}} \right)}{\left( t \right)} - \hat \rho _{1k}^{\left( {\rm{H}} \right)}{\left( t \right)}} \right\} + \gamma }}{{{{{\mathrm{max}}}}\left\{ {{{{\mathrm{0,}}}}\,{1} - \hat \rho _{0k}^{\left( {\rm{L}} \right)}{\left( t \right)} - \hat \rho _{0k}^{\left( {\rm{H}} \right)}{\left( t \right)}} \right\} + \gamma }}} \right)} \hfill & {{{{\mathrm{if}}}}\,{l_k} \le {x} \le {u_k}} \hfill \\ {\log \left( {\frac{{{{{\mathrm{max}}}}\left\{ {{{{\mathrm{0,}}}}\,\hat \rho _{1k}^{\left( {\rm{H}} \right)}{\left( t \right)}} \right\} + \gamma }}{{{{{\mathrm{max}}}}\left\{ {{{{\mathrm{0,}}}}\,\hat \rho _{0k}^{\left( {\rm{H}} \right)}{\left( t \right)}} \right\} + \gamma }}} \right)} \hfill & {{{{\mathrm{if}}}}\,{u_k}{{{\mathrm{ < }}}}x} \hfill \end{array},} \right.$$

where \({{\gamma} \ge {0}}\) is a hyperparameter that we set to be 0.1. When the model was run in the ‘no smoothing’ mode, \({\hat \rho _{1k}^{\left( {\rm{L}} \right)}{\left( t \right)}}\), \({\hat \rho _{1k}^{\left( {\rm{H}} \right)}{\left( t \right)}}\), \({\hat \rho _{0k}^{\left( {\rm{L}} \right)}{\left( t \right)}}\) and \({\hat \rho _{0k}^{\left( {\rm{H}} \right)}{\left( t \right)}}\) were not calculated, and \({\rho _{1kt}^{\left( {\rm{L}} \right)}}\), \({\rho _{1kt}^{\left( {\rm{H}} \right)}}\), \({\rho _{0kt}^{\left( {\rm{L}} \right)}}\) and \({\rho _{0kt}^{\left( {\rm{H}} \right)}}\) were used instead for calculating θk(x, t).

Data generation for simulation experiments

We generated multidimensional time-series data using a simplified version of equation (1):

$${\varphi _i} = {\mathop {\sum }\limits_{{t} \in \left[ {1,\,T} \right]}}{ \mathop {\sum }\limits_{{k} \in \left[ {1,\,p} \right]}} {\theta _k}{\left( {{x_{ik}}{\left( t \right)},\,t} \right)}$$
(2)

where φi is the relative risk of the event of interest happening to individual i, p = 50 and T = 14. To further simplify the simulations (without hurting generalizability), we assumed each xik(t) term had only two possible values: low and high.

First, for each individual i and feature k, a smooth time-series xik(t) was generated by a random walk that meandered between low and high (transition rates: ‘low→low’ or ‘high→high’, 0.7; ‘low→high’ or ‘high→low’, 0.3).

Second, we simulated values for θk(x, t), the underlying process for event generation. When the underlying process was designated to be more complex (that is, a high percentage of observed variables were indeed correlated with event onset), we set θk(x, t) = 0 for \({{k} \in \left[ {46,\,50} \right]}\). On the other hand, when the underlying process was designated to be simple (that is, most observed variables were dummies), we set θk(x, t) = 0 for \({{k} \in \left[ {6,\,50} \right]}\). For all the other k values, when the underlying process was designated to be smooth, we first generated a smooth time-series αk(t) through a random walk with multiplicative Gaussian steps (mean = 1; s.d. = 0.05) starting from αk(1) = 1. On the other hand, when the underlying process was designated to be not smooth, αk(t) at each t was independently drawn from a uniform distribution between 0 and 1. The values of αk(t) were then linearly rescaled so that \({\mathop {{\min }}\limits_{{t} \in \left[ {1,\,T} \right]} {\alpha _k}{\left( t \right)}} = {0.3}\) and \({\mathop {{\max }}\limits_{{t} \in \left[ {1,\,T} \right]} {\alpha _k}{\left( t \right)}} = {0.7}\); then we defined \({\theta _k}{\left( {{{{\mathrm{high}}}},\,t} \right)} = {\rm{log}}{\left( {\frac{{{\alpha _k}{\left( t \right)}}}{{{1} - {\alpha _k}{\left( t \right)}}}} \right)}\) and θk(low, t) = −θk(high, t).

Third, the generated values xik(t) and θk(x, t) were then plugged into equation (2) to calculate φi. For each i, draw a random number γ from a uniform distribution between 0 and 1. If γ<φi, the event of interest happened to individual i; otherwise, the event did not happen to individual i.

Finally, when the data missing rate was designated to be high, 60% of the xik(t) terms were randomly marked as missing; otherwise, there was no data missing.

For each combination of data characteristic parameters, the simulation was repeated 150 times. Each run was conducted with n = 1,000 (80:20 random split for holdout validation) and a constant 15% percentage of positive cases within the simulated sample (a larger cohort of individuals was first generated and then randomly downsampled to the desired size).

The smartphone-based recognition of human activities and postural transitions dataset

The waist-mounted smartphone inertial sensor dataset14 was downloaded from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Smartphone-Based+Recognition+of+Human+Activities+and+Postural+Transitions). Before download, the smartphone dataset had already been randomly partitioned into two subsets, with 70% of the subjects designated as the training set and the rest as the test set. From this dataset we extracted all 8.96-s time-series segments in which a person was continuously sitting and then classified each segment according to whether the person stood up within the next 5.12 s (‘+’) or not (‘−’). The training set contained 789 data segments (‘+’, n = 92; ‘−’, n = 697), and the test set contained 292 segments (‘+’, n = 40; ‘−’, n = 252).

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this Article.