Introduction

When planning a long-awaited vacation, the decision of where to go to have an “optimal” experience depends on many factors including the home base (Hawaii may be more feasible for someone based on the West Coast of North America than someone on the East Coast), previous personal travel experiences, and advice from friends or online reviews. Similarly, although the primary and most desirable goal of cancer or graft-versus-host disease (GvHD) treatment is cure, the decision of how to treat someone to achieve an “optimal” outcome may also depend on numerous subject-level covariates analogous to those in our holiday-planning example: where we are presently (general health and performance status, specific organ function, disease burden, mutation landscape, and immune profile), where we have been (response to and adverse effects of previous treatments, perhaps including molecular and immune biomarker responses), and where we could realistically go (anticipated performance of future treatment options, subject to constraints of availability, cost, etc.). The first and second considerations are based on a person’s current and prior data. The third draws on the outcomes of former patients.

Of course, for many people, treatment decisions must be made several times in a series of treatment choices, using updated data on the subject’s current condition and prior experiences. Examples of everyday decisions haematopoietic cell transplant physicians and immunotherapists encounter include:

  1. Treatment of people with acute myeloid leukemia (AML) posttransplant with measurable residual disease or relapse. Many receive hypomethylating therapy, intensive chemotherapy, donor lymphocyte infusions, and/or second transplants in diverse sequences.

  2. Immune suppression strategies to prevent and treat acute and chronic GvHD [1,2,3].

  3. Sequencing autotransplant vs. chimeric antigen receptor (CAR)-T-cell therapy in persons with advanced lymphomas, or sequencing new therapies for chronic lymphocytic leukemia.

To fulfill the goal of precision medicine, adaptive treatment strategies (ATS) use data from observed patient experiences to develop treatment recommendations tailored to a given person. In this tutorial, we give an overview of two specific methods of statistical estimation of ATS, each from one of two broad classes of approaches. We illustrate these approaches using recent analyses of data from the Center for International Blood and Marrow Transplant Research (CIBMTR) [3].

Two approaches to estimating optimal ATS for a single treatment decision

Developing an optimal ATS requires several steps. The first is to clearly define the research question by specifying: (1) how many treatment decision nodes there are and what treatment options are available at each; (2) what subject-level data relevant to the possible decisions are available; (3) what subject-level data were used to assign treatment (e.g., was it a randomized trial or were hospital-specific or country-specific treatment guidelines used); and (4) which outcome(s) is to be optimized. The latter may be difficult to specify. For example, targeting cure or maximizing disease-free survival (DFS) may overlook important considerations such as adverse effects, quality-of-life (QoL), or cost. A utility combining several competing outcomes could be used. However, constructing such a utility requires substantial input from physicians and patients [4].

There are two broad approaches to estimating ATS: (1) regression-based (indirect) methods; and (2) value-search methods. We outline one example of each approach in a hypothetical simple, one-decision setting for a continuous outcome. We then discuss how these approaches are extended to other outcome types and multiple decision stages. We also discuss a recent application to the two-stage decision-making approach to preventing and, if needed, treating GvHD using immune suppressive therapies in the course of an allotransplant for persons with AML or myelodysplastic syndrome.

We begin with notation for the single decision (or single stage) case. We consider a binary treatment Z, pre-treatment covariates X, and a binary outcome Y. For example, we may wish to decide whether to give someone anti-thymocyte globulin (ATG) prophylaxis in addition to standard tacrolimus/methotrexate (ATG is denoted as Z) to maximize GvHD-free, relapse-free survival (GRFS, our Y) taking into consideration factors such as sex match, age, pretransplant conditioning regimen intensity, donor histocompatibility and relatedness, and type of graft (collectively denoted X). In fact, much of the statistical literature on ATS focuses on a continuous outcome, for instance a biological marker or QoL. However, our focus in this example is on outcomes relevant to transplant physicians.

Regression-based (indirect) estimation of a single-stage ATS

One class of estimating approaches to ATS relies on the familiar method of regression. Specifically, a model is fit for the probability of observing the outcome, Y, as a function of the treatment, the covariates, and interactions between these. In general terms, we write:

$${\mathrm{logit}}\left( {{\mathrm{Pr}}\left[ {Y \,=\, 1{\mathrm{|}}Z,X;\, \beta ,\varphi } \right]} \right) \,=\, g\left( {X;\beta } \right) \,+\, \gamma \left( {X,Z;\varphi } \right).$$
(1)

It is common to choose both g and γ to be linear functions. As a trivial example, if we take Z to be 1 when treatment is ATG and 0 otherwise, then we may specify:

$${\mathrm{logit}}\left( {{\mathrm{Pr}}\left[ {Y \,=\, 1{\mathrm{|}}Z,X;\, \beta ,\varphi } \right]} \right) \,=\, \left( {\beta _0 \,+\, \beta _1{\mathrm{HLAmismatch}}} \right) \,+\, Z\left( {\varphi _0 \,+\, \varphi _1{\mathrm{HLAmismatch}}} \right).$$

Note that in this equation, \(g\left( {X;\beta } \right) \,=\, \left( {\beta _0 \,+\, \beta _1{\mathrm{HLAmismatch}}} \right)\) describes the impact of an HLA-mismatched donor (relative to that of an HLA-matched donor) on GRFS under standard immune suppression (tacrolimus/methotrexate), i.e., when Z = 0. GRFS (or, more accurately, the logit of the probability of experiencing GRFS) is altered by \(\gamma \left( {X,Z \,=\, 1;\varphi } \right) \,=\, \left( {\varphi _0 \,+\, \varphi _1{\mathrm{HLAmismatch}}} \right)\) when treatment is, instead, ATG. Note that if \(\left( {\varphi _0 \,+\, \varphi _1{\mathrm{HLAmismatch}}} \right) \,> \, 0\), then the probability of experiencing GRFS will be higher when ATG therapy is used. Thus, if we can estimate the parameters φ0 and φ1, then we can estimate the optimal rule as “treat with ATG when \(\left( {\varphi _0 \,+\, \varphi _1{\mathrm{HLAmismatch}}} \right) \,> \, 0\)”; otherwise, treat with tacrolimus/methotrexate alone. Returning to the general regression formulation given in Eq. (1), this suggests that the optimal rule can be deduced by: (1) estimating the regression parameters of Eq. (1); and (2) specifying the optimal rule to be that which assigns treatment Z = 1 whenever \(\gamma \left( {X,Z \,=\, 1;\varphi } \right) \,> \, 0\), assuming that the larger value of Y (in this case, 1) is preferable.
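To make the recipe concrete, here is a minimal simulated sketch in Python. The sample size, covariate, and data-generating coefficients are invented for illustration (this is not the CIBMTR analysis): a logistic model with a treatment-by-covariate interaction is fit by Newton–Raphson, and the estimated rule gives ATG whenever the fitted \(\hat \varphi _0 \,+\, \hat \varphi _1{\mathrm{HLAmismatch}}\) is positive.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
expit = lambda t: 1 / (1 + np.exp(-t))

hla = rng.integers(0, 2, n)                  # 1 = HLA-mismatched donor
z = rng.integers(0, 2, n)                    # 1 = ATG, 0 = tacrolimus/methotrexate
lin = -0.2 - 0.5*hla + z*(-0.4 + 0.9*hla)    # invented "true" coefficients
y = rng.binomial(1, expit(lin))              # 1 = GRFS achieved

# design matrix for Eq. (1): g(X) = b0 + b1*HLA, gamma(X, Z=1) = phi0 + phi1*HLA
X = np.column_stack([np.ones(n), hla, z, z*hla])
beta = np.zeros(4)
for _ in range(25):                          # Newton-Raphson maximum-likelihood fit
    p = expit(X @ beta)
    H = (X * (p*(1 - p))[:, None]).T @ X     # observed information
    beta += np.linalg.solve(H, X.T @ (y - p))

phi0, phi1 = beta[2], beta[3]
rule = lambda mismatch: int(phi0 + phi1*mismatch > 0)  # "give ATG when gamma > 0"
```

With the coefficients assumed here, the estimated rule should recommend ATG only for HLA-mismatched recipients, since the fitted interaction makes \(\hat \varphi _0 \,+\, \hat \varphi _1\) positive while \(\hat \varphi _0\) alone is negative.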

This regression-based approach has the advantage of relying on a familiar statistical tool accompanied by standard approaches to covariate selection, model diagnostics, and so on. This straightforward approach, whether in the single- or multiple-stage setting, is known as Q-learning. However, the approach relies on assuming the model designated to describe Eq. (1) is correctly specified. This implies that all confounders [5] have been measured. Regression-based methods can be made more robust by specifying a very flexible functional form for the regression equation, for example by using splines [6], nonlinear models, or even nonparametric machine learning approaches [7], though the latter may lose some of the interpretability of a more traditional regression model. Regression-based methods can also be made more robust by specifying a propensity score model (i.e., a model for the treatment assignment mechanism) and using this to adjust for confounding, so that even if g(X; β) is not correctly specified, the estimated treatment rule will be consistent for the truth in large samples, provided the propensity score model and the component of the model that specifies the treatment rule, γ(X, Z; φ), are correctly specified. These so-called doubly robust methods include g-estimation [8] and dynamic weighted ordinary least squares [9], although neither is well-developed for binary outcomes.

Value-search (direct) estimation of a single-stage ATS

In the ATS literature the expected outcome is sometimes known as the value function. We may consider the value function under a treatment strategy. For example, \(V^1 \,=\, E\left[ {Y\left( {Z \,=\, 1} \right)} \right]\) represents the value under the treatment strategy “give everyone ATG”, whereas \(V^d \,=\, E\left[ {Y\left( {Z \,=\, d\left( X \right)} \right)} \right]\), where \(d\left( X \right) \,=\, I\left( {\left( {\varphi _0 \,+\, \varphi _1{\mathrm{HLAmismatch}}} \right) \,> \, 0} \right)\), is the expected probability of GRFS if all patients were treated according to the rule “treat with ATG when \(\left( {\varphi _0 \,+\, \varphi _1{\mathrm{HLAmismatch}}} \right) \,> \, 0\)”. Value-search methods aim to estimate the value function directly under a series of candidate treatment strategies, d. These strategies could be linear decision rules such as “treat with ATG when \(\left( {\varphi _0 \,+\, \varphi _1{\mathrm{HLAmismatch}}} \right) \,> \, 0\)”, or could involve nonlinear treatment rules such as “treat with ATG when \(\left( {\vartheta _0 \,+\, 1.2^{Age - \vartheta _1}} \right) \,> \, 0\)”, where the latter cannot easily be estimated with traditional regression-based methods.

A classic value-search method relies on inverse probability of treatment weighting (IPTW). In this approach, the propensity score is used to construct weights that remove confounding, which may exist when treatment is not randomly assigned. The analyst first estimates the propensity score model, constructs inverse probability of treatment weights, and then computes a weighted average of the outcomes Y for those individuals who followed a given treatment rule, d. Using the same propensity score model and weights, weighted averages are computed for each candidate treatment rule, and the resulting estimates are compared to see which candidate rule returns the greatest expected outcome.
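The procedure can be sketched on simulated data as follows. All coefficients and the confounded assignment mechanism are invented for illustration; the propensity score is estimated nonparametrically within covariate strata, and three candidate rules are compared by their IPTW-estimated values.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000
expit = lambda t: 1 / (1 + np.exp(-t))

hla = rng.integers(0, 2, n)                  # 1 = HLA-mismatched donor
pz = np.where(hla == 1, 0.7, 0.3)            # confounded: mismatch -> more ATG
z = rng.binomial(1, pz)
lin = -0.2 - 0.5*hla + z*(-0.4 + 0.9*hla)    # invented "true" outcome model
y = rng.binomial(1, expit(lin))

# propensity score estimated within HLA strata (X is binary here)
ps = np.array([z[hla == k].mean() for k in (0, 1)])[hla]
pr_obs = np.where(z == 1, ps, 1 - ps)        # Pr(observed treatment | X)

def value(rule):
    """IPTW estimate of E[Y(d)]: weighted mean of Y among rule-followers."""
    d = rule(hla)
    w = (z == d) / pr_obs
    return np.sum(w * y) / np.sum(w)

candidates = {
    "never ATG":       lambda x: np.zeros_like(x),
    "always ATG":      lambda x: np.ones_like(x),
    "ATG if mismatch": lambda x: x,
}
vals = {name: value(rule) for name, rule in candidates.items()}
best = max(vals, key=vals.get)
```

Under the assumed data-generating model, the tailored rule “ATG if mismatch” attains the highest estimated value, illustrating that a tailored strategy can beat both one-size-fits-all options.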

Value-search approaches have the advantage of more easily accommodating rules of any form, not simply linear decision rules. However, more sophisticated approaches than IPTW are generally recommended, as IPTW estimators of the value (expected outcome) often have large standard errors, making it difficult to distinguish the relative benefit of the candidate rules. These approaches include augmented IPTW [10], residual weighted learning [11], and others (e.g., [12]). Value-search methods do not require the true best adaptive decision rule to be among the candidate rules: in large samples, the approach will simply select the strategy that yields the best outcome among the candidate strategies considered.

Two approaches to estimating optimal ATS for a sequence of several treatment decisions

Consider now a setting where we have to make multiple treatment decisions. Recall, for example, setting (2) above, where interest lies in devising the best strategy to prevent and/or treat acute and chronic GvHD. Interest may lie in maximizing the binary outcome of 2-year DFS [3] or maximizing DFS time without restriction to 2 years and allowing censoring [2].

We now extend the notation to accommodate two stages of decision-making, again wishing to individualize treatment. Consider a binary treatment Z1 indicating use of ATG given pretransplant to prevent GvHD. Pre-treatment covariates X1 are measured and may include recipient-, donor-, and disease-related variables. In subjects who develop GvHD, further interventions of differing intensities will be offered; this second stage of treatment is denoted Z2. Covariates measured before giving GvHD therapy, denoted X2, may include all or some subset of X1, as well as post-Z1 variables such as time to develop GvHD, current health, investigational biomarkers of GvHD severity, and/or functional measures such as the Karnofsky Performance Status score. Again, we consider a binary outcome Y such as 2-year GRFS.

A key concern with multistage interventions is that a treatment may have delayed effects. For example, an intensive therapy may elicit a good short-term response but may compromise subsequent therapy(ies), resulting in a lower long-term success rate [13, 14]. Estimation must therefore proceed either sequentially backwards (for regression-based estimators) or by considering the ATS as a whole (for value-search methods).

Regression-based (indirect) estimation of a multistage ATS

Implementation of Q-learning in a multistage decision analysis follows a general sequential algorithm, outlined here for the two-stage case:

  1. Propose a model for the final outcome as a function of the second stage of treatment, Z2, and any elements of X*2 = (X1, Z1, X2) that (i) may be potential tailoring variables for stage 2 treatment or (ii) may be important predictors of the outcome Y. To be more concise, we let X*2 denote (X1, Z1, X2):

    $${\mathrm{logit}}\left( {{\mathrm{Pr}}\left[ {Y \,=\, 1{\mathrm{|}}Z_2,X_2^ \ast ;\beta _2,\varphi _2} \right]} \right) \,=\, g_2\left( {X_2^ \ast ;\beta _2} \right) \,+\, \gamma _2\left( {Z_2,X_2^ \ast ;\varphi _2} \right)$$
    (2)

    where the functions g2() and γ2() are analogous to those defined in Eq. (1). More specifically, the contrast function γ2() is used to define a (possibly linear) decision rule at the second decision stage.

  2. Estimate the parameters in Eq. (2) and use these to define the optimal (estimated) stage 2 decision rule as “treat with Z2 = 1 whenever \(\gamma _2\left( {Z_2 \,=\, 1,X_2^ \ast ;\hat \varphi _2} \right) \, > \, \gamma _2\left( {Z_2 \,=\, 0,X_2^ \ast ;\hat \varphi _2} \right)\)”, or equivalently “treat with Z2 = 1 whenever \(\gamma _2\left( {Z_2 \,=\, 1,X_2^ \ast ;\hat \varphi _2} \right) \,> \, 0\)”. Through X*2 = (X1, Z1, X2), this rule may account for previous (stage 1) treatment, Z1, and responses to that treatment, contained in X2.

  3. We now wish to estimate the optimal first-stage decision. However, using principles much like those in traditional randomized clinical trials, we do not wish to condition on or adjust for any post-(stage 1) treatment variables. In particular, we do not wish to condition on the second-stage treatment, and yet we must ensure that comparisons between the two stage 1 treatment options are “fair” and not simply a reflection of later, downstream treatments. We accomplish this by creating a new pseudo-outcome, denoted \(\tilde Y_1\), which we generate for each individual in the sample according to:

    $$\tilde Y_1 \,=\, {\mathrm{max}}\left\{ {{\mathrm{Pr}}\left[ {Y \,=\, 1{\mathrm{|}}Z_2 \,=\, 0,X_2^ \ast ;\hat \beta _2,\hat \varphi _2} \right],{\mathrm{Pr}}\left[ {Y \,=\, 1{\mathrm{|}}Z_2 \,=\, 1,X_2^ \ast ;\hat \beta _2,\hat \varphi _2} \right]} \right\}.$$

    That is, the pseudo-outcome is the estimated “best possible” probability of the outcome an individual could have based on the estimates for the outcome model specified in Eq. (2). Using this pseudo-outcome is equivalent to performing a stage 1 analysis in a world where all individuals in the sample were treated optimally at the second stage. Note that in this “optimal treatment world”, not everyone would receive the same treatment (ATG or standard), but all would be treated according to the same rule (“treat with Z2 = 1 whenever \(\gamma _2\left( {Z_2 \,=\, 1,X_2^ \ast ;\varphi _2} \right) \, > \, 0\)”).

  4. Propose a model for the pseudo-outcome as a function of the first stage of treatment, Z1, and X1:

    $${\mathrm{logit}}\left( {{\mathrm{Pr}}\left[ {\tilde Y_1 \,=\, 1{\mathrm{|}}Z_1,X_1;\beta _1,\varphi _1} \right]} \right) \,=\, g_1\left( {X_1;\beta _1} \right) \,+\, \gamma _1\left( {Z_1,X_1;\varphi _1} \right)$$
    (3)

    where again g1() and γ1() are analogous to the functions in Eq. (1).

  5. Estimate the parameters in Eq. (3), and use these to define the optimal (estimated) stage 1 decision rule as “treat with Z1 = 1 whenever \(\gamma _1\left( {Z_1 \,=\, 1,X_1;\hat \varphi _1} \right) \, > \, 0\)”.

The above algorithm can be adapted to more than two stages simply by computing a new pseudo-outcome for all stages other than the final stage. The final ATS is then a sequence of rules of the form “treat with Zj = 1 whenever \(\gamma _j( {Z_j \,=\, 1,X_j^ \ast ;\hat \varphi _j} ) \, > \, 0\)” for each treatment stage j. Other regression-based forms of estimation vary in how the pseudo-outcome is constructed; however, the basic principles of the backwards inductive approach remain.
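The five steps above can be sketched in Python on simulated two-stage data. All coefficients are invented for illustration, everyone is assumed to reach stage 2 (in reality only subjects developing GvHD would), and the stage 1 model is fit by ordinary least squares on the pseudo-outcome, a common simplification since the pseudo-outcome is a probability rather than a binary variable.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10000
expit = lambda t: 1 / (1 + np.exp(-t))

def fit_logistic(X, y, iters=25):
    """Newton-Raphson maximum-likelihood fit of a logistic regression."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = expit(X @ beta)
        H = (X * (p*(1 - p))[:, None]).T @ X
        beta += np.linalg.solve(H, X.T @ (y - p))
    return beta

# simulated data: both treatments randomized, invented effect sizes
x1 = rng.integers(0, 2, n)          # baseline covariate
z1 = rng.integers(0, 2, n)          # stage-1 treatment
x2 = rng.integers(0, 2, n)          # interim covariate measured before stage 2
z2 = rng.integers(0, 2, n)          # stage-2 treatment
lin = -0.3 + 0.4*x1 + z1*(0.6 - 1.2*x1) + z2*(-0.5 + 1.0*x2)
y = rng.binomial(1, expit(lin))

# Steps 1-2: stage-2 model with features [1, X1, Z1, Z1*X1, X2, Z2, Z2*X2]
X2s = np.column_stack([np.ones(n), x1, z1, z1*x1, x2, z2, z2*x2])
b2 = fit_logistic(X2s, y)
phi20, phi21 = b2[5], b2[6]
d2 = lambda x2v: int(phi20 + phi21*x2v > 0)    # estimated stage-2 rule

# Step 3: pseudo-outcome = best achievable fitted probability over Z2
base = X2s[:, :5] @ b2[:5]                     # fitted logit with Z2 = 0
y_tilde = np.maximum(expit(base), expit(base + phi20 + phi21*x2))

# Steps 4-5: stage-1 fit on the pseudo-outcome (OLS simplification)
X1s = np.column_stack([np.ones(n), x1, z1, z1*x1])
b1, *_ = np.linalg.lstsq(X1s, y_tilde, rcond=None)
phi10, phi11 = b1[2], b1[3]
d1 = lambda x1v: int(phi10 + phi11*x1v > 0)    # estimated stage-1 rule
```

Under the assumed coefficients, the estimated ATS treats at stage 2 only when X2 = 1 and treats at stage 1 only when X1 = 0, recovering the tailoring built into the simulation.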

Value-search estimation of a multistage ATS

The extension of the simple, inverse probability of treatment-weighted estimator to multiple stages is straightforward. As in the single-stage setting, a set of candidate treatment strategies must be posited by the analyst. Propensity score models are fit—now at each decision stage—and inverse probability of treatment weights are constructed at each stage and then multiplied together. As in the single-stage setting, for each candidate strategy of interest, those individuals in the sample who were observed to follow the treatment strategy under investigation are used to compute a weighted average of the outcomes Y, thus yielding an estimate of the value function for that strategy. The resulting estimates of the value functions for each candidate strategy are compared to see which returns the greatest expected outcome.
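A minimal sketch of this multistage IPTW estimator on simulated two-stage data follows. The coefficients and confounded assignment mechanisms are invented; for brevity the true propensities are plugged in, whereas in practice a propensity model would be estimated at each stage.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 40000
expit = lambda t: 1 / (1 + np.exp(-t))

x1 = rng.integers(0, 2, n)
p1 = np.where(x1 == 1, 0.6, 0.4)        # confounded stage-1 assignment
z1 = rng.binomial(1, p1)
x2 = rng.integers(0, 2, n)              # interim covariate
p2 = expit(0.4*x2 - 0.4*z1)             # confounded stage-2 assignment
z2 = rng.binomial(1, p2)
lin = -0.3 + 0.4*x1 + z1*(0.6 - 1.2*x1) + z2*(-0.5 + 1.0*x2)
y = rng.binomial(1, expit(lin))

# probability of the treatment actually received at each stage
pr1 = np.where(z1 == 1, p1, 1 - p1)
pr2 = np.where(z2 == 1, p2, 1 - p2)

def value(d1, d2):
    """IPTW value estimate: weights from both stages multiplied together."""
    follow = (z1 == d1(x1)) & (z2 == d2(x2))   # followed the strategy at both stages
    w = follow / (pr1 * pr2)
    return np.sum(w * y) / np.sum(w)

strategies = {
    "never/never":   (lambda a: 0*a,     lambda b: 0*b),
    "always/always": (lambda a: 0*a + 1, lambda b: 0*b + 1),
    "tailored":      (lambda a: (a == 0).astype(int), lambda b: b),
}
vals = {name: value(d1, d2) for name, (d1, d2) in strategies.items()}
best = max(vals, key=vals.get)
```

Under the assumed model, the tailored strategy (treat at stage 1 only when X1 = 0; treat at stage 2 only when X2 = 1) attains the highest estimated value among the candidates.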

As in the single-stage setting, there are numerous alternatives to the simple IPTW approach, many of which include some form of outcome modeling and consequently offer greater precision in the estimated value function and thus the choice of preferred treatment strategy.

Further extensions

The methods we describe along with related methods have been extended in several ways. Beyond two treatment options, one can consider multiple distinct treatments or even continuous treatment [15, 16]. Censored continuous outcomes—e.g., “survival time”—can also be accommodated with censoring handled by assuming independent censoring or by using inverse probability of censoring weights [17,18,19,20]. Within regression-based approaches this is done by assuming appropriate models in Eqs. (2) and (3). For example, a Cox model could be assumed for a survival outcome. Except for a few instances (e.g., [3, 21]), other outcome types such as counts or binary outcomes have rarely been considered. We are unaware of any analyses or methods that have aimed to optimize a binary outcome over more than two stages of intervention.

The form of the outcome model can be very general, particularly for Q-learning. For example, we considered a parametric model for DFS time that allowed for a fraction of individuals to be cured [2]. This model allowed for treatment to interact with covariates differently in the cure and the survival components of the model. This flexible approach revealed that although on average ATG therapy is not a preferred treatment choice for either GvHD prevention or treatment (see Fig. 1) when the outcome being considered is DFS, significant numbers of people may benefit from pretransplant ATG and a much smaller fraction of patients with GvHD may benefit from ATG treatment (Fig. 2).

Fig. 1: Survival comparison for ATG vs. standard GvHD prophylaxis and treatment.

Kaplan–Meier curves show a clear overall benefit of standard therapy over ATG at both the first treatment stage (left panel, GvHD prophylaxis) and the second treatment stage (right panel, treatment of established GvHD).

Fig. 2: Difference in expected disease-free survival if not cured (y-axis) vs. difference in probability of cure (x-axis) if given ATG instead of standard treatment.

For a small number of people, ATG may be preferred as it confers a higher probability of being cured (no death or relapse) than standard treatment, particularly at the first stage (left panel, prophylaxis), with much smaller numbers expected to benefit at the second GvHD stage (right panel, GvHD treatment). Black crosses indicate patients for whom ATG is the preferred prophylaxis or GvHD treatment.

Algorithm validation

Oncologists are already familiar with trials that evaluate two or more treatments, using either factorial designs [22] or sequential randomizations where nonresponders are randomized to different salvage therapies, or responders are randomized to different consolidation or maintenance treatments or to maintenance vs. no intervention (e.g., [23, 24]). Unfortunately, results of first and second randomizations are often published in separate articles (for example [25,26,27,28]). Consequently, insights that might emerge by considering the trajectories as a whole are lost. The implication is that the infrastructure for conducting trials with multiple treatment assignments and sequential randomizations already exists.

The Sequential Multiple-Assignment Randomized Trial (SMART) is the preferred trial design to develop ATS that could serve as decision support tools and practice guidelines. Compared with developing ATS through retrospective analysis of medical records and registry data, prospective development of ATS through SMART clinical trials has the advantage that randomization reduces selection bias and confounding-by-indication along with reducing the risk of unmeasured confounders. In a SMART, participants are randomized to one of a pre-defined list of treatment options at each critical decision node. This approach allows discovery and testing of tailoring variables while also assessing comparative efficacy of different treatments. SMARTs are used to identify which subset of X1 and X2 predicts a good or poor response to a given treatment at the respective decision stage.

Validating an ATS, whether developed in a SMART or through retrospective analyses, would require a subsequent conventional randomized trial. For example, subjects could be randomized between the ATS and a different “standard-of-care” treatment sequence. Alternatively, the ATS could be tested in “ecological studies” where it is implemented in some hospitals or over some period, and the outcome of subjects treated under the ATS compared with the outcome of contemporaneous subjects treated in different hospitals that did not use the ATS, or to a historical cohort.

Conclusion

In this brief review, we introduced two simple forms of analysis for estimating optimal ATS. These methods can be applied to nonexperimental data such as those arising from clinical practice or registries or can be applied to randomized trials. In particular, the SMART design is specifically targeted at designing treatment algorithms for tailored interventions with multiple decision points [29,30,31].

An important point to keep in mind is that, in general, analyses aimed at uncovering ATS are exploratory rather than confirmatory in nature. SMARTs may be confirmatory in nature but are typically powered for tailoring only on a very small number of covariates such as response to first-stage treatment. Nevertheless, these methods can identify candidate strategies and tailoring variables that appear promising and discard other clearly suboptimal strategies. As big data, expanded access to anonymized electronic medical records and incorporation of novel biomarkers into clinical decision making become the norm, ATS approaches to developing decision support tools are becoming increasingly feasible and potentially useful, both for ‘simple’ decisions and complex ones.