Dear Editor,

Two randomized controlled trials (RCTs) have compared, head-to-head, the efficacy and safety of Bruton tyrosine kinase inhibitors (BTKis) in chronic lymphocytic leukemia (CLL); in both studies, the first-generation BTKi ibrutinib was used as the comparator arm. ELEVATE-RR (NCT02477696), a multicenter, randomized, open-label, noninferiority phase 3 trial, compared acalabrutinib vs. ibrutinib in patients with previously treated, high-risk [presence of del(17p) and/or del(11q)] CLL [1]. In this study, acalabrutinib met its primary endpoint of progression-free survival (PFS) noninferiority (hazard ratio [HR], 1.0; 95% confidence interval [CI], 0.79–1.27) with a median PFS of 38.4 months in both arms. Acalabrutinib demonstrated improved tolerability with fewer cardiovascular adverse events (AEs) vs. ibrutinib.

ALPINE (NCT03734016) was a global, randomized, open-label phase 3 trial designed to assess the superiority of zanubrutinib over ibrutinib in patients with relapsed/refractory (R/R) CLL or small lymphocytic lymphoma [2,3,4]. In the ALPINE intent-to-treat population, zanubrutinib demonstrated superior PFS compared with ibrutinib when assessed by either an independent review committee (IRC) or by the investigator (INV) [2]. In high-risk patients with del(17p)/TP53 mutation, as well as across other major subgroups, PFS favored zanubrutinib. Furthermore, zanubrutinib had an improved safety profile compared with ibrutinib with a lower rate of treatment discontinuation and fewer cardiac disorder events, including fewer deaths.

Comparison of ibrutinib arms across separate trials can be made using matching-adjusted indirect comparison (MAIC) methodology, where individual patient-level data (IPD) from one trial are combined with published aggregate data from another trial, followed by propensity score weighting. Baseline characteristics of patients with IPD are weighted, and IPD are reanalyzed to match outcome definitions in the aggregate data [5]. A recent indirect comparison of the ibrutinib arms across the ALPINE, ELEVATE-RR, and RESONATE (ibrutinib vs. ofatumumab) trials using MAIC methodology implied that ibrutinib underperformed in ALPINE [6]. The analysis matched key patient baseline characteristics including age ≥75 years, bulky disease, prior treatments, β2-microglobulin, and del(11q) or del(17p) status but omitted other characteristics critical for appropriate cross-trial comparisons, such as sex, TP53 and immunoglobulin heavy chain variable (IGHV) mutation status, complex karyotype, and Binet stage.

The present study compared the efficacy of the ibrutinib arms across the ALPINE and ELEVATE-RR trials using MAIC methodology and a more comprehensive list of matching variables to address the underperformance of ibrutinib within ALPINE reported by Ghia et al. [6]. As there was no common comparator between ALPINE and ELEVATE-RR when comparing the efficacy of the ibrutinib arms, this study used an unanchored MAIC, which was conducted in line with published recommendations [5]. The ALPINE ibrutinib arm IPD (N = 325) were filtered to include patients who met the inclusion criteria of ELEVATE-RR (i.e., R/R CLL with del(17p) or del(11q)). The resulting sample (N = 123) was re-weighted to align the distribution of relevant effect modifiers (EMs) and prognostic factors (PFs) with published aggregate data for the ibrutinib arm of ELEVATE-RR (N = 265) [1, 2]. Weights were determined using propensity scores. The MAIC was designed to adjust for all relevant EMs and PFs, which were identified based on a review of the impact of different subgroups analyzed in previous CLL trials and confirmed with clinical experts. The selected parameters for propensity score weighting in the base case were del(17p), del(11q), TP53 mutation status, IGHV mutation status, serum β2-microglobulin, number of prior therapies, and Binet stage. Re-weighted IPD were used to calculate adjusted efficacy outcomes in ALPINE. Weighted HRs were estimated to compare PFS-IRC, PFS-INV, and overall survival (OS) between the ibrutinib arms in ALPINE and ELEVATE-RR. Pseudo-IPD of time-to-event outcomes for the ibrutinib arm of ELEVATE-RR were reconstructed from Kaplan-Meier curves reported in the ELEVATE-RR publication using the algorithm by Guyot et al. [7]. HRs of time-to-event outcomes were estimated from a weighted Cox model (i.e., comparing weighted ibrutinib ALPINE data against the pseudo-IPD of ibrutinib in ELEVATE-RR). Nominal p values were reported for descriptive purposes.
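To illustrate the re-weighting step, the following is a minimal sketch of how MAIC weights can be derived. It assumes the method-of-moments estimator commonly used for MAIC, in which each patient receives a weight exp(z·α), with z the patient's covariates centered on the aggregate target means; the letter does not state which estimator was used. The function names (`solve_linear`, `maic_weights`), the two binary covariates, and the toy data are hypothetical and are not taken from ALPINE or ELEVATE-RR.

```python
import math


def solve_linear(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / m[r][r]
    return x


def maic_weights(ipd, target_means, iters=50, tol=1e-10):
    """Return weights w_i = exp(z_i . alpha), where z_i = x_i - target_means.

    alpha minimises the convex objective sum_i exp(z_i . alpha); at the minimum
    the weighted covariate means of the IPD equal the target means.
    Plain Newton steps are used (no line search), which is adequate for
    well-behaved toy data but is an assumption of this sketch.
    """
    k = len(target_means)
    z = [[x[j] - target_means[j] for j in range(k)] for x in ipd]
    alpha = [0.0] * k
    for _ in range(iters):
        w = [math.exp(sum(zi[j] * alpha[j] for j in range(k))) for zi in z]
        grad = [sum(wi * zi[j] for wi, zi in zip(w, z)) for j in range(k)]
        if max(abs(g) for g in grad) < tol:
            break
        hess = [[sum(wi * zi[r] * zi[c] for wi, zi in zip(w, z)) for c in range(k)]
                for r in range(k)]
        step = solve_linear(hess, [-g for g in grad])
        alpha = [a + s for a, s in zip(alpha, step)]
    return [math.exp(sum(zi[j] * alpha[j] for j in range(k))) for zi in z]


# Hypothetical IPD: two binary baseline characteristics per patient.
ipd = [(1, 1), (1, 0), (1, 0), (0, 1), (0, 0), (0, 0), (0, 1), (1, 1)]
# Hypothetical aggregate proportions reported for the comparator arm.
target = (0.6, 0.3)

weights = maic_weights(ipd, target)
weighted_means = [
    sum(w * x[j] for w, x in zip(weights, ipd)) / sum(weights) for j in range(2)
]
```

After weighting, `weighted_means` matches `target` up to numerical tolerance. The weights would then enter the weighted Cox model as case weights.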

Sensitivity analyses were performed to assess the robustness of the base case results. In the first sensitivity analysis, additional EMs and PFs, including age, sex, complex karyotype, bulky disease, and Eastern Cooperative Oncology Group Performance Status, were adjusted for. In a second sensitivity analysis, ALPINE PFS and OS were adjusted for COVID-19 impact, as ALPINE was conducted during the COVID-19 pandemic, whereas the ELEVATE-RR follow-up data included in this analysis were mostly collected before it. This was achieved by censoring patients who died due to COVID-19 at the most recent disease assessment prior to death or at the time of death due to COVID-19.
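The COVID-19 censoring rule can be sketched as a recoding of each patient's time and event indicator. The sketch below rests on one possible reading of the rule, which is an assumption: for PFS, a COVID-19 death without prior progression is censored at the most recent disease assessment, and for OS, a COVID-19 death is censored at the date of death. The record layout and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PatientRecord:
    progression_time: Optional[float]  # months to progression, None if none observed
    death_time: Optional[float]        # months to death, None if alive
    covid_death: bool                  # True if the death was attributed to COVID-19
    last_assessment: float             # months to most recent disease assessment
    last_contact: float                # months to last known contact


def os_outcome(p, adjust_covid=False):
    """Return (time, event) for overall survival; event 1 = death, 0 = censored."""
    if p.death_time is None:
        return (p.last_contact, 0)
    if adjust_covid and p.covid_death:
        return (p.death_time, 0)  # assumed rule: censor at the COVID-19 death
    return (p.death_time, 1)


def pfs_outcome(p, adjust_covid=False):
    """Return (time, event) for PFS; event 1 = progression or death, 0 = censored."""
    if p.progression_time is not None:
        return (p.progression_time, 1)  # progression is unaffected by the adjustment
    if p.death_time is not None:
        if adjust_covid and p.covid_death:
            return (p.last_assessment, 0)  # assumed rule: censor at last assessment
        return (p.death_time, 1)
    return (p.last_assessment, 0)


# Hypothetical patient who died of COVID-19 at month 14 without prior progression.
example = PatientRecord(
    progression_time=None,
    death_time=14.0,
    covid_death=True,
    last_assessment=12.0,
    last_contact=14.0,
)
unadjusted = (pfs_outcome(example), os_outcome(example))
adjusted = (pfs_outcome(example, adjust_covid=True), os_outcome(example, adjust_covid=True))
```

For this patient, the unadjusted analysis counts an event for both endpoints at month 14, while the adjusted analysis censors PFS at month 12 and OS at month 14.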

Baseline characteristics of the populations before matching, together with the EMs and PFs adjusted for in the base case and the sensitivity analyses, are summarized in Table 1A. Matching the two populations reduced the effective sample size (ESS) from 123 to 63 in the base case analysis.
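For reference, the ESS of a weighted sample is conventionally computed as the squared sum of the weights divided by the sum of the squared weights. The short sketch below assumes that definition; the function name and the example weights are hypothetical.

```python
def effective_sample_size(weights):
    """ESS = (sum of weights)^2 / (sum of squared weights)."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


# With equal weights, the ESS equals the number of patients.
ess_equal = effective_sample_size([1.0] * 123)

# Unequal weights reduce the ESS below the number of patients.
ess_unequal = effective_sample_size([3.0, 1.0, 1.0, 1.0])
```

The more the weights vary, the smaller the ESS, which is why matching on a longer list of characteristics tends to reduce it.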

Table 1 Baseline characteristics before matching and after adjustment for the ibrutinib arms of ALPINE and ELEVATE-RR (A) and PFS-IRC, PFS-INV, and OS (B) in the base case and sensitivity analyses.

The base case PFS-IRC, PFS-INV, and OS for the ibrutinib arms of ALPINE and ELEVATE-RR are shown in Fig. 1 and Table 1B. After matching (median follow-up, 28.4 months), no statistically significant differences were observed in PFS-IRC (HR [95% CI] = 0.80 [0.49–1.28], p = 0.3485), PFS-INV (HR [95% CI] = 1.18 [0.75–1.86], p = 0.4827), or OS (HR [95% CI] = 0.91 [0.50–1.65], p = 0.7539) between the ibrutinib arms of ALPINE and ELEVATE-RR. Findings from the sensitivity analyses for additional EMs and PFs and the COVID-19 adjustment were consistent with those observed for the base case (Table 1B).
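As a rough consistency check, a Wald-type p value can be back-calculated from a reported HR and its 95% CI by taking the CI width on the log scale as 2 × 1.96 standard errors. This is an approximation under a normality assumption and is not necessarily how the reported p values were derived; the function name is hypothetical.

```python
import math


def wald_p_from_ci(hr, lower, upper, z_crit=1.959964):
    """Approximate two-sided p value from an HR and its 95% CI (log-scale Wald test)."""
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)  # SE of log(HR)
    z = math.log(hr) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return se, z, p


# Base case estimates as reported in the text: (HR, lower, upper, reported p).
reported = {
    "PFS-IRC": (0.80, 0.49, 1.28, 0.3485),
    "PFS-INV": (1.18, 0.75, 1.86, 0.4827),
    "OS": (0.91, 0.50, 1.65, 0.7539),
}

approx_p = {
    name: wald_p_from_ci(hr, lo, hi)[2] for name, (hr, lo, hi, _) in reported.items()
}
```

The back-calculated values fall within a few hundredths of the reported nominal p values; small gaps are expected because the published HRs and CIs are rounded to two decimal places.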

Fig. 1: Survival outcomes.

A PFS-IRC. B PFS-INV. C OS. aGiven the availability of both IRC- and INV-assessed data. CI confidence interval, CLL chronic lymphocytic leukemia, COVID-19 coronavirus disease-19, EM effect modifier, HR hazard ratio, INV investigator, IPD individual patient-level data, IRC independent review committee, MAIC matching-adjusted indirect comparison, ORR overall response rate, OS overall survival, PF prognostic factor, PFS progression-free survival, RCT randomized clinical trial, R/R relapsed refractory, SLL small lymphocytic lymphoma.

While no significant differences were observed between the efficacy outcomes in the ibrutinib arms of ALPINE and ELEVATE-RR, ibrutinib in ALPINE showed numerical “overperformance” compared to ELEVATE-RR with regard to PFS-IRC and OS. This trend could be observed for the base case and sensitivity analyses, where the HRs of PFS-IRC and OS for the ibrutinib arms of ALPINE vs. ELEVATE-RR were always below 1. In contrast, the base case HR for PFS-INV was above 1 (1.18). These observations highlight the importance of considering both PFS-IRC and PFS-INV for unanchored MAICs, where available, as the conclusions may change when using different PFS measurements. However, given that both ALPINE and ELEVATE-RR were open-label trials, PFS-IRC is the preferred endpoint [8, 9].

Findings from the present study contrast with the results of the previous MAIC [6]. The MAIC results from Ghia et al. showed that PFS and overall response rate outcomes for ibrutinib were consistent between RESONATE and ELEVATE-RR but that ibrutinib “underperformed” in ALPINE. The findings here show no statistically significant difference between the ibrutinib arms of ALPINE and ELEVATE-RR. The disparate findings between the present and previous study may be attributed to differences in the EMs and PFs adjusted for in the MAIC analyses. Several important patient characteristics, such as sex, IGHV mutation status, TP53 mutation status, complex karyotype, and Binet stage, were not considered in the Ghia study. Complex karyotype, advanced Binet stage, unmutated IGHV, del(11q), and TP53 abnormalities are high-risk markers for CLL [10]. Failure to appropriately identify and select EMs and PFs in MAICs may result in biased or uncertain effect estimates, impacting the validity of the analysis [11].

Indirect treatment comparisons such as MAICs provide useful information on the comparative efficacy of treatments evaluated in separate trials, potentially filling evidence gaps for health technology assessments [5, 12]. However, due to limitations (modeling assumptions and cross-trial differences in baseline characteristics) and confounding associated with these methodologies, MAIC analyses cannot replace the gold standard of RCTs, should be interpreted with caution, and should be viewed as observational and hypothesis-generating [5, 13].

Like any other MAIC, this study had some limitations. Notably, the ESS of the ibrutinib arm in ALPINE was reduced to 63 after filtering out the non-high-risk patients and conducting the matching and adjustment. The study was by nature limited to the high-risk ALPINE population, which reduced the starting sample size. The ESS was further decreased as all important baseline characteristics were considered for accurate comparisons. Despite the small ESS, results were consistent across the multiple sensitivity analyses tested.

The present study did not evaluate the efficacy of the ibrutinib arm of RESONATE. Given that both ELEVATE-RR and ALPINE are more contemporary trials that compare a next-generation BTKi to ibrutinib, ELEVATE-RR was deemed more suitable for this comparison. We would expect ibrutinib to perform slightly better in RESONATE compared to ALPINE, potentially due to (1) the difference between RESONATE and ALPINE with regard to geographic distribution of patients and (2) the fact that ibrutinib was the only BTKi available in clinical trials at the time of RESONATE, with the only alternatives being standard of care chemotherapies, possibly leading to enhanced adherence.

In conclusion, this MAIC used a comprehensive list of matching variables to compare the efficacy of the ibrutinib arms in ALPINE and ELEVATE-RR, showing no significant difference in the performance of ibrutinib across the two trials. Results were robust in all sensitivity analyses. While MAICs provide a basis for hypothesis generation with regard to treatment efficacy across trials, they are not a substitute for head-to-head RCTs, as they cannot balance all observable and unobservable differences at baseline. Consequently, ultimate evidence of relative efficacy must be sought within RCTs.