Main

In patients with heart failure, kidney function is a powerful independent predictor of future heart failure hospitalization and death, irrespective of left ventricular ejection fraction (LVEF)1,2,3,4. The natural history of heart failure is characterized by progressive worsening of the syndrome over time and this usually includes worsening of kidney function3,5,6,7. Kidney function also influences whether life-saving pharmacological treatments, including renin–angiotensin system blockers and mineralocorticoid receptor antagonists (MRAs), can be initiated and continued in patients with heart failure and it determines eligibility for transplantation and mechanical circulatory support8,9,10,11,12,13,14,15,16,17,18. It is therefore important to understand the effect that new therapies for heart failure have on kidney function; an aspiration with any treatment for heart failure is to at least preserve and, ideally, improve kidney function.

Unfortunately, few trials in patients with heart failure have been large enough and long enough to accrue a sufficient number of ‘hard’ kidney endpoints to allow a statistically robust evaluation of these outcomes using conventional statistical approaches, for example, time-to-first-occurrence of death, end-stage kidney disease (ESKD) or a large decline in estimated glomerular filtration rate (eGFR)19,20,21,22,23. The rate of decline over time (slope) in eGFR has been used as an alternative means of evaluating the effect of treatment on kidney function24,25,26; however, while statistically more powerful, this measure does not incorporate death or initiation of renal replacement therapy, and the clinical relevance of small changes in eGFR slope has been questioned.

The use of hierarchical composite endpoints analyzed with win statistics may solve some of these problems by integrating death, relatively infrequent major kidney events (for example, ESKD), the occurrence of large changes in eGFR that are somewhat more frequent, and changes in the eGFR slope, with each of these components ordered in a hierarchy reflecting their clinical importance27,28,29. The hierarchical composite outcome created by this approach consists of components, all of which reflect the progression of kidney disease, and this endpoint is both clinically relevant and statistically powerful30.

In this post hoc study, we evaluated the effects of dapagliflozin on kidney function in patients with heart failure and reduced ejection fraction, and heart failure and mildly reduced or preserved ejection fraction31,32, using a hierarchical composite kidney outcome, analyzed using win statistics.

Results

Of the 11,004 participants included in the Dapagliflozin and Prevention of Adverse Outcomes in Heart Failure (DAPA-HF) and Dapagliflozin Evaluation to Improve the Lives of Patients with Preserved Ejection Fraction Heart Failure (DELIVER) trials, 4,742 were enrolled in DAPA-HF and 6,262 in DELIVER. Participants were assigned equally to dapagliflozin (n = 5,503) or placebo (n = 5,501).

Participants

The participant characteristics according to the randomized treatment groups were well-balanced at baseline (Table 1). In the pooled dataset, there were 1,111 events of the composite of all-cause mortality, ESKD or an eGFR <15 ml min−1 1.73 m2, or a decline in eGFR of ≥40% in the dapagliflozin group and 1,151 events in the placebo group; in the DAPA-HF trial, there were 458 in the dapagliflozin group and 509 in the placebo group; in the DELIVER trial, there were 653 in the dapagliflozin group and 642 in the placebo group (Table 2). The effects of dapagliflozin on conventional composite outcomes, analyzed as the time-to-first event, are shown in Table 2. In the pooled dataset, the total eGFR slope in the dapagliflozin group was significantly less steep than in the placebo group (−1.77 ± 0.07 (mean ± s.e.) versus −2.28 ± 0.07 ml min−1 1.73 m2 per year, P < 0.001) (Table 2 and Extended Data Fig. 1). Similarly, in DAPA-HF and DELIVER separately, the total eGFR slope in the dapagliflozin group was significantly less steep than in the placebo group (DAPA-HF, −2.76 ± 0.11 (mean ± s.e.) versus −3.22 ± 0.11 ml min−1 1.73 m2 per year, P < 0.001; DELIVER, −1.03 ± 0.08 (mean ± s.e.) versus −1.56 ± 0.08 ml min−1 1.73 m2 per year, P = 0.004).

Table 1 Participant characteristics in the pooled DAPA-HF and DELIVER dataset
Table 2 Outcomes analyzed using conventional statistical approaches

Win ratio and proportion of wins and losses in each tier

The effects of dapagliflozin on the hierarchical composite kidney outcome, as estimated using win statistics, are summarized in Fig. 1. The hierarchical composite kidney outcome included the following tiers: (1) all-cause mortality; (2) ESKD or eGFR <15 ml min−1 1.73 m2; (3) a decline in eGFR of ≥57%; (4) a decline in eGFR of ≥50%; (5) a decline in eGFR of ≥40%; and (6) participant-level eGFR slope. The win ratio was 1.10 (95% confidence interval (CI) = 1.06–1.15) in the pooled dataset, 1.08 (95% CI = 1.01–1.16) in the DAPA-HF dataset and 1.12 (95% CI = 1.05–1.18) in the DELIVER dataset, demonstrating that dapagliflozin was superior to placebo with regard to the hierarchical composite kidney outcome in all three analyses. The eGFR slope accounted for most wins and losses, and incorporation of the participant-level eGFR slope in this model eliminated the ties that would otherwise have occurred (in 63.4% of pairs in the pooled DAPA-HF and DELIVER dataset). The net benefit was 4.8% (95% CI = 2.7–7.0%) in the pooled dataset, 4.0% (95% CI = 0.7–7.3%) in the DAPA-HF dataset and 5.5% (95% CI = 2.6–8.4%) in the DELIVER dataset.
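As a rough consistency check (not part of the original analysis), the net benefit implied by a given win ratio can be worked out directly. The sketch below assumes an unstratified comparison in which the proportions of wins, losses and ties sum to 1; the function name is illustrative. With no ties, the net benefit reduces to (win ratio − 1)/(win ratio + 1).

```python
def net_benefit_from_win_ratio(win_ratio, tie_proportion=0.0):
    """Net benefit (PW - PL) implied by a win ratio (PW / PL) and a tie proportion PT.

    Assumes PW + PL + PT = 1, so PL = (1 - PT) / (1 + WR) and PW = WR * PL.
    Stratification, as used in the published analysis, is ignored here.
    """
    if win_ratio <= 0:
        raise ValueError("win_ratio must be positive")
    if not 0.0 <= tie_proportion < 1.0:
        raise ValueError("tie_proportion must be in [0, 1)")
    p_loss = (1.0 - tie_proportion) / (1.0 + win_ratio)
    p_win = win_ratio * p_loss
    return p_win - p_loss


# Win ratios reported for the main model (no ties because tier 6 is continuous).
for label, wr in [("pooled", 1.10), ("DAPA-HF", 1.08), ("DELIVER", 1.12)]:
    print(f"{label}: {100 * net_benefit_from_win_ratio(wr):.1f}%")
```

Under these assumptions the implied values are approximately 4.8%, 3.8% and 5.7%, close to the reported 4.8%, 4.0% and 5.5%; small differences are expected because the published win ratios are rounded and the analyses were stratified.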

Fig. 1: Effect of dapagliflozin on the hierarchical composite kidney outcome.

Win statistics were two-sided. Models were stratified according to diabetes status (and according to trial in the pooled dataset). Adjustments were not made for multiple comparisons. The exact P values were 0.00001 in the pooled dataset and 0.0002 in the DELIVER dataset.

Sensitivity analyses

In sensitivity Model 1, which excluded the tier for a decline in eGFR of ≥40%, win ratios remained higher than 1.0 for participants in the pooled dataset, and in the DAPA-HF and DELIVER trials separately (Extended Data Fig. 2). In sensitivity Model 2, which excluded both the tier for a decline in eGFR of ≥40% and the eGFR slope, the lower bounds of the 95% CIs of the win ratio and win odds (the latter accounting for the ties that arise when the eGFR slope is excluded) were not higher than 1.0 in the DELIVER dataset (Extended Data Fig. 3). The win ratios obtained using sensitivity Model 2 were similar to the 1/hazard ratios (HRs) for the composite kidney endpoints estimated using conventional statistical approaches and evaluated with the similar composite of all-cause mortality, ESKD or eGFR <15 ml min−1 1.73 m2, or decline in eGFR of ≥50% (Table 2). Adding the eGFR slope back into sensitivity Model 2 increased the net benefit from 1.7% to 5.3% in the pooled dataset. In sensitivity Model 3, which excluded all-cause mortality, almost identical results to the main model were observed in the pooled dataset, and the DAPA-HF and DELIVER datasets separately (Extended Data Fig. 4).

Proportions of wins and losses over time

For all-cause mortality, differences in the proportion of wins and losses between treatments increased gradually over time in the pooled dataset, and in the DAPA-HF and DELIVER datasets separately (Fig. 2). In the three datasets, the proportion of losses with dapagliflozin for a decline in eGFR of ≥40% was larger than that of wins, but this difference narrowed over time. The proportions of wins and losses for ESKD or an eGFR <15 ml min−1 1.73 m2, and declines in eGFR of ≥57% and ≥50%, were small and differed little throughout the follow-up. For comparison, the effects of dapagliflozin versus placebo, plotted using the Kaplan–Meier method, are shown in Extended Data Fig. 5.
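The idea of tracking wins and losses over time can be illustrated for a single time-to-event tier. The following is a minimal sketch, not the published implementation: it assumes every participant is followed through the chosen horizon (so censoring is ignored), and the function and argument names are hypothetical.

```python
def win_loss_proportions(treated_times, control_times, horizon):
    """Proportion of treated-versus-control pairs won and lost on one tier by `horizon`.

    Each entry is the day of the event, or None if no event was observed.
    Assumes, for illustration only, complete follow-up through `horizon`.
    """
    def truncate(day):
        # Events after the horizon are treated as not yet having occurred.
        return day if day is not None and day <= horizon else None

    wins = losses = 0
    for t_raw in treated_times:
        t = truncate(t_raw)
        for c_raw in control_times:
            c = truncate(c_raw)
            if t is None and c is None:
                continue  # neither had the event: tie on this tier
            if c is not None and (t is None or t > c):
                wins += 1  # control had the event first (or only control had it)
            elif t is not None and (c is None or c > t):
                losses += 1  # treated had the event first (or only treated had it)
    total = len(treated_times) * len(control_times)
    return wins / total, losses / total


# Evaluate every 10 days up to 720 days, as in Fig. 2 (toy data, not trial data).
treated = [None, 100, None]
control = [50, None, 300]
curve = [(day, *win_loss_proportions(treated, control, day)) for day in range(10, 721, 10)]
```

Each element of `curve` holds the day and the cumulative win and loss proportions at that day, which is the quantity plotted against time.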

Fig. 2: Proportion of wins and losses over time.

Each figure was plotted every 10 days for up to 720 days.

Win ratio and proportions of wins and losses in the subgroups

Win ratios, and the proportion of wins and losses, in the dapagliflozin groups according to a history of type 2 diabetes (T2D) and eGFR category (<60 versus ≥60 ml min−1 1.73 m2) are shown in Fig. 3. The treatment effect estimate from the win ratio analysis was consistent across these subgroups, that is, there were no apparent differences in the estimates.

Fig. 3: Effect of randomized treatments on the hierarchical composite kidney outcome according to selected subgroups.

Models were stratified according to diabetes status (and according to the trial in the pooled dataset). The squares indicate the win ratios and the bars indicate the upper and lower boundaries of the 95% CI.

Power analysis

When using a hierarchical composite endpoint, sample size requirements were smaller than with the time-to-first composite endpoint evaluated using the Cox proportional hazards model (Extended Data Fig. 6).

Discussion

These post hoc analyses show how win statistics can be used to demonstrate the benefit of a treatment for heart failure (in this case, dapagliflozin) on kidney function in patients with both heart failure and reduced ejection fraction and heart failure and mildly reduced or preserved ejection fraction. It is generally difficult to demonstrate the potential kidney benefits of cardiovascular drugs using a conventional renal endpoint because of the small number of events in an ‘unenriched’ population (for example, without albuminuria) during a relatively short-term follow-up. In such a setting, the hierarchical composite endpoint examined in the present study provides greater statistical power and may offer the opportunity to demonstrate both cardiovascular and kidney benefits in the same population in the same trial. In addition to the summary of win statistics usually shown in analyses of this type, we also presented the proportion of wins and losses over time, similar to the depiction of event rates over time provided using traditional statistical methods.

Although superficially similar, the win statistics approach used in this study differs substantially from time-to-first-event analysis for a composite endpoint. The most obvious difference is that events are analyzed according to a hierarchy27,28. All-cause mortality was the most important event in the composite hierarchical outcome and was tested as the first tier in the hierarchy. Unlike time-to-first-event analysis, the win statistics approach includes all deaths, including those occurring after a worsening kidney disease event. With the win statistics approach, a hierarchy of worsening kidney disease events was also created, reflecting their clinical importance, for example, the development of ESKD or an eGFR <15 ml min−1 1.73 m2, and large decreases in eGFR. As a further refinement, it is also possible to extend the hierarchy to include different proportional declines in eGFR; in the present analysis, we incorporated declines in eGFR of ≥57%, ≥50% and ≥40%. An additional advantage of win statistics is that the hierarchical composite outcome can logically incorporate continuous variables such as the eGFR slope27,28,29. Because the statistical power for conventional composite kidney outcomes is often insufficient when analyzing events such as those discussed above (because of their low incidence rate in some populations), analysis of the eGFR slope has been suggested as an alternative19,20,21,22,24. However, the eGFR slope is evaluated as a single ‘stand-alone’ outcome; interpreting it alongside other, more important kidney endpoints may not be easy. By contrast, the win statistics approach provides an outcome that integrates all relevant outcomes and all patients contribute to the analysis. One issue with the eGFR slope, either as a stand-alone endpoint or part of the win ratio approach, is that some drugs may cause an initial decline in eGFR33,34,35. The chronic slope after this initial decline may more accurately reflect the chronic effect of these drugs, but may overestimate treatment benefit36,37; thus, more appropriately, we calculated the eGFR slope over the whole treatment period using a piece-wise, linear, two-slope model accounting for the effects of the acute and chronic phases38.

A closer look at the proportion of wins and losses revealed several findings. Despite the less steep eGFR slope with dapagliflozin compared to placebo, the proportion of wins with dapagliflozin (over placebo) for tier 5 of the hierarchy (that is, a decline in eGFR of ≥40%) was lower than the proportion of losses. The probable explanation for this is that DAPA-HF and DELIVER did not have an active run-in period and the initial drop in eGFR in some patients randomized to dapagliflozin led to a decline in eGFR counting as an ‘event’31,32,39,40,41,42. On examining the proportion of wins and losses over time, it can also be seen that the difference in the tier representing a decline in eGFR of ≥40%, which may reflect the initial drop with dapagliflozin early after randomization, became progressively smaller over time in DAPA-HF and DELIVER, supporting this explanation and identifying the longer-term benefit of dapagliflozin on the kidney. Indeed, the kidney benefits of dapagliflozin in both trials became more apparent over time, observed as the changing proportion of wins and losses, which is analogous to the divergence of Kaplan–Meier plots using conventional analysis.

Win statistics are a relatively new approach to analyzing trial data and may still be unfamiliar to some physicians43,44. However, their use is increasing rapidly, particularly in cardiovascular medicine; several recent trials had primary endpoints analyzed using win statistics45,46,47,48,49,50,51,52. At least one treatment has received regulatory approval based on a trial of this type45. There is also ongoing debate about which components to include in a hierarchical composite outcome, and these should be discussed between the relevant stakeholders, including patients, clinical trialists, and regulatory and reimbursement agencies. Although all-cause mortality is usually included as the first tier in such analyses, it could be argued that this is not a kidney-specific outcome30. To address this concern, we added a sensitivity analysis excluding all-cause mortality from the hierarchy, which showed essentially the same findings. In addition, treatments may not affect each component of a composite outcome equally, although this is also an issue with composite endpoints evaluated using conventional statistics. Therefore, it is important to examine the proportion of wins or losses for each component of the composite to interpret the overall result.

This study has several limitations. eGFR was obtained at different scheduled visits in the two trials, and the incidence of the renal endpoints defined according to eGFR may have been affected by the frequency of the eGFR measurements. The hierarchical composite renal outcome used in this study was created post hoc. However, the selected hierarchy reflected the natural progression of kidney disease. It was validated in multiple sensitivity models and by comparison with the analysis of a conventional composite outcome analyzed using a standard method. The thresholds for declines in eGFR were also decided post hoc; thus, ‘sustained’ eGFR decline could not be confirmed using repeat measurement. The eGFR slope may also have been affected by the number of scheduled visits, visit intervals and the follow-up period in each trial.

In conclusion, it was possible to create a comprehensive, multicomponent, hierarchical composite kidney endpoint that is both clinically relevant and statistically powerful when analyzed using win statistics. With this approach, we confirmed the benefits of dapagliflozin on kidney function in patients with heart failure. This benefit was observed regardless of LVEF, baseline eGFR and T2D status. This approach can improve the power and precision around the estimate of effects on kidney outcomes and should be considered in future heart failure trials.

Methods

Study participants

In this post hoc study, we analyzed the DAPA-HF and DELIVER trials31,32. These were randomized, double-blind, placebo-controlled trials, and the trial designs and primary results have been published elsewhere31,32,39,40,41,42.

Briefly, DAPA-HF and DELIVER compared dapagliflozin to placebo in patients with a diagnosis of heart failure. Both trials enrolled patients in New York Heart Association (NYHA) functional class II–IV with elevated natriuretic peptide levels. The main difference between the two trials was that patients with an LVEF of ≤40% were randomized in the DAPA-HF trial and those with an LVEF >40% were randomized in the DELIVER trial. (Participants in DELIVER were also required to have evidence of structural heart disease, defined as either left atrial enlargement or left ventricular hypertrophy.) Key exclusion criteria included an eGFR <30 ml min−1 1.73 m2 in DAPA-HF and an eGFR <25 ml min−1 1.73 m2 in DELIVER. In both trials, participants were randomized to receive dapagliflozin 10 mg once daily or a matching placebo. The median follow-up period was 1.5 years in the DAPA-HF trial and 2.3 years in the DELIVER trial.

Both trials were approved by the ethics committees at each investigative site and written informed consent was obtained from each participant.

Study outcomes

The primary outcome in both DAPA-HF and DELIVER was a composite of death from cardiovascular causes or worsening heart failure. In both trials, all-cause mortality was included as a secondary outcome, and a composite kidney outcome was included as a secondary outcome or prespecified exploratory outcome. All death events were adjudicated. ESKD was prespecified in DAPA-HF as a sustained eGFR <15 ml min−1 1.73 m2, chronic dialysis treatment or kidney transplantation, and in DELIVER as ESKD identified through adverse event reporting or a sustained eGFR <15 ml min−1 1.73 m2. The endpoints driven by the eGFR were derived from central laboratory results.

In this post hoc analysis, we examined a hierarchical composite outcome including the following components: all-cause mortality (tier 1); ESKD or eGFR <15 ml min−1 1.73 m2 (tier 2); a decline in eGFR of ≥57% (tier 3); a decline in eGFR of ≥50% (tier 4); a decline in eGFR of ≥40% (tier 5); and participant-level eGFR slope (tier 6) (Extended Data Table 1). All-cause mortality was used for tier 1 in the hierarchy because of its ultimate clinical importance and its competing risk for the remaining outcomes. Considering the outcomes proposed by the international consensus definition of clinical trial outcomes for kidney disease, ESKD (or equivalent status) and decline in eGFR with different cutoffs were applied as tiers 2–5 (ref. 53). The eGFR slope was applied as tier 6 because this has also been used for regulatory approval of treatment in some chronic kidney disease settings24,25,26. To address concerns regarding the lack of short-term verification of a change in eGFR due to the long interval between the scheduled study visits (and because some cutoffs could not be verified as they were not prespecified), declines in eGFR not requiring evidence that they were sustained were also evaluated. That is, change in eGFR (tiers 2–5) was evaluated as the time to the first meeting of the eGFR criterion based on the scheduled study visits, with the last laboratory assessment date used for censoring. eGFR was scheduled to be obtained at randomization, 14 days, 2 months, 4 months, 8 months, 12 months, 16 months, 20 months and 24 months in the DAPA-HF trial; and at randomization, 1 month, 4 months, 12 months, 24 months and 36 months in the DELIVER trial. The eGFR at randomization was used as the baseline eGFR to evaluate the change in eGFR; participants without baseline eGFR were excluded, that is, two participants in the DAPA-HF trial and one participant in the DELIVER trial. In this study, the original definition of ESKD in each study was used, alongside the aforementioned evaluation of change in eGFR.
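The time to first meeting an eGFR decline criterion at a scheduled visit, without confirmation by a repeat measurement, can be sketched as follows. This is an illustrative reading of the definition above rather than the trial code; the function name, the `(day, eGFR)` data layout and the example values are assumptions.

```python
def first_decline_day(baseline_egfr, visits, percent_threshold):
    """Day of the first scheduled visit at which eGFR has fallen by at least
    `percent_threshold` percent from baseline, or None if the criterion is never met.

    `visits` is a list of (day, egfr) tuples. No repeat measurement is required,
    so a single qualifying value counts as an event.
    """
    if baseline_egfr <= 0:
        raise ValueError("baseline_egfr must be positive")
    for day, egfr in sorted(visits):
        # Equivalent to (baseline - egfr) / baseline * 100 >= threshold,
        # written without division to avoid rounding at the boundary.
        if 100.0 * (baseline_egfr - egfr) >= percent_threshold * baseline_egfr:
            return day
    return None


# Hypothetical participant with a baseline eGFR of 60 ml/min/1.73 m2.
visits = [(14, 50), (60, 35), (120, 28), (240, 24)]
event_days = {pct: first_decline_day(60, visits, pct) for pct in (57, 50, 40)}
```

For this hypothetical participant the ≥40%, ≥50% and ≥57% criteria are first met on days 60, 120 and 240, respectively, so each threshold contributes its own event time to the corresponding tier.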

As sensitivity analyses, we analyzed three additional models: sensitivity Model 1, excluding the component of a decline in eGFR of ≥40%, to evaluate outcomes less affected by the initial dip in eGFR due to the direct pharmacological action of dapagliflozin; sensitivity Model 2, excluding both the component of a decline in eGFR of ≥40% and the eGFR slope, to address additional concerns about the clinical relevance of the eGFR slope; and sensitivity Model 3, excluding all-cause mortality, to provide an outcome more specific to kidney disease.

Statistical analyses

To evaluate the effect of dapagliflozin across the range of LVEF, data were analyzed for the pooled dataset of DAPA-HF and DELIVER, and for each trial dataset separately.

Baseline characteristics were summarized according to the randomized group as the mean with s.d., or the median with the interquartile range for continuous variables and count with percentages for categorical variables. Continuous variables were compared using a t-test or Wilcoxon rank-sum test; categorical variables were compared using a chi-squared test. To determine the slope of change in eGFR for each individual patient over time according to the assigned treatment, two-slope, mixed-effect models accounting for the acute and chronic phases were applied using the eGFR data obtained at all scheduled visits30. The acute phase was defined as the period up to the first postrandomization visit (14 days in DAPA-HF and 1 month in DELIVER) when the acute treatment effect on the eGFR was considered fully present. These models were adjusted for baseline eGFR values, randomized treatment, visit time, diabetes status, spline variable corresponding to the days since the acute phase, the interaction of treatment and visit time, and the interaction of treatment and spline, without an intercept term. The distributions of the individual eGFR slopes were drawn using violin plots.
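The two-slope idea can be illustrated with a much simpler stand-in for the mixed-effect model: an ordinary least-squares fit for a single participant, with a linear term in time and a spline term that switches on after the acute phase. This sketch is not the model used in the study; the function names, the use of change from baseline as the response, and the definition of the total slope as the predicted change over follow-up divided by its duration are all assumptions made for illustration.

```python
def fit_two_slope(days, egfr_change, knot_day):
    """Least-squares fit of change = b1 * t + b2 * max(0, t - knot_day), no intercept.

    Returns (acute_slope, chronic_slope) per day, where the chronic slope is b1 + b2.
    """
    x1 = [float(d) for d in days]
    x2 = [max(0.0, d - knot_day) for d in days]  # spline term: days since the acute phase
    s11 = sum(a * a for a in x1)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s22 = sum(b * b for b in x2)
    r1 = sum(a * y for a, y in zip(x1, egfr_change))
    r2 = sum(b * y for b, y in zip(x2, egfr_change))
    det = s11 * s22 - s12 * s12
    if abs(det) < 1e-12:
        raise ValueError("need visits both during and after the acute phase")
    b1 = (r1 * s22 - r2 * s12) / det
    b2 = (s11 * r2 - s12 * r1) / det
    return b1, b1 + b2


def total_slope_per_year(acute_slope, chronic_slope, knot_day, follow_up_days):
    """Average slope over the whole follow-up, combining acute and chronic phases."""
    change = acute_slope * knot_day + chronic_slope * (follow_up_days - knot_day)
    return change / follow_up_days * 365.25


# Synthetic participant: acute phase ends at day 14 (as in DAPA-HF).
days = [14, 60, 120, 240, 360]
change = [-0.1 * min(d, 14) - 0.005 * max(0, d - 14) for d in days]
acute, chronic = fit_two_slope(days, change, knot_day=14)
```

On these synthetic data the fit recovers the acute and chronic slopes used to generate them. A mixed-effect model additionally pools information across participants and adjusts for the covariates listed above.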

The unmatched win statistics method, in which every patient in the dapagliflozin group was paired and compared with every patient in the placebo group, was used27; the number of pairs created and compared was therefore the product of the number of individuals in the dapagliflozin group and the number in the placebo group. Comparisons were made in ascending order of event tier (from 1 to 6); once a pair was settled (as a win or a loss) at a given tier, subsequent tiers were not assessed; if the pair remained unsettled after the last tier, it was considered a tie (Extended Data Fig. 7). In tiers 1–5, the time to first event was compared during a fixed follow-up period; censoring earlier than the defined fixed follow-up period was considered censoring at the fixed follow-up period to address the effect of censoring distributions on win statistics results27,54,55,56,57,58. Fixed follow-up periods were defined as 720 days in DAPA-HF and 1,080 days in DELIVER, considering the scheduled visits and follow-up period. In tier 6, the participant-level eGFR slope, which was calculated using data within these fixed follow-up periods, was compared as a continuous variable in each pair (that is, the patient with a shallower eGFR slope is the winner); thus, in the model including the eGFR slope, tied pairs did not exist. The proportions of win pairs (PW), loss pairs (PL) and tied pairs (PT) for participants assigned to dapagliflozin were obtained; PW is the number of win pairs divided by the total number of pairs nD × nP, where nD and nP are the sample sizes in the dapagliflozin and placebo groups, and similarly for PL and PT. The method outlined by Pocock et al.27 and the corresponding variances based on the U-statistic-based method by Dong et al.59 were used to compute the win ratio. Because the win ratio ignores tied pairs, we also calculated the ‘win odds’ for sensitivity Model 2, which is a modification of the win ratio accounting for ties60,61. Net benefit was also reported, which is the difference between the proportions of win and loss pairs58. We calculated four win statistics (win ratio, net benefit, win odds and win probability) defined as: win ratio, PW/PL; net benefit, PW − PL; win odds, (PW + 0.5 PT)/(PL + 0.5 PT); and win probability, PW + 0.5 PT. Thus, in the main model, sensitivity Model 1 and sensitivity Model 3, where tied pairs do not exist, the win ratio is identical to the win odds. A win ratio represents the ratio of the proportion of win pairs to the proportion of loss pairs; a win ratio greater than 1 with a lower 95% confidence limit greater than 1 indicates that dapagliflozin is better than placebo. Because the win or loss proportion depends on the duration of follow-up and the censoring distribution, we plotted these trends over time every 10 days55,62. This plot was drawn only for tiers 1–5 because the eGFR slope was calculated based on data across the fixed follow-up period, meaning it was not possible to report an eGFR slope at a specific time point and illustrate the proportion of the wins or losses over time for this component of the composite outcome.
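The pairwise hierarchical comparison and the four win statistics defined above can be sketched in a few lines. This is a simplified illustration, not the analysis code: it assumes every participant is followed for the full fixed follow-up period (so censoring is not handled), it omits stratification and variance estimation, and the `Participant` class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Participant:
    # Tiers 1-5: day of the event within the fixed follow-up, or None if no event.
    event_days: Sequence[Optional[float]]
    # Tier 6: eGFR slope (less negative is better); None omits this tier.
    egfr_slope: Optional[float] = None


def compare_pair(treated, control):
    """Return +1 if the treated participant wins, -1 if they lose, 0 for a tie."""
    for t, c in zip(treated.event_days, control.event_days):
        if t is None and c is None:
            continue  # unsettled at this tier, move to the next one
        if t is None:
            return 1
        if c is None:
            return -1
        if t != c:
            return 1 if t > c else -1  # the later event wins
    if treated.egfr_slope is not None and control.egfr_slope is not None:
        if treated.egfr_slope != control.egfr_slope:
            return 1 if treated.egfr_slope > control.egfr_slope else -1
    return 0


def win_statistics(treated_group, control_group):
    """Compare all nD x nP pairs and return the four win statistics."""
    results = [compare_pair(t, c) for t in treated_group for c in control_group]
    total = len(results)
    pw = results.count(1) / total
    pl = results.count(-1) / total
    pt = results.count(0) / total
    denom = pl + 0.5 * pt
    return {
        "win_ratio": pw / pl if pl > 0 else float("inf"),
        "net_benefit": pw - pl,
        "win_odds": (pw + 0.5 * pt) / denom if denom > 0 else float("inf"),
        "win_probability": pw + 0.5 * pt,
    }


# Toy example with two participants per group (not trial data).
none5 = [None] * 5
treated = [Participant(none5, -1.0), Participant([None, None, None, None, 100], -2.0)]
control = [Participant([300, None, None, None, None], -3.0), Participant(none5, -1.5)]
stats = win_statistics(treated, control)
```

In this toy example three of the four pairs are wins and one is a loss, with no ties, so the win ratio and win odds coincide, as noted above for models that include the eGFR slope.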

We also evaluated the components of the hierarchical composite kidney outcome up to the aforementioned fixed follow-up period using conventional statistical approaches to compare these results with those from the win statistics. Cox proportional hazards models were used to compute the HRs (to aid direct comparison, these are presented as 1/HR) and Kaplan–Meier curves were plotted.

Consistent with the prespecified stratification variables in each respective trial, win statistics and Cox proportional hazards models were stratified according to diabetes status (and, in the pooled dataset, according to trial)31,32.

The sample size requirements and statistical power of the hierarchical composite endpoint (main model) were compared with those of the time-to-first composite endpoint (all-cause mortality, ESKD or eGFR <15 ml min−1 1.73 m2, or decline in eGFR of ≥40%) and of the eGFR slope, using bootstrap resampling of the pooled dataset to detect the observed treatment effect for each endpoint. The resampling procedure was performed with 1,000 iterations at each sample size (n = 200, n = 500 and then increments of 500 up to 3,000).
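A bootstrap power estimate of this kind can be sketched generically: resample each arm with replacement at a given size, apply the chosen test, and record how often it is significant. The sketch below is an assumption-laden illustration, not the published procedure; in particular, `is_significant` is a hypothetical user-supplied function standing in for the win ratio, Cox or slope analysis, and treating n as the size of each arm is an assumption.

```python
import random


def bootstrap_power(treated, control, n_per_arm, is_significant,
                    iterations=1000, seed=0):
    """Fraction of bootstrap replicates in which `is_significant` returns True.

    `treated` and `control` are lists of participant records; each replicate
    draws `n_per_arm` records with replacement from each arm.
    """
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    hits = 0
    for _ in range(iterations):
        t_sample = rng.choices(treated, k=n_per_arm)
        c_sample = rng.choices(control, k=n_per_arm)
        if is_significant(t_sample, c_sample):
            hits += 1
    return hits / iterations


# Sample sizes described above: 200, 500, then steps of 500 up to 3,000.
sample_sizes = [200] + list(range(500, 3001, 500))
```

Calling `bootstrap_power` once per entry in `sample_sizes` for each endpoint gives power curves that can be compared across endpoints.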

All analyses were conducted using Stata v.17.0 and R v.4.2.2.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.