Causal inference using observational intensive care unit data: a scoping review and recommendations for future practice

Smit, J. M.; Krijthe, J. H.; Kant, W. M. R.; Labrecque, J. A.; Komorowski, M.; Gommers, D. A. M. P. J.; van Bommel, J.; Reinders, M. J. T.; van Genderen, M. E.

doi:10.1038/s41746-023-00961-1

Download PDF

Review Article
Open access
Published: 27 November 2023

Causal inference using observational intensive care unit data: a scoping review and recommendations for future practice

npj Digital Medicine volume 6, Article number: 221 (2023) Cite this article

3055 Accesses
5 Altmetric
Metrics details

Subjects

Abstract

This scoping review focuses on the essential role of models for causal inference in shaping actionable artificial intelligence (AI) designed to aid clinicians in decision-making. The objective was to identify and evaluate the reporting quality of studies introducing models for causal inference in intensive care units (ICUs), and to provide recommendations to improve the future landscape of research practices in this domain. To achieve this, we searched various databases including Embase, MEDLINE ALL, Web of Science Core Collection, Google Scholar, medRxiv, bioRxiv, arXiv, and the ACM Digital Library. Studies involving models for causal inference addressing time-varying treatments in the adult ICU were reviewed. Data extraction encompassed the study settings and methodologies applied. Furthermore, we assessed reporting quality of target trial components (i.e., eligibility criteria, treatment strategies, follow-up period, outcome, and analysis plan) and main causal assumptions (i.e., conditional exchangeability, positivity, and consistency). Among the 2184 titles screened, 79 studies met the inclusion criteria. The methodologies used were G methods (61%) and reinforcement learning methods (39%). Studies considered both static (51%) and dynamic treatment regimes (49%). Only 30 (38%) of the studies reported all five target trial components, and only seven (9%) studies mentioned all three causal assumptions. To achieve actionable AI in the ICU, we advocate careful consideration of the causal question of interest, describing this research question as a target trial emulation, usage of appropriate causal inference methods, and acknowledgement (and examination of potential violations of) the causal assumptions.

Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension

Article Open access 09 September 2020

Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension

Article Open access 09 September 2020

Why do probabilistic clinical models fail to transport between sites

Article Open access 01 March 2024

Introduction

Many treatment choices in the intensive care unit (ICU) are made quickly, based on patient characteristics that are changing and monitored in real-time. Given this dynamic and data-rich environment, the ICU is pre-eminently a place where artificial intelligence (AI) holds the promise to aid clinical decision making^1,2,3. So far, however, most AI models developed for the ICU remain within the prototyping phase^4,5. One explanation for this may be that most models in the ICU are built for the task of prediction, i.e., mapping input data to (future) patient outcomes⁶. However, even a very accurate prediction of, for instance, sepsis⁷, does not tell a physician what to do in order to treat or prevent it. In other words, predictive AI is seldom actionable. More meaningful decision support in the ICU could be provided by models that assists clinicians in what to do (i.e., ‘actionable AI’^8,9). To develop actionable AI, one needs to take into account cause and effect. Causal inference (CI) represents the task of estimating causal effects by comparing patient outcomes under multiple counterfactual treatments^6,10. The most widely used method for CI is a randomized controlled trial (RCT). Through randomization (coupled with full compliance), the difference in outcome between treatment groups can be interpreted as a causal treatment effect. Because carrying out RCTs may be infeasible due to cost, time, and ethical constraints, observational studies are sometimes the only alternative. CI using observational data can be seen as an attempt to emulate the RCT that would have answered the question of interest (i.e., the ‘target trial’)¹¹. With such an approach, however, treatment is not assigned randomly and extra adjustment for confounding is required. In the simple situation of a time-fixed (or ‘point’) treatment (Fig. 1, Box 1)^12,13, this can be achieved by ‘standard methods’ like regression or propensity-score (PS) analyses¹⁴. However, ICU physicians are typically confronted with treatment decisions which occur at multiple time-points, i.e., time-varying treatments (Fig. 1, Box 1)^12,13. Estimating the effect of time-varying treatments using observational data is often complicated by treatment-confounder feedback¹⁵, leading to ‘treatment-affected time-varying confounding’ (TTC, Box 1)^13,16,17. Usage of standard methods in the presence of TTC leads to bias^18,19. Inverse-probability-of-treatment weighting (IPTW), the parametric G formula and G estimation (collectively known as ‘G methods’, Box 1) were introduced by Robins²⁰ to estimate causal effects in the presence of TTC, making them well-suited for CI in the ICU. Time-fixed treatments can be either unconditioned (e.g., ‘treat all patients at admission’) or conditioned on one or more covariates to make these more individualized (Fig. 1). Likewise, time-varying treatments can be either unconditioned, or tailored to more specific patient groups based on one or more (time-varying) covariates. We refer to these as static treatment regimes (STRs) and dynamic treatment regimes (DTRs), respectively (Fig. 1, Box 1)¹². The latter type is most common in the ICU, as treatment choices are typically dynamically re-evaluated based on the evolving patient state. For example, rather than deciding upon admission to administer vasopressors daily, an ICU physician reconsiders giving this treatment throughout the ICU stay based on the patient’s response. Hence, the practical question of interest often requires a comparison of DTRs. Reinforcement learning (RL)²¹ is another class of methods which, like G methods, can be used to estimate optimal DTRs and have been increasingly applied to ICU data²². Partly due to the different language used to describe similar concepts (see Supplementary Table 1), studies applying G methods and RL may appear as completely separate disciplines. However, they show great similarities and can be used to build actionable AI models. The aim of this scoping review is to (1) outline how CI research is conducted concerning time-varying treatments in the ICU, (2) discuss quality of reporting, and (3) give recommendations to improve future research practice.

**Fig. 1: Taxonomy of treatment types.**

Box 1 Causal inference glossary

Time-fixed treatment: a treatment that only occurs at the start of follow-up, or does not change over time.
Time-varying treatment: any treatment that is not time-fixed. Time-varying treatments can be sub-divided in static and dynamic treatment regimes.
Static treatment regime (STR): a treatment regime that is not tailored to evolving patient characteristics.
Dynamic treatment regime (DTR): a treatment regime where the treatment decisions depend on changing patient characteristics and/or treatment history.
Treatment-affected time-varying confounding (TTC): time-varying confounding in which one or more time-varying confounders are affected by previous treatment.
G methods: a class of methods proposed to adjust for TTC in the estimation of time-varying-treatment effects, including inverse-probability-of-treatment weighting, the parametric G formula, and G estimation.
Reinforcement learning (RL): a class of methods that deals with the problem of sequential decision making which returns an optimal treatment regime (or ‘policy’).
Off-policy evaluation (OPE): the task of estimating the value of a certain policy, using data from settings where patients were treated according to a different policy (i.e., observational data).
Causal assumptions:
- • Conditional exchangeability: Exchangeability means equal risks in treated and untreated groups if patients in the untreated group were treated, and vice versa. Observational data often violates exchangeability due to confounding and/or selection bias, and, therefore, causal inference requires the assumption that all confounders are measured and adjusted for, to achieve exchangeability conditional on these confounders.
- • Positivity: To estimate a treatment’s causal effect, one must compare treated and untreated patient data. This requires having both treated and untreated patients in all subgroups (or ‘strata’) defined by different confounder values. In other words, treatment and non-treatment should occur in all confounder strata with some positive probability.
- • Consistency: Consistency assumes that the outcome for a given treatment will be the same, irrespective of the way treatments are ‘assigned’.

Results

Through our search, we discovered 2184 articles that were unique. During the independent screening of titles and abstracts, we achieved a high agreement rate of 93%. Following the resolution of any disagreements, we excluded 1981 studies. Moving forward, we conducted independent full-text screening for the remaining 203 articles, resulting in an agreement rate of 86%. Once a consensus on eligibility was reached, we included 79 studies in the review (Fig. 2). The articles were published between 2005 and 2023, with a steadily growing number of articles per year starting around 2010 (Supplementary Fig. 1a). A reference list of all included studies and the list with collected items per study can be found in the Supplementary References and supplementary Table 2, respectively. Most studies applied G methods (n = 48, 61%), of which 42 (53%) used IPTW, five (6%) used the parametric G formula. One (1%) study²³ used targeted minimum loss-based estimation (TMLE), a method which combines elements of IPTW and the parametric G formula²⁴. 31 (39%) studies used RL methods (Table 1). The three most frequently studied treatment categories were anti-inflammatory drugs (n = 9, 11%), hospital-acquired complications (n = 8, 10%), and sedatives and analgesics (n = 8, 10%). Most studies (n = 46, 58%) considered mortality (at varying follow-up times) as the primary outcome. Thirty-six studies (46%) included data from at least two different ICUs. Studies that used RL generally included more patients than studies that used G methods, with a median of 9782 patients (IQR 5022–17,898) versus 1498 (IQR 606–3407) and relied more often on open-source ICU databases (74% vs 23%). In total, 34 (43%) of the studies used one or more open-source ICU database, among which the Medical Information Mart for Intensive Care (MIMIC)-III database²⁵ was the most frequently used (n = 24, 30%). In contrast to RL studies (which inherently deal with DTRs), only eight^{26,27,28,29,30,31,32,33} of the 48 studies (17%) that used G methods considered DTRs and seven of these were published in or after 2020 (Supplementary Fig. 1b).

**Fig. 2: Flowchart of study selection.**

Table 1 Characteristics of the included studies grouped by used causal inference method.

Full size table

Method-specific items

Among the studies that used IPTW (n = 43), eighteen applied stabilized weights, one applied weight truncation, and ten studies applied both weight stabilization and truncation. The remaining studies did not apply weight stabilization or truncation. Among studies that applied RL on real (i.e., not simulated) patient data (n = 26), fourteen studies used either an importance-sampling based³⁴, model-based^35,36, a doubly robust off-policy evaluation (OPE, Box 1) method³⁷, or a combination of these. Thirteen studies used the so-called ‘U-curve method’³⁸ (Box 1) and for nine of these, this was the only reported OPE method. Two studies did not report any evaluation of the optimized DTR (Supplementary Fig. 2).

Quality of reporting

A total of 1183 unique quality of reporting (QOR) items were assessed independently, encompassing the eight, eleven, and ten target trial subcomponents for the IPTW/TMLE, parametric G formula, and RL papers, respectively, along with six distinct items for the causal assumptions in all included studies. Initially, the reviewers agreed on 1012 items (86%), and any remaining disagreements were resolved through discussion until consensus was reached. As each subcomponent of the analysis plan component required for IPTW studies also applies to TMLE, we used the same subcomponents to judge the analysis plan component of the study that used TMLE²³.

Target trial components

Both the ‘eligibility criteria’ and ‘outcome’ components were reported in 78 (99%) of the studies (Fig. 3a). We scored the ’treatment strategies’, ‘follow-up period’ and ‘analysis plan’ components as partially or not reported in respectively 14 (18%), 16 (20%) and 33 (42%) of the studies. All five target trial components were fully reported in only 30 (38%) studies. The reporting of the target trial components grouped by used CI method are summarized in Supplementary Figs. 3–5 and tabulated for each individual study in Supplementary Tables 3–5.

**Fig. 3: Quality of Reporting summary plots.**

Causal assumptions

The conditional exchangeability assumption remained unmentioned in 25 (32%), was mentioned in 34 (43%), and an attempt to check for potential violations was reported in 20 studies (25%, Fig. 3b). Among the studies that reported a check for potential violations, eight studies^{31,33,39,40,41,42,43,44} performed a bias analysis. The positivity assumption remained unmentioned in 54 (68%), was mentioned in five (6%), and a check for potential violations was reported in 20 (25%) of the studies. The consistency assumption was mentioned in nine (11%) of the studies. All three assumptions were mentioned (or a check for potential violations was reported) in only seven (9%) studies (Supplementary Table 6). The reporting of assumptions grouped by CI method used are summarized in Supplementary Figs. 6–8 and individual results for all studies are tabulated in Supplementary Table 6. In general, the causal assumptions remained unmentioned more often in studies that applied RL, compared to those which applied G methods (Supplementary Figs. 6–8). All studies that reported a check for potential violations of the conditional exchangeability assumption also mentioned this assumption, whereas for the positivity assumption, eleven out of 20 studies that reported a check for potential violations did not explicitly mention positivity (Supplementary Table 6).

Adjusting for treatment-affected time-varying confounding

Seventeen studies (22%) estimated the treatment effect by adjusting for baseline confounding and by adjusting for baseline confounding and TTC. For most of these studies, the point estimates of the treatment effects varied substantially after adjusting for both baseline and TTC, moving the effect estimate towards or away from the null hypothesis, or even leading to opposite effect estimates (Fig. 4).

Discussion

Our scoping review of 79 published studies revealed a diverse range of treatments being investigated. Despite the fact that most treatments of interest in the ICU setting are DTRs, we observed a dominant emphasis on STRs among studies that used G methods. Many studies had inadequate reporting of the components of target trials, with the ‘treatment strategies’, ‘follow-up’ and ‘analysis plan’ components being incompletely reported most frequently. The causal assumptions were frequently not specified, especially in studies utilizing RL methods, indicating a potential lack of awareness in this research field of the importance of these assumptions. The upcoming ‘Prediction of Counterfactuals Guideline’ (PRECOG)⁴⁵ aims to provide guidance on the reporting of causal assumptions and model evaluation in CI studies using observational data, and we anticipate that its adoption may yield substantive enhancements in the quality of reporting in forthcoming CI studies utilizing observational ICU data. We decided not to employ the ROBINS-I tool⁴⁶ for assessing bias in observational studies in our evaluation. We made this choice for two main reasons. Firstly, utilizing ROBINS-I would require verifying specific causal assumptions, which, in turn, would demand expert knowledge of the treatment-outcome relationships studied in each of the included articles. Given the extensive scope of our review and the fact that many studies did not explicitly address these causal assumptions, we found this task to be beyond the intended scope of our review. Consequently, we opted to evaluate studies based on their acknowledgment of these assumptions and their reported attempts to validate them. Secondly, when conducting causal inference with observational data, the potential for bias arises not only from the absence of randomization but also from potential biases stemming from incorrect study design choices⁴⁷. Rather than assessing these design choices directly, we chose to evaluate studies based on the clarity and completeness of their reporting regarding these choices. To structure this evaluation, we used the target trial framework¹¹, which is widely recognized and used as a conceptual benchmark to describe study design choices in causal inference studies using observational data across various medical domains^{48,49,50,51,52}. G methods and RL methods are often perceived as separate disciplines, but show great similarities⁵³. For example, Q-learning⁵⁴ (an RL method, used by many of the included studies^{55,56,57,58,59}) is very similar -and under certain conditions even algebraically equivalent- to G estimation (a G method)⁶⁰. An important difference is that G methods are used for modeling both STRs and DTRs, while RL methods typically deal with DTRs. As both G methods and RL methods perform the same CI task (i.e., finding optimal treatment regimes), both rely on the same, strong causal assumptions, which should be acknowledged. While the consistency assumption is often plausible for treatments in the ICU, violations of the conditional exchangeability and positivity assumption are more likely and should be examined. Prior to examining violations of the causal assumptions, one needs a research question that is truly of interest in the ICU, a clear description of study design choices (e.g., using the target trial framework¹¹), and usage of a CI method that is appropriate for the type of studied treatment. The results of our scoping review have led to five recommendations that build on widely accepted views and concepts in the field of CI to improve future research and move towards actionable AI in the ICU (Box 2).

Box 2 Summary of recommendations for future research

Ask the right research question

When developing a model for causal inference, consider clinically relevant treatments. In the ICU, treatment decisions typically occur at multiple time point during admission (i.e., time-varying treatments) and often depend on the patient’s response to treatment (i.e., dynamic treatment regimes).
Describe the question as a target trial emulation

It is useful to imagine a randomized trial that would have answered the research question (i.e., the target trial). Even if performing the target trial is not feasible, describing its components using the target trial framework helps to identify flaws in the relevance of a research question and correctness of the analysis.
Use methods that suit the research question

Standard methods, like propensity-score analyses, are easy to implement and suffice for time-fixed treatments, but lead to biased estimates when used to adjust for treatment-affected time-varying confounding (TTC). Research questions concerning time-varying treatments are often complicated by TTC, and, therefore, require methods that can appropriately adjust for this.
Mind the conditional exchangeability assumption

Causal inference is not possible based on data only, and incorporation of expert knowledge is key to think about the causal structure between the treatment and outcome of interest. Representing this expert knowledge in causal diagrams is useful to visualize potential sources of bias. A bias analysis can be helpful to quantify how much bias unmeasured confounders could produce.
Mind the positivity assumption

This assumption is verifiable, but this is rather complex for time-varying treatments and violations are expected given the dynamic nature of the ICU. Violations could be minimized by increasing the sample size (e.g., by more usage of open-source ICU databases) and by studying treatment regimes that are (more) similar to those observed in the data. Examination of estimated inverse-probability-of-treatment weights is useful to detect (but not rule out) positivity violations.

Ask the right research question

Treatments of interest in the ICU are typically DTRs and, therefore, this type of treatments is expected to be the focus of CI research in the ICU. However, despite an upward trend in recent years (Supplementary Fig. 1b), still 83% of the studies that used G methods studied STRs. To illustrate that many of these studies are considering research questions that are not truly of interest in the ICU, we will explore some examples. Zhang and colleagues divided patients into two groups according to whether they received diuretics within the first two days of ICU admission or not⁶¹. Thus, the emulated target trial answers the question whether or not to administer diuretics at the start of ICU admission. However, we argue that the question an ICU physician is really interested in is when to administer diuretics throughout the whole ICU stay, taking into account changing patient characteristics such as fluid balance (especially at later ICU stages). In addition, many of the included studies emulated target trials comparing ‘giving treatment sometime during follow-up’ versus ‘never giving treatment’. For example, Bailly and colleagues studied the effect of systemic antifungal therapy, comparing a treated group (those who received antifungals during their ICU stay) with an untreated group (those who never received antifungals)⁶². As giving treatment ‘sometime during follow-up’ can be done in many ways, the estimated treatment effect is ill-defined and typically not truly of interest. In other words, both studies by Zhang⁶¹ and Bailly⁶² serve as examples of emulated RCTs that would never be conducted in the ICU. Conversely, Morzywołek and colleagues²⁷ emulated a target trial that reflects a very relevant research question, i.e., what is the optimal moment to start renal replacement therapy in acute kidney injury? They identified an optimal DTR based on a combination of biomarkers lowered mortality without increasing the number of RRTs, offering valuable insights for planning a future confirmatory RCT. Wang and colleagues investigated the effectiveness of low tidal volume ventilation³², a DTR investigated in one of the most influential RCTs in the field of intensive care medicine⁶³. This trial faced criticism (among other concerns) due to high non-adherence rates. Consequently, the target trial these authors emulated reflected the highly relevant question: what would the treatment effect have been under full compliance?

Describe the question as a target trial emulation

It can be beneficial to conceptualize a randomized trial that could have addressed the research query (i.e., the target trial). Even in cases where executing the exact target trial is unfeasible, outlining its constituent elements within the target trial framework¹¹ aids in recognizing shortcomings in the appropriateness of the research question and accuracy of the analysis. Almost one fifth of the included studies lacked a clear description of the ‘treatment strategies’ component of the target trial, that is, which treatment regimes are compared in the target trial. For example, Arabi and colleagues⁶⁴ used IPTW to study the effect of corticosteroid therapy for ICU patients with Middle East Respiratory Syndrome. However, it remains unclear which treatment regimes (e.g., ‘treat daily with corticosteroids’) are being compared. Moreover, more than one third of the included studies lacked a complete description of the ‘analysis plan’ component and therefore, are not reproducible. We advocate detailed description which allows reproduction of the used methodology, ideally accompanied with code and (example) data.

Use methods that suit the research question

Many studies were not included in this review because they modeled time-fixed treatments. As time-fixed treatments in the ICU are rare, we hypothesize that in many of these studies, the implicit treatment of interest is time-varying. Research questions concerning time-varying treatments may be reformulated into simplified, time-fixed versions, because standard methods are easier to implement or high-quality, longitudinal data is unavailable. One may argue that, if the bias introduced by TTC^18,19 is negligible, standard methods suffice for CI in time-varying treatments as well. However, empirical results from studies included in this review suggest that adjusting for TTC can lead to substantial differences in effect estimates and sometimes even to opposite conclusions (Fig. 4). Hence, it is possible that many of the excluded studies that implicitly studied time-varying treatments but modeled these as if they are time-fixed, published biased effect estimates. We advocate adjustment for TTC in any CI study where the treatment of interest is time-varying. Modeling DTRs is slightly more complex than STRs (which may be a reason for the focus on STRs among the included studies) and therefore requires different approaches. Various methods exist to find optimal DTRs, either from a set of pre-specified regimes or directly from data (for an overview, we refer to the book by Chakraborty and Moodie⁶⁵). Among the included studies in this review, for example, Shahn and colleagues²⁸ used ‘artificial censoring/IPW’^65,66,67 to estimate the optimal fluid-limiting treatment regime for sepsis patients among a pre-specified set of DTRs (i.e., ‘fluid caps’). RL methods and G estimation can be used to approximate optimal DTRs without a pre-specified set of regimes. In RL studies, finding the optimal treatment regime (typically referred to as the optimal ‘policy’, see Supplementary Table 1) is typically followed by a validation step to quantify the value of the optimized regime (i.e., OPE, Box 1). The ‘U-curve method’³⁸, a frequently used OPE method among the included RL studies in this review (Supplementary Fig. 2), is based on associating the difference between the (observed) clinician’s treatment regime and the (estimated) optimal treatment regime with patient outcome. As it completely ignores the potential effect of confounders, we recommend avoiding this method.

Mind the conditional exchangeability assumption

Conditional exchangeability is never guaranteed using observational data as the absence of unmeasured confounders is not verifiable in the data. To think about residual confounding or selection bias, incorporation of subject-matter expertise is key. Directed acyclic graphs (DAGs)⁶⁸ provide a simple and lucid approach for researchers dealing with observational data to showcase this expert knowledge, theories, and suppositions regarding the causal relationships among variables. For practical guidance on effective utilization of DAGs we refer to the work of Tennant and colleagues⁶⁹. There are different approaches to quantify how potential violations of the conditional exchangeability would affect the study results⁷⁰. Indirect approaches consider, for instance, the effect of adding additional confounders¹¹. A ‘bias analysis’ (or sensitivity analysis)⁷¹ examines the characteristics of potential unmeasured confounders and can be useful to quantify how much bias it would produce as a function of those characteristics.

Mind the positivity assumption

The positivity assumption -on the contrary- is verifiable, although this is rather complex for time-varying treatments⁷² and, given its dynamic nature, violations are expected in the ICU setting. The intuition for this assumption is that one can only study a treatment regime using data of patients who have received treatment that conform to this regime. The number of patient treatment histories that match the treatment regime of interest (i.e., the ‘effective sample size’⁷³) shrinks with the number of treatment decisions in the patient’s history (which tends to be high in the ICU). For example, Gottesman and colleagues³⁸ applied RL to a dataset of 3855 patients to find an optimal treatment regime for sepsis, but the effective sample size for this regime was only a few dozen. A small effective sample size makes positivity222222 violations likely and leads to high uncertainties in estimated treatment effects. A straight-forward opportunity to tackle this challenge is increasing the sample size. Therefore, we advocate more usage (if appropriate) of the four currently available open-source ICU databases⁷⁴. However, increasing the sample size does not guarantee increasing the effective sample size, as the patients in the extra dataset may not be treated according to the regime of interest. Hence, another opportunity to increase the effective sample size is to minimize the mismatches between the treatment regime(s) of interest and those observed in the data. Studies employing G methods may simply accomplish this by avoiding the modeling of treatment regimes which differ greatly from the treatment protocol in place. In contrast, studies utilizing RL typically do not pre-specify treatment regimes (as they optimize them through learning agents) and consequently, avoiding specific treatment regimes (or policies) is more challenging. In the RL literature, various approaches have emerged to address this issue, resulting in policies more closely aligned with physician practices⁷⁵. These approaches can be categorized into ‘policy constraint’ and ‘uncertainty based’ methods, with the latter exemplified by the application of Conservative-Q Learning in two of the included RL studies^76,77,78. Detection (but not ruling out) of violations of the positivity assumption can be facilitated through examination of the distribution of estimated (stabilized) inverse-probability-of-treatment (IPT) weights⁷², which was done in 20 of the 42 studies that utilized IPTW (Supplementary Table 6). This examination is recommended not only for IPTW but also for studies utilizing other CI methods. In studies using RL methods that employ importance-sampling³⁴ (which is closely related to IPTW⁵³) for OPE, analogous to the examination of IPT weights, it is recommended to examine the distribution of the importance weights³⁸. In studies using IPTW, weight stabilization and truncation can be used to limit high uncertainties in the effect estimates. Weight stabilization can improve the precision of effect estimates without the introduction of bias. However, a model based on stabilized weights results in a slightly different effect estimate compared to non-stabilized weights⁷⁹ and, therefore, should be interpreted carefully. Weight truncation also improves precision, but at the expense of bias. Examination of the influence of the introduced bias by checking the change of the effect estimates under progressive truncation of IPT weights is recommended⁸⁰.

This review stresses the importance of causality for actionable AI and provides a contemporary overview of CI research in the ICU literature. We describe shortcomings of the identified studies in terms of reporting and, furthermore, provide handles for improving future CI research. These recommendations are not limited to the ICU but apply to medical CI research as a whole. Unlike other reviews on time-varying medical treatments^22,81,82, we did not limit our focus to either G methods or RL, but rather acknowledge that both these method classes can be used to perform CI tasks and therefore, hold the promise to bring actionable AI to the bedside.

Our review has limitations. First, whereas efforts were made to ensure that the literature search was comprehensive, we could have missed studies for different reasons. Some research might have employed unconventional terminology to delineate their chosen CI method or utilized a CI approach that fell outside the scope of our search strategy. For example, dynamic weighted ordinary least squares (dWOLS)^83,84 is a relatively new method which has been used to model DTRs in the ICU setting in several studies^85,86. This method benefits from properties of both Q-learning (an RL method) and G-estimation (a G method) and may be an interesting direction for future research. Also, digital twin technology builds on causal inference and has been applied in the ICU setting⁸⁷. Second, items that were not collected could be of interest for future investigation. For example, we did not differentiate RL further into specific RL methods.

Towards actionable AI in the ICU, we concur with the guidance of editors of critical care journals^88,89 to change the focus of observational research in the ICU from prediction to causal inference. To unlock this potential in a trustworthy and responsible manner, we advocate development of models for CI focusing on clinically relevant treatments, using a description of the research question as a target trial emulation, choosing appropriate CI methods given the treatment of interest and acknowledging (and ideally examining potential violation of) the causal assumptions.

Methods

The study protocol was registered in the online PROSPERO database (CRD42022324014)⁹⁰. The filled-in PRISMA Extension for Scoping Reviews (PRISMAScR) checklist⁹¹ can be found in Supplementary Table 7.

Search strategy

Candidate articles were identified through a comprehensive search in Embase, MEDLINE ALL, Web of Science Core Collection, Google Scholar, medRxiv, bioRxiv, arXiv and ACM Digital Library up to March 2023, with no start date. We developed a search strategy that could be modified for each database (Supplementary Table 8). Search terms included relevant controlled vocabulary terms and free text variations for CI, G methods, or common RL methods, combined with ICU related terms.

Eligibility criteria

We included any primary research article, conference proceedings or pre-print papers that present models for the task of CI concerning time-varying treatments in patients admitted to the adult ICU. Articles were not eligible if it modeled a time-fixed treatment, utilized data from an RCT (unless the treatment of interest was not the randomized treatment), it focused on the introduction of new methodology, or was an abstract-only, review, opinion, or survey. Duplicates and articles not written in English were also excluded.

Study selection

Duplicate removal and eligibility screening were performed using EndNote 20 (Clarivate Analytics, Philadelphia, PA, USA). We used a two-stage approach for screening: first by title and abstract and then by full article text. Two reviewers (JS and WK) independently screened the titles and abstracts. The two reviewers then independently performed full-text screening on all articles. Conflicts regarding the eligibility of studies during the screening process were resolved by consensus in regular sessions between the two reviewers.

Data extraction

Data were extracted by JS and confirmed by WK. The items for extraction were based on the STROBE (STrengthening the Reporting of OBservational studies in Epidemiology) checklist⁹², supplemented by method-specific items. We extracted the following items from all included studies: details of study variables (i.e., studied treatment and primary outcome), the number of included ICUs, usage of open-source database(s), number of participants included, studied treatment type (Fig. 1), and used CI method. In addition, we extracted the following method-specific items: the usage of methods to reduce extreme weights (i.e., weight stabilization⁹³ and truncation⁸⁰) for studies using IPTW and the off-policy evaluation⁹⁴ (OPE, Box 1) method used for studies using RL. Finally, if a study estimated the treatment effect both by adjusting for baseline confounding and by adjusting for baseline confounding and TTC, we also collected these different estimates.