Sepsis continues to cause significant morbidity and mortality among preterm very low birthweight (VLBW) infants in the neonatal intensive care unit (NICU), and earlier detection and treatment can reduce mortality and improve outcomes for survivors. In this narrative review, we address a number of questions related to artificial intelligence (AI) for sepsis prediction and detection in NICU patients. First, we discuss aspects of neonatal sepsis that make it a tractable problem for machine learning (ML) predictive models. Next, we cover technical aspects of ML model development and validation, including variable selection using both static and dynamic data. We then review some existing early warning and ML systems. Finally, we discuss the benefits of and barriers to implementing sepsis prediction systems in the NICU, with the goal of “right timing” antibiotics to improve patient outcomes.

Q1: How suitable is the problem of neonatal sepsis for AI solutions?

Premature infants in the NICU are, in a number of ways, an ideal population for AI-based sepsis monitoring. They are immune-compromised and require invasive devices that create a high risk for sepsis, yet they may have a period of relative stability before developing sepsis. Late-onset sepsis (LOS) does not present when the pathogen invades the blood stream, but instead as a sub-acute physiologic response with inflammation and organ dysfunction. Therefore, when advanced analytics of patient-generated data can detect the transition from “well” to “ill”, predictive models can translate this information to the clinical team to provide earlier warning of a sub-acute, potentially catastrophic deterioration. The advantage of early warning and treatment of sepsis must be considered in balance with the potential disadvantage of increasing antibiotic exposure, which has negative consequences.1,2,3,4,5 Non-specific signs of sepsis such as apnea and respiratory distress and the risk of rapid deterioration with delayed treatment make this balance challenging for clinicians. Thus, it is imperative to not only develop sepsis warning systems with limited false alarms, but also to teach clinicians to use AI model output in the context of all available clinical data in making decisions about starting and stopping antibiotics. A final consideration is the distinction between early- and late-onset sepsis (EOS within 3 days from birth, LOS after 3 days). For EOS, a simple static prediction model (the EOS calculator) has been developed and its broad implementation has reduced antibiotic use.6 In this review we focus primarily on the prediction of LOS incorporating both static and dynamic data, including continuously streaming vital sign data from NICU bedside monitors.7

Prediction models will perform best if the targeted outcome is well-defined and validated. For sepsis, this requires careful medical record review rather than simply relying on ICD codes since many studies have shown that diagnostic codes for sepsis are inaccurate.8,9 A challenge with regard to neonatal sepsis is the lack of a consensus definition,10,11 making it difficult to compare and interpret results across studies.12,13 Some prediction models train only on culture-positive sepsis, while many include cases of “clinical sepsis” in which an infant has significant signs of illness and clinicians opt to prolong antibiotic treatment despite negative cultures. Experts argue that in the setting of modern laboratory equipment and sufficient inoculation volume, the likelihood of a false negative blood culture is extremely low.14,15 Nonetheless, as discussed later in this review, many prediction models have been developed using both clinical and culture-positive sepsis cases, and it is therefore important for clinicians to use judgment to decide on the duration of therapy in the face of a high or rising risk score, since misuse of antibiotics can lead to adverse outcomes.16

Finally, for AI models to be widely useful for NICU patients they must be generalizable and reproducible. FAIR data principles were proposed as a way for AI research and development to achieve this goal—data should be findable, accessible, interoperable, and reusable.17,18 Large data sets and external validation are likely to improve generalizability and translation to clinical care. However, generating data that are FAIR and models that are externally validated in large cohorts is no small task; it typically requires long-standing multicenter, multi-specialty research collaborations.

Q2: What are the important AI and machine learning model development concepts?

Figures 1 and 2 provide a conceptual overview of aspects of AI relevant to healthcare applications, from algorithm development through clinical implementation and integration. ML is a type of AI that includes supervised methods such as classification and regression, using algorithms to find structure in labeled data, and unsupervised methods involving clustering and dimension reduction of unlabeled data. Generally, sepsis prediction models use supervised ML with various modeling methods, including regression, tree-based methods, neural networks, and others. In some studies, a variety of modeling methods were shown to have similar predictive performance,19,20 while in other studies, a specific method was found to have better performance.21

Fig. 1: Concepts overview.

Schema illustrating the overlapping concepts of artificial intelligence, machine learning, and prediction models with brief descriptions.

Fig. 2: Artificial intelligence (AI) process diagram.

A conceptual diagram illustrating the key steps of developing sepsis AI technology, from idea to model development, testing, and translation to clinical implementation.

A common way to assess ML model performance is using the area under the receiver operating characteristics curve (AUC) to summarize the model’s ability to discriminate cases from controls over all possible thresholds.22 The AUC value alone is insufficient to evaluate model performance since it does not consider prior probability, does not provide information about the distribution of errors, and weights omission and commission errors equally.23,24 Moreover, even a model with good discrimination may provide risk estimates that are unreliable.25
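As a minimal sketch of what the AUC summarizes, the Python snippet below computes it as the probability that a randomly chosen case receives a higher score than a randomly chosen control. The scores and labels are made-up illustrative values, not data from any cited study, and the function name `auc` is our own.

```python
def auc(scores, labels):
    """AUC as the probability that a randomly chosen case scores higher
    than a randomly chosen control (ties count as one half)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Hypothetical risk scores: 3 infants with sepsis (1) and 5 without (0).
scores = [0.9, 0.6, 0.3, 0.7, 0.2, 0.1, 0.4, 0.05]
labels = [1, 1, 1, 0, 0, 0, 0, 0]

print(auc(scores, labels))

# Dividing every score by 10 changes the risk estimates but not their
# ranking, so the AUC is unchanged.
shrunk = [s / 10 for s in scores]
print(auc(shrunk, labels))
```

Because the AUC depends only on the ranking of scores, the second call returns the same value as the first even though every risk estimate is ten times smaller. This is one way to see why good discrimination does not guarantee reliable risk estimates.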

Another way to evaluate model performance is by calculating sensitivity, specificity, and negative and positive predictive values (NPV and PPV).26 Although LOS occurs in approximately 15% of very preterm infants, the chance of an infant developing sepsis on any particular day is quite low. Thus, models that continuously evaluate the risk of imminent sepsis will have a low PPV at any threshold that gives acceptably high sensitivity. ML model performance should also be evaluated using qualitative methods, such as calibration plots and time-to-event plots (Fig. 3). The calibration of a model’s risk predictions can be visualized by plotting the observed risk as a function of the predicted risk.27 Time-to-event plots show the average model output in a cohort relative to the time of the event and illustrate the horizon or lead time for sepsis prediction. This qualitative assessment provides valuable information about a model’s clinical utility, since a score that does not rise before clinicians recognize illness is unlikely to benefit patients.
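The effect of a low daily event rate on PPV can be illustrated with a short calculation. The sketch below applies the standard definitions of PPV and NPV; the sensitivity, specificity, and daily event rate are assumed values chosen only for illustration and are not taken from any cited model.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from sensitivity, specificity, and prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


def npv(sensitivity, specificity, prevalence):
    """Negative predictive value from sensitivity, specificity, and prevalence."""
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    return true_neg / (true_neg + false_neg)


# Assumed operating point (illustrative only).
SENS, SPEC = 0.80, 0.90

# Cumulative risk over the NICU stay (about 15%, as in the text) versus
# an assumed risk on any single day (0.5%, a hypothetical value).
for label, prev in [("whole stay", 0.15), ("single day", 0.005)]:
    print(label, round(ppv(SENS, SPEC, prev), 3), round(npv(SENS, SPEC, prev), 3))
```

With the same sensitivity and specificity, the PPV falls sharply when the event rate is that of a single day rather than the whole stay, while the NPV stays high.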

Fig. 3: Examples of model performance metrics.

The left panel displays a receiver operating characteristic curve for a population of patients. The curve is created by plotting sensitivity against 1 − specificity across all thresholds of the model output, and the AUC is the area under it. The middle panel is an example of three model calibration curves for a population of patients. Predicted risk relative to average is on the abscissa and observed risk relative to average is on the ordinate. Each point represents one decile of predicted risk. The line of identity is shown as a dashed line and represents a well-calibrated model. The other two lines represent models that either over- or under-predict risk. The right panel is an example of a “time-to-event” plot for an individual patient. In this example, the model output (risk of sepsis score) rises steeply 4 h before a patient had a blood culture sent that diagnosed sepsis at time zero. Theoretically, if clinicians could see and interpret the rising score, antibiotics might have been given 4 h earlier in this example.
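The points of a calibration curve like the one described in the middle panel can be computed by sorting observations by predicted risk, splitting them into equal-sized bins (deciles when `n_bins=10`), and comparing the mean predicted risk with the observed event rate in each bin. The sketch below is one simple way to do this. It uses absolute rather than relative risk, and the function name and toy data are our own assumptions.

```python
def calibration_points(predicted, observed, n_bins=10):
    """Return (mean predicted risk, observed event rate) for each bin,
    where bins are equal-sized groups ordered by predicted risk."""
    pairs = sorted(zip(predicted, observed))
    n = len(pairs)
    points = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue  # fewer observations than bins
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        event_rate = sum(y for _, y in chunk) / len(chunk)
        points.append((mean_pred, event_rate))
    return points


# Toy data: 10 observations predicted at 10% risk (1 event) and
# 10 predicted at 50% risk (5 events), i.e. a well-calibrated model.
predicted = [0.1] * 10 + [0.5] * 10
observed = [1] + [0] * 9 + [1] * 5 + [0] * 5

for mean_pred, event_rate in calibration_points(predicted, observed, n_bins=2):
    print(round(mean_pred, 2), round(event_rate, 2))
```

Points that fall on the line of identity (predicted equals observed) indicate good calibration. Points below it indicate over-prediction and points above it under-prediction.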

Once a model is developed or trained, testing or validation is an essential next step. A validation data set can be internal (a subset of the original data set) or external, from a new cohort at a different center. Validation in data sets with similar patient characteristics and practice patterns compared to the training data provides evidence for reproducibility of model performance, while external validation using data from cohorts with different characteristics (for example, different centers, patient demographics, level of illness, or clinical practices) provides evidence of model transportability.28 In an example from our prior work, we showed that differences in invasive versus non-invasive respiratory support across NICUs impacted the performance of a sepsis prediction model that incorporated features to detect apnea.29 In addition to external validation, ongoing evaluation of ML models is needed to ensure adequate performance after implementation. Data shift or drift may occur over time as practices, hospital systems, and patient populations change.30 Examples that could impact NICU sepsis model performance include a change in bedside monitors with differences in HR or SpO2 averaging times, changes in practices for obtaining specific laboratory tests that serve as model inputs, or changes in the use of medications or respiratory support that may impact vital sign patterns.
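One simple way to watch for data shift in a model input is to compare its distribution in a recent period with a reference period. The sketch below is an illustrative check of our own, not a method from the cited work. It reports the shift in the mean (in reference standard deviations) and the ratio of standard deviations. The HR values are invented to mimic a monitor change to a longer averaging time, which would smooth the signal without changing its mean.

```python
import statistics


def input_shift(reference, recent):
    """Compare a model input between a reference and a recent period.

    Returns (mean shift in units of the reference SD, ratio of recent SD
    to reference SD). Values near (0, 1) suggest no obvious shift.
    """
    ref_mean = statistics.fmean(reference)
    ref_sd = statistics.stdev(reference)
    mean_shift = (statistics.fmean(recent) - ref_mean) / ref_sd
    sd_ratio = statistics.stdev(recent) / ref_sd
    return mean_shift, sd_ratio


# Hypothetical HR samples (beats per minute) before and after a change
# in monitor averaging time.
hr_reference = [150, 155, 145, 160, 140, 150]
hr_recent = [150, 152, 148, 151, 149, 150]

mean_shift, sd_ratio = input_shift(hr_reference, hr_recent)
print(round(mean_shift, 2), round(sd_ratio, 2))
```

Here the mean is unchanged but the variability falls to a fraction of its reference value, which would matter for any model feature derived from HR variability.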

The ultimate step in model evaluation is conducting randomized clinical trials to determine whether displaying the output of an AI algorithm leads to meaningfully improved outcomes. Only through well-designed, large clinical trials will sepsis AI systems be trusted, implemented, and routinely used for patient care. Finally, since many research groups are developing sepsis AI, it is important that results and algorithms be shared among researchers. In order to better interpret results across studies, models should be reported using a standardized format such as TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis)31 or a subsequent format specific for AI (TRIPOD-AI).32

Q3: Which data are useful in neonatal sepsis prediction models?

When predicting imminent LOS, AI models may use physiologic data derived from high-resolution data (e.g., the electrocardiogram waveform signal sampled at 250 Hz), low-resolution data (e.g., demographics, clinical risks or signs, intermittently sampled vital signs, or laboratory tests), or a combination of both. The inflammatory response to sepsis manifests as changes in multiple physiologic processes that we measure as vital signs, making these data particularly useful for detecting and predicting LOS.33,34 Patterns in continuous cardiorespiratory data have been identified as signatures of illness due to sepsis. Predictive modeling translates these physiomarkers of sepsis into a predicted risk of imminent deterioration, which has potential clinical utility as an early warning system. For example, low variability of HR accompanied by HR decelerations was recognized as a signature of illness in neonates.35 The mechanism of abnormal heart rate characteristics (HRC) during sepsis involves cytokine signaling and autonomic nervous system activation with increased vagus nerve firing.36,37,38,39 The HRC index, a continuous sepsis prediction model, was developed to capture these abnormal patterns and is discussed later in this review.7

In searching beyond HR patterns for physiomarkers of sepsis, a logical place to look is in the respiratory data. An increase in central apnea is one of the major signs of sepsis in preterm NICU patients,38,39,40,41,42 due in part to the cytokine-triggered release of endogenous prostaglandins.38,39,40 Apnea detection through chest impedance waveform analysis is complicated, while detection of a decline in HR and SpO2 that often accompany apnea is simpler. One analytic that serves this purpose is the cross-correlation of HR and SpO2, which measures the degree to which the two signals co-trend within a set lag time. An increase in this metric captures deceleration-desaturation events, which correlate with increased central apnea or exaggerated pathologic periodic breathing in preterm infants.29,41
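The idea of cross-correlating HR and SpO2 within a set lag can be sketched as follows. This is a simplified illustration, not the published analytic: it takes the largest Pearson correlation between the two signals over a range of sample lags, and the signals are invented values in which a desaturation follows an HR deceleration by two samples.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is flat)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def max_cross_correlation(hr, spo2, max_lag):
    """Largest correlation between hr and spo2 over lags of -max_lag..max_lag
    samples. A positive lag pairs hr[t] with spo2[t + lag]. Assumes the two
    signals have equal length and the same sampling rate."""
    best = -1.0
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = hr[:len(hr) - lag], spo2[lag:]
        else:
            a, b = hr[-lag:], spo2[:len(spo2) + lag]
        if len(a) >= 3:
            best = max(best, pearson(a, b))
    return best


# Invented signals: HR decelerates at samples 3-5, SpO2 falls at samples 5-7.
hr = [150, 150, 150, 100, 80, 100, 150, 150, 150, 150]
spo2 = [95, 95, 95, 95, 95, 80, 74, 80, 95, 95]

print(round(pearson(hr, spo2), 2))                   # no lag allowed
print(round(max_cross_correlation(hr, spo2, 3), 2))  # lags up to 3 samples
```

Allowing a lag lets the metric capture a deceleration and a desaturation that belong to the same event but are offset in time, which a zero-lag correlation would miss.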

Changes in physiologic data can be non-specific while still useful for sepsis detection and prediction. Interpreting a rising sepsis risk score may require consideration of the clinical context and the patient’s baseline condition. Some preterm infants have chronically abnormal HR and SpO2 patterns reflecting pathologies unrelated to sepsis. One solution could be to incorporate a patient’s baseline into the calculated risk to account for inter- and intra- patient variability and allow for personalized AI predictions.

Demographic, laboratory, and clinical data for sepsis AI algorithms

The EHR contains many pertinent clinical data that add to sepsis risk prediction. Lower gestational age and birthweight correlate strongly with a higher risk of LOS and can stand alone to risk stratify premature infants at birth or add to models that use continuous vital sign data.7,43,44,45 Postnatal and postmenstrual age also add to risk prediction due to the peak in LOS incidence at 1–3 weeks of age.46,47 Additional demographic and perinatal variables may improve model performance, such as sex,48,49 race, ethnicity, or delivery mode.50,51 While including these variables in an ML model is likely to improve the AUC, they provide only static information.

Laboratory tests that measure components of the host response to infection, such as immature neutrophils or serial C-reactive protein values, are commonly used for sepsis screening and may serve as decision support for either starting or withholding antibiotics in conjunction with other clinical variables.52,53,54,55,56 However, such tests are typically ordered by a clinician who is already concerned about sepsis, and therefore the information they provide is likely to lag behind clinical suspicion rather than provide early warning.

Clinical risk factors may be incorporated into sepsis ML models including the presence of a central vascular catheter or mechanical ventilation and medications that increase sepsis risk such as postnatal steroids. Clinical signs of sepsis may also be incorporated into models, including increased apnea, respiratory distress, feeding intolerance, poor perfusion, temperature instability, hypotension, and lethargy.42,57,58 These signs may be captured from EHR documentation, but once an ICU clinician documents “lethargy” in the EHR they typically have already ordered the blood culture and antibiotics. Several published sepsis detection models instead use patient-generated data to detect clinical signs of sepsis, including using HR and SpO2 data to detect an increase in apnea,41 core to peripheral temperature differential to detect impaired thermoregulation,59 and cardiorespiratory waveform data to detect decreased infant motion or lethargy.60 AI models for LOS in the NICU that include continuous physiologic data are likely to be more clinically useful than those that use only EHR data.

Q4: What advanced warning systems for sepsis exist, and what lies in the future?

Before discussing AI systems for sepsis, consideration should be given to “Early Warning Scores” (EWS). EWS and ML models are both designed to alert the medical team to concerning clinical changes that might otherwise go unnoticed. Both can be integrated into the EHR or displayed at the bedside, and both can incorporate information from a mix of static and dynamic clinical variables. EWS employ a “track and trigger” approach in which observations are scored against preset thresholds, whereas AI models learn temporal trends and correlations among parameters from the data.61 For example, the Pediatric Early Warning Score is calculated based on periodic observations of multiple physiological parameters and is designed to predict clinical deterioration (including but not limited to sepsis) in hospitalized children.62,63 AI models would be expected to perform better than EWS because they can use the data as continuous rather than categorical values determined by thresholds. Additionally, modeling rather than empirically derived cutoffs can detect more subtle and complex patterns in the data associated with the target outcome.
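The difference between threshold-based scoring and a model that uses continuous values can be shown with a toy comparison. Everything in the sketch below is hypothetical: the vital sign cutoffs, the logistic coefficients, and the function names are invented for illustration and do not correspond to the Pediatric Early Warning Score or to any published model.

```python
import math


def track_and_trigger(hr, rr, temp_c):
    """Hypothetical EWS: one point for each vital sign outside a fixed range."""
    points = 0
    points += 1 if hr > 180 or hr < 100 else 0
    points += 1 if rr > 70 or rr < 25 else 0
    points += 1 if temp_c > 38.0 or temp_c < 36.0 else 0
    return points


def continuous_risk(hr, rr, temp_c):
    """Hypothetical logistic model using the same inputs as continuous values.
    Coefficients are invented, not fitted to data."""
    z = -4.0 + 0.03 * (hr - 150) + 0.02 * (rr - 45) + 1.0 * abs(temp_c - 37.0)
    return 1.0 / (1.0 + math.exp(-z))


# Two nearly identical observations on either side of the HR cutoff.
a = (180, 60, 37.0)
b = (181, 60, 37.0)

print(track_and_trigger(*a), track_and_trigger(*b))
print(round(continuous_risk(*a), 4), round(continuous_risk(*b), 4))
```

A one beat per minute difference moves the threshold-based score by a full point, while the continuous model output changes only slightly. Conversely, the continuous model would register a steady rise in HR toward the cutoff that the threshold-based score would ignore until the cutoff is crossed.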

Though this review is focused on neonatal LOS prediction, the EOS calculator deserves mention since it is widely used and exemplifies some important aspects of sepsis AI.6,64,65 The model uses perinatal risk factors known at the time of birth in a logistic regression model to derive prior probability and then incorporates the clinicians’ assessment (asymptomatic, equivocal, or clinically ill) using Bayes’ theorem. The risk per 1000 live births is displayed for each of the three categories of illness.6,64 Decision support is provided, allowing for clinical judgment to guide the application of the AI technology, which is likely a factor in the widespread adoption of this model. Studies of the impact of the EOS calculator have shown it reduces the number of asymptomatic or equivocal infants with sepsis risk factors undergoing laboratory evaluations and exposure to antibiotics.64
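The two-step structure described above, a logistic regression prior followed by a Bayesian update for the clinical examination, can be sketched using the odds form of Bayes’ theorem. The intercept, coefficients, features, and likelihood ratios below are placeholders invented for illustration. They are not the values used in the published EOS calculator.

```python
import math


def prior_probability(intercept, coefficients, features):
    """Prior probability from a logistic regression on perinatal risk factors."""
    z = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))


def posterior_probability(prior, likelihood_ratio):
    """Update a prior probability with a likelihood ratio (Bayes' theorem in odds form)."""
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


# Placeholder model: two made-up risk factors with made-up coefficients.
prior = prior_probability(-7.5, [0.8, 0.5], [1.0, 0.0])

# Placeholder likelihood ratios for the three clinical assessment categories.
HYPOTHETICAL_LR = {"asymptomatic": 0.4, "equivocal": 5.0, "clinically ill": 20.0}

for category, lr in HYPOTHETICAL_LR.items():
    risk_per_1000 = 1000 * posterior_probability(prior, lr)
    print(category, round(risk_per_1000, 2))
```

Keeping the clinical examination as a separate update step, rather than folding it into the regression, is what allows the risk to be displayed for each category of illness from a single prior.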

Developing a calculator for LOS would be substantially more complicated, since it is more common than EOS, occurs over a wide time range, and has non-specific clinical signs that are common in preterm infants with non-infectious conditions. Nonetheless, a number of tools for predicting LOS in NICU patients before they are obviously sick have been published.66 Table 1 summarizes models using continuous or intermittently sampled data to predict LOS before clinical deterioration prompting a blood culture and antibiotics.7,60,67,68,69 Several other studies have used data at the time of blood culture to predict whether sepsis will be ruled in or out (positive versus negative culture) which may be useful for determining when to start and stop antibiotics.70,71 And finally, several studies have used vital sign data shortly after birth to predict the risk of developing sepsis later in the NICU course, which might identify highest risk infants in need of enhanced vigilance or preventive strategies not suitable for the entire preterm population.72,73

Table 1 A summary of select studies reporting the development and performance of machine learning (ML) models to predict imminent late-onset sepsis.

To date, the only commercially available system for NICU predictive monitoring using continuous bedside monitor data is the HRC index, or HeRO Score. This is also the only ML model for neonatal sepsis that has been tested in a randomized clinical trial and shown to improve important clinical outcomes.74 The HRC algorithm uses electrocardiogram data from standard NICU bedside monitors to calculate the fold-increased risk of a clinical deterioration due to sepsis (culture-proven) or a sepsis-like illness (clinical sepsis) in the next 24 h. The algorithm uses mathematical calculations that report on decreased HR variability and transient HR decelerations, patterns shown in pre-clinical models to reflect pathogen-induced inflammatory cytokine release and vagus nerve firing.37,75 In a randomized clinical trial of 3003 VLBW infants at nine NICUs,74 display of this risk score was associated with significantly lower sepsis-associated mortality (12% versus 20%), presumably due to earlier treatment.76 Importantly, the display of the score resulted in a small increase in the number of blood cultures and antibiotic days, but only among infants with confirmed sepsis.76 This indicated that clinicians may have also used the score for its NPV to decide not to start antibiotics or to discontinue antibiotics in patients with non-specific, mild clinical signs.

Q5: What are some benefits and barriers to sepsis ML model implementation and clinical integration?

Much has been written about the potential benefits of AI implementation in healthcare,77 but “AI solutions” will not replace the hard work of clinicians deciding which patients require testing and therapies.78 Properly developed sepsis AI systems might direct clinicians to the right bed at the right time, leading to earlier antibiotic treatment and supportive care and thereby to improved outcomes. The 20% reduction in sepsis-associated mortality with continuous HRC index display in the HeRO RCT is an example. For survivors of neonatal sepsis, earlier treatment might have other benefits, such as reduced NICU length of stay.79 The caveat, of course, is that sepsis AI must be implemented with care to avoid misuse of antibiotics for non-infectious clinical deterioration that is common in preterm infants in the NICU.

Beyond direct patient benefits, other potential benefits of using AI risk models for sepsis include resource allocation and risk stratification, which can be useful for cost-effectiveness analyses, classification for research, and benchmarking across hospitals. Also, care is required in AI model design to avoid introducing biased data into algorithms. A first step in addressing bias in AI is to develop and test algorithms for performance across the spectrum of patient sex, race, ethnicity, and socioeconomic status. With regard to sepsis, a potential advantage of physiology-based algorithms is that heart rate patterns of neonates tend to be similar across the spectrum of patient diversity. Adding pulse oximetry data to HRC will need close scrutiny since racial differences in accuracy of pulse oximetry data have recently been described in adults,80 children,81 and neonates.82 Regardless of what sources of data serve as model input, AI algorithms should be developed and validated in large, diverse patient populations with efforts made to minimize bias of all types.

Although there are many potential benefits of AI, there are also many barriers. We developed the acronym “BARRIERS” to summarize some major challenges in this field: Babies, Analytics, Reactors, Reassurance, Integration, Equipment, Re-education, and Space.83 Babies themselves can complicate the development and deployment of early warning systems for sepsis since they cannot announce that they feel sick, and their signs of sepsis are non-specific and overlap with normal preterm physiology. This creates the problem of false alarms in a unit already prone to alarm fatigue.84,85 Another barrier, “Analytics,” refers to the difficulty in creating models due to heterogeneity of event identification, variable selection, and modeling techniques, as previously discussed in this review. “Reactors” are the model users, NICU clinicians with varied education, experience, and responsibilities. The barrier, in this case, is the difficulty in displaying data and clinical decision support in a way that is effective for a broad range of clinicians. “Reassurance” can be a problem with AI models if a low-risk score falsely reassures the clinical team faced with an infant with significant signs of illness, leading to a delay in treatment. “Integration” refers to the challenge of introducing an AI model without creating too many distracting false alarms. One way to mitigate alarm fatigue yet ensure that critical information is transmitted to the right person is to have a centralized clinical team that reviews alerts and determines which ones should be transmitted to the care team,86 an approach that may not be broadly feasible. The “E” in BARRIERS is equipment that must be integrated into the clinical workflow. Once the system is integrated, education and “Re-education” for users are critically important to ensure proper implementation. And finally, “Space” can be a barrier since the NICU bedside may already be crowded with equipment and monitors. A new sepsis prediction system needs to be positioned so that it is noticeable but not overwhelming.

In the case of an AI system that is shown to improve patient outcomes, implementation relies on hospital administrators and clinicians “buying in.” A survey-based study of continuous predictive monitoring reported that users had positive engagement with the system if they trusted the data used in the model and if they understood the science behind the model outputs.87 This is the basis for the term “explainable AI” which some view as essential for clinicians to utilize the system, although others argue that methods to make models explainable sacrifice their performance.88 A final consideration is that AI systems may introduce unintended consequences such as inappropriate testing and therapies. Further research is needed to characterize (or differentiate) how sepsis AI models perform in events of non-specific clinical deterioration versus culture-proven sepsis.

Conclusion

Sepsis AI is a way to analyze and present data to clinicians for earlier detection and treatment leading to improved patient outcomes. If properly developed and implemented, AI systems can alert clinicians to a change in a patient’s condition that warrants a bedside evaluation. At that point, human intelligence and experience can combine computer-generated risk information with what they see and what they know to make the best decisions for individual patients.