Abstract
Delays in the identification of acute kidney injury in hospitalized patients are a major barrier to the development of effective interventions for treatment. A recent study described a series of models that outperformed previously published models in predicting acute kidney injury up to 48 h in advance, including a recurrent neural network that achieved state-of-the-art performance (area under the curve 0.92) and a gradient-boosted decision tree model that was close behind (area under the curve 0.89). Because these models were trained in a population of US veterans that was 94% male, questions have arisen about their generalizability to other health systems where the populations are more sex balanced. In this study, we aimed to evaluate how well an acute kidney injury model trained in a population of US veterans performs in female patients within the Veterans Affairs (VA) health system, and the extent to which its performance generalizes to a large academic hospital setting. We found that the model performed worse in predicting acute kidney injury in females in both populations, with miscalibration in lower stages of acute kidney injury and worse discrimination (a lower area under the curve) in higher stages of acute kidney injury. We demonstrate that, while this discrepancy in performance can be largely corrected in non-veterans by updating the original model using data from a sex-balanced academic hospital cohort, the worse model performance persists in veterans. Our study sheds light on the importance of characterizing the generalizability of artificial intelligence studies, and on the complexity of discrepancies in model performance in subgroups that cannot be explained simply on the basis of sample size.
Data availability
This study used data from the national Veterans Health Administration’s Corporate Data Warehouse and the University of Michigan. Analyses were performed in secure locations within the VA and UM information systems, respectively. The data in this study are not publicly available because they contain protected health information, and restrictions apply to their use. A sample of processed data from six patients has been made available online (ref. 19).
Researchers interested in obtaining deidentified Michigan Medicine patient data should contact PHDataHelp@umich.edu to obtain guidance on which regulatory and compliance requirements need to be fulfilled to obtain access to the Precision Health data resources. More details about the data and the access process are available at https://precisionhealth.umich.edu/. Source data are provided with this paper.
References
1. Hoste, E. A. J. et al. Global epidemiology and outcomes of acute kidney injury. Nat. Rev. Nephrol. 14, 607–625 (2018).
2. Wilson, F. P. et al. Automated, electronic alerts for acute kidney injury: a single-blind, parallel-group, randomised controlled trial. Lancet 385, 1966–1974 (2015).
3. Koyner, J. L., Adhikari, R., Edelson, D. P. & Churpek, M. M. Development of a multicenter ward-based AKI prediction model. Clin. J. Am. Soc. Nephrol. 11, 1935–1943 (2016).
4. Koyner, J. L., Carey, K. A., Edelson, D. P. & Churpek, M. M. The development of a machine learning inpatient acute kidney injury prediction model. Crit. Care Med. 46, 1070–1077 (2018).
5. Peng, J.-C. et al. Development of mortality prediction model in the elderly hospitalized AKI patients. Sci. Rep. 11, 15157 (2021).
6. Haines, R. W. et al. Acute kidney injury in trauma patients admitted to critical care: development and validation of a diagnostic prediction model. Sci. Rep. 8, 3665 (2018).
7. Motwani, S. S. et al. Development and validation of a risk prediction model for acute kidney injury after the first course of cisplatin. J. Clin. Oncol. 36, 682 (2018).
8. Tomašev, N. et al. A clinically applicable approach to continuous prediction of future acute kidney injury. Nature 572, 116–119 (2019).
9. McCradden, M. D., Stephenson, E. A. & Anderson, J. A. Clinical research underlies ethical integration of healthcare artificial intelligence. Nat. Med. 26, 1325–1326 (2020).
10. Tomašev, N. et al. Use of deep learning to develop continuous-risk models for adverse event prediction from electronic health records. Nat. Protoc. 16, 2765–2787 (2021).
11. Google. EHR modeling framework. GitHub https://github.com/google/ehr-predictions (2021).
12. Haibe-Kains, B. et al. Transparency and reproducibility in artificial intelligence. Nature 586, E14–E16 (2020).
13. McDermott, M. B. A. et al. Reproducibility in machine learning for health research: still a ways to go. Sci. Transl. Med. 13, eabb1655 (2021).
14. Stupple, A., Singerman, D. & Celi, L. A. The reproducibility crisis in the age of digital medicine. npj Digit. Med. 2, 2 (2019).
15. Carter, R. E., Attia, Z. I., Lopez-Jimenez, F. & Friedman, P. A. Pragmatic considerations for fostering reproducible research in artificial intelligence. npj Digit. Med. 2, 42 (2019).
16. Singh, K., Beam, A. L. & Nallamothu, B. K. Machine learning in clinical journals: moving from inscrutable to informative. Circ. Cardiovasc. Qual. Outcomes 13, e007491 (2020).
17. Robbins, R. et al. AI systems are worse at diagnosing disease when training data is skewed by sex. STAT https://www.statnews.com/2020/05/25/ai-systems-training-data-sex-bias/ (2020).
18. Larrazabal, A. J., Nieto, N., Peterson, V., Milone, D. H. & Ferrante, E. Gender imbalance in medical imaging datasets produces biased classifiers for computer-aided diagnosis. Proc. Natl Acad. Sci. USA 117, 12592–12594 (2020).
19. Singh, K. ML4LHS/va-aki-model: initial release. Zenodo https://doi.org/10.5281/zenodo.7129945 (2022).
20. World Health Organization. International Classification of Diseases (ICD) https://www.who.int/standards/classifications/classification-of-diseases (2022).
21. Sundararajan, V. et al. New ICD-10 version of the Charlson comorbidity index predicted in-hospital mortality. J. Clin. Epidemiol. 57, 1288–1294 (2004).
22. Khwaja, A. KDIGO clinical practice guidelines for acute kidney injury. Nephron Clin. Pract. 120, c179–c184 (2012).
23. Hand, D. J. & Till, R. J. A simple generalisation of the area under the ROC curve for multiple class classification problems. Mach. Learn. 45, 171–186 (2001).
24. DeLong, E. R., DeLong, D. M. & Clarke-Pearson, D. L. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 44, 837–845 (1988).
25. Morris, N. tboot: Tilted bootstrap. R package version 0.2.1 (2020).
26. Friedman, J. H. Greedy function approximation: a gradient boosting machine. Ann. Stat. 29, 1189–1232 (2001).
27. R Core Team. R: A language and environment for statistical computing. (R Foundation for Statistical Computing, 2022) https://www.R-project.org/
28. Singh, K. & Meyer, S. R. ML4LHS/gpmodels: initial release. Zenodo https://doi.org/10.5281/zenodo.7158501 (2022).
29. LeDell, E. h2o: R interface for the ‘H2O’ scalable machine learning platform. R package version 3.36.0.2 (2022).
30. Pafka, S. GBM performance. GitHub https://github.com/szilard/GBM-perf (2021).
Acknowledgements
This work was supported in part by the Veterans Health Association Innovation Program contract number 36C10B18C2766 (received by X.Z., V.S., H.Y., D.S., R.S., M.H. and K.S.) and through NIDDK R01DK133226 (received by M.M. and K.S.).
Author information
Contributions
V.S., R.S., S.C., M.H. and K.S. conceived and designed the study. J.C., X.Z., H.Y., D.S., M.H. and K.S. acquired, analysed and interpreted data. J.C., X.Z. and K.S. participated in the creation of the software used in this work. J.C. drafted the manuscript. X.Z., V.S., H.Y., D.S., R.S., S.C., M.M., G.N.N., M.H. and K.S. substantively revised the manuscript. All authors have approved the submitted version and have agreed both to be personally accountable for the author’s own contributions and to ensure that questions related to the accuracy or integrity of any part of the work, even ones in which the author was not personally involved, are appropriately investigated and resolved, and the resolution documented in the literature.
Ethics declarations
Competing interests
K.S.’s institution receives grant funding from Teva Pharmaceuticals and Blue Cross Blue Shield of Michigan for unrelated work, and K.S. serves on an advisory board for Flatiron Health. M.M. has received research grants from the US National Institutes of Health (NHLBI K01HL141701). G.N.N. is also supported by R01DK108803, U01HG007278, U01HG009610 and 1U01DK116100. G.N.N. reports personal income and equity and stock options from Renalytix and pulseData. G.N.N. is a scientific cofounder of Renalytix, Verici Dx, Pensieve Health, Nexus Health Connect and Data2Wisdom and owns equity in these companies. G.N.N. has received personal income from Siemens Healthineers, Variant Bio, AstraZeneca, Reata, BioVie, Daiichi Sankyo, Cambridge Health Consulting, Qiming Capital and GLG Consulting in the past three years. M.H. receives research grant funding from Astute Medical Inc. and Spectral Medical Inc., and serves as a consultant for Wolters-Kluwer Inc., Potrero Inc. and CardioSounds Inc. The remaining authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks Shalmali Joshi and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Extended data
Extended Data Fig. 1 Model performance (AUC) of the original VA model at each VA hospital.
Model performance of the original model at each VA hospital in the test set, along with characteristics of each VA hospital. A. Model performance with respect to area under the curve (AUC) with 95% CI of the original VA model for predicting AKI-1+ at each VA hospital. The center dot represents the AUC when the original model is applied to the hospital, and the 95% CI is calculated using DeLong’s method (ref. 24). B. Number of predictions (after excluding those with AKI-1+ at baseline) at each VA hospital. C. Hospitalization-level AKI-1+ incidence in the test set (after excluding those with AKI-1+ at baseline) at each VA hospital. Five VA hospitals are not shown here due to small cohort sizes (<30 patients).
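As an illustration of how a per-hospital AUC and its 95% CI could be obtained, the following is a minimal Python sketch of the DeLong variance estimate for a single AUC. It is not the authors' code (the reference list points to R packages); the function name `delong_auc_ci` and the toy scores are assumptions made for this example, and the interval is a simple normal approximation clipped to [0, 1].

```python
import math
from statistics import NormalDist


def _psi(x, y):
    """Kernel used by DeLong: 1 if the positive outranks the negative, 0.5 on ties."""
    if x > y:
        return 1.0
    if x == y:
        return 0.5
    return 0.0


def delong_auc_ci(scores, labels, level=0.95):
    """Return (auc, ci_low, ci_high) for binary labels (1 = event, 0 = no event)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    m, n = len(pos), len(neg)
    if m < 2 or n < 2:
        raise ValueError("need at least two events and two non-events")

    # Structural components: one placement value per positive and per negative.
    v10 = [sum(_psi(x, y) for y in neg) / n for x in pos]
    v01 = [sum(_psi(x, y) for x in pos) / m for y in neg]
    auc = sum(v10) / m

    # Sample variances of the placement values give the variance of the AUC.
    s10 = sum((v - auc) ** 2 for v in v10) / (m - 1)
    s01 = sum((v - auc) ** 2 for v in v01) / (n - 1)
    se = math.sqrt(s10 / m + s01 / n)

    z = NormalDist().inv_cdf(0.5 + level / 2)
    return auc, max(0.0, auc - z * se), min(1.0, auc + z * se)


if __name__ == "__main__":
    # Hypothetical predictions for one hospital (illustrative values only).
    scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
    labels = [1, 1, 0, 1, 0, 1, 0, 0]
    print(delong_auc_ci(scores, labels))
```

The pairwise loops cost O(m·n) per hospital, which is adequate for a sketch; a rank-based formulation would be preferable for cohorts with many predictions.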
Extended Data Fig. 2 Calibration of the original VA model on (a) the VA test set and (b) the UM test set.
The calibration of the original model on the a) VA test set and b) UM test set. The predicted probabilities (deciles) are plotted against the observed probabilities with 95% confidence intervals. The diagonal line demonstrates the ideal calibration. The model calibration is examined for all patients (red), females only (green), and males only (blue).
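A decile-based calibration curve of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, predictions are split into equal-sized bins after sorting, and the 95% interval for each observed proportion uses the Wilson score method, which the caption does not specify.

```python
import math


def wilson_interval(events, n, z=1.96):
    """Wilson score interval for a binomial proportion (assumed CI method)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = events / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def calibration_by_decile(probs, outcomes, n_bins=10):
    """Group predictions into n_bins equal-sized bins and compare with outcomes."""
    pairs = sorted(zip(probs, outcomes))
    n = len(pairs)
    if n < n_bins:
        raise ValueError("need at least one prediction per bin")
    rows = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        events = sum(y for _, y in chunk)
        low, high = wilson_interval(events, len(chunk))
        rows.append({
            "mean_predicted": sum(p for p, _ in chunk) / len(chunk),
            "observed": events / len(chunk),
            "ci_low": low,
            "ci_high": high,
        })
    return rows


if __name__ == "__main__":
    # Hypothetical predictions and outcomes (illustrative values only).
    probs = [0.1] * 10 + [0.9] * 10
    outcomes = [0] * 9 + [1] + [1] * 9 + [0]
    for row in calibration_by_decile(probs, outcomes, n_bins=2):
        print(row)
```

To reproduce the three curves described in the caption, the same function would be called separately on all patients, on females only and on males only; points on the diagonal (mean predicted equal to observed) indicate ideal calibration.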
Extended Data Fig. 3 Calibration of the extended VA model at UM.
The calibration of the extended model in the UM test set. The predicted probabilities (deciles) are plotted against the observed probabilities with 95% CI. The diagonal line demonstrates the ideal calibration. The model calibration is examined for all patients (red), females only (green), and males only (blue).
Extended Data Fig. 4 Predictor importance plot of the original and extended VA models.
The 20 most important predictors of the original VA model (top) and the extended VA model (bottom). Predictors are ranked by their relative importance, expressed as a percentage.
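One common way to express predictor importance as a percentage is to divide each predictor's raw importance by the total across all predictors. The short Python sketch below assumes that definition, which the caption does not spell out; the function name and the predictor names are hypothetical.

```python
def top_predictors_percent(raw_importance, k=20):
    """Rank predictors by raw importance and express each as a share of the total."""
    total = sum(raw_importance.values())
    if total <= 0:
        raise ValueError("total importance must be positive")
    ranked = sorted(raw_importance.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, 100.0 * value / total) for name, value in ranked[:k]]


if __name__ == "__main__":
    # Hypothetical raw importances (illustrative values only).
    raw = {"creatinine_last": 3.0, "age": 1.0}
    print(top_predictors_percent(raw))
```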
Supplementary information
Supplementary Information
Supplementary Tables 1–5
Source data
Source Data Extended Data Fig. 1
Statistical source data for Extended Data Figure 1
Source Data Extended Data Fig. 2
Statistical source data for Extended Data Figure 2
Source Data Extended Data Fig. 3
Statistical source data for Extended Data Figure 3
About this article
Cite this article
Cao, J., Zhang, X., Shahinian, V. et al. Generalizability of an acute kidney injury prediction model across health systems. Nat Mach Intell 4, 1121–1129 (2022). https://doi.org/10.1038/s42256-022-00563-8
DOI: https://doi.org/10.1038/s42256-022-00563-8