Harnessing EHR data for health research

Tang, Alice S.; Woldemariam, Sarah R.; Miramontes, Silvia; Norgeot, Beau; Oskotsky, Tomiko T.; Sirota, Marina

doi:10.1038/s41591-024-03074-8

Review Article
Published: 04 July 2024

Nature Medicine (2024)

Abstract

With the increasing availability of rich, longitudinal, real-world clinical data recorded in electronic health records (EHRs) for millions of patients, there is a growing interest in leveraging these records to improve the understanding of human health and disease and translate these insights into clinical applications. However, there is also a need to consider the limitations of these data due to various biases and to understand the impact of missing information. Recognizing and addressing these limitations can inform the design and interpretation of EHR-based informatics studies that avoid confusing or incorrect conclusions, particularly when applied to population or precision medicine. Here we discuss key considerations in the design, implementation and interpretation of EHR-based informatics studies, drawing from examples in the literature across hypothesis generation, hypothesis testing and machine learning applications. We outline the growing opportunities for EHR-based informatics studies, including association studies and predictive modeling, enabled by evolving AI capabilities—while addressing limitations and potential pitfalls to avoid.

**Fig. 1: EHR data flow and sources of heterogeneity and bias.**

Author information

Authors and Affiliations

Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA
Alice S. Tang, Sarah R. Woldemariam, Silvia Miramontes, Tomiko T. Oskotsky & Marina Sirota
Qualified Health, Palo Alto, CA, USA
Beau Norgeot
Department of Pediatrics, University of California, San Francisco, San Francisco, CA, USA
Marina Sirota

Corresponding author

Correspondence to Marina Sirota.

Ethics declarations

Competing interests

B.N. is an employee at Qualified Health. The other authors declare no competing interests.

Peer review

Peer review information

Nature Medicine thanks Wenbo Wu and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Karen O’Leary, in collaboration with the Nature Medicine team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Tang, A.S., Woldemariam, S.R., Miramontes, S. et al. Harnessing EHR data for health research. Nat Med (2024). https://doi.org/10.1038/s41591-024-03074-8

Received: 03 January 2024
Accepted: 17 May 2024
Published: 04 July 2024
DOI: https://doi.org/10.1038/s41591-024-03074-8
