Introduction

The rapid proliferation of the Electronic Health Record (EHR) and the associated availability of voluminous digitized clinical data have led to tremendous interest in the development of digital health applications. Crucial to this is the ability to subset patients using clinical inclusion and exclusion criteria, a process commonly referred to as clinical phenotyping, patient screening, or cohort retrieval1,2 (see Fig. 1). Phenotyping has traditionally been conducted manually, and there has consequently been great interest in accelerating it via in-silico means3,4. Cross-task generalizable solutions for in-silico phenotyping, however, are not widespread5.

Fig. 1: An example NLP-based clinical phenotyping task.

An example clinical phenotyping task for determining whether a patient has a history of working night shifts. On the left, we show how such a criterion might be depicted in plain text. In the center, we show what such a query might look like for text-based applications. On the right, we show a relevant text fragment from a clinical narrative.

In this work, we introduce the Intelligent Machine for Patient Accrual and Classification Tasks (IMPACT), a framework and example implementation embodying desiderata for accessible and re-usable in-silico phenotyping tools, as observed through our efforts delivering in-silico phenotyping solutions.

The IMPACT framework for accessible in-silico clinical phenotyping

Variations in task-specific factors such as complexity, required information, and desired results6 have hindered implementation of task-generalizable phenotyping solutions7,8. Here, we present several desiderata for in-silico phenotyping tools, as well as existing approaches, where applicable.

Desideratum I: Be infrastructure-flexible and scalable

Adapting software products is generally easier than switching computing infrastructure; phenotyping tools must therefore be flexible in both their data inputs/outputs and the computing infrastructure they run on. This can be accomplished through built-in support for various popular setups, covering both data repository type (e.g., SQL, Elasticsearch9, MongoDB10, BigQuery11, Fast Healthcare Interoperability Resources (FHIR)12 datastores) and data model (e.g., the Observational Medical Outcomes Partnership (OMOP)13 and PCORnet14 Common Data Models (CDMs)).

In addition, tools must be scalable: without scalability, running phenotyping across large-scale datasets would demand significant engineering effort/time, particularly when data sources require natural language processing (NLP) or image processing to extract clinical information.

Desideratum II: Support both ranked score and boolean retrieval schemes

Determining patient classification as a boolean true/false may not always be ideal. Instead, score-based ranking on closeness of match may be appropriate15, particularly during algorithm refinement, when evidence may be missing (e.g., relevant information is not present in the data sources used). Boolean retrieval, where patients are classified as either fully matching or not matching a given phenotype, fails to produce results when evidence is missing. Conversely, ranked retrieval will surface patients that may be missing only a subset of the criteria for further review. Boolean retrieval, however, may still be appropriate once an algorithm matures (e.g., for large-scale cohort accrual), necessitating support for both retrieval modes.
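To make the distinction concrete, the following minimal Java sketch contrasts the two retrieval modes; the Patient record and per-criterion scores are hypothetical stand-ins for illustration, not IMPACT's actual data model:

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical stand-in: a patient with a relevance score per criterion
// (0 where no supporting evidence was found).
record Patient(String id, Map<String, Double> criterionScores) {}

class RetrievalModes {
    // Boolean retrieval: only patients with evidence for EVERY criterion
    // are returned; patients with any missing evidence silently drop out.
    static List<Patient> booleanRetrieve(List<Patient> patients, Set<String> criteria) {
        return patients.stream()
                .filter(p -> criteria.stream()
                        .allMatch(c -> p.criterionScores().getOrDefault(c, 0.0) > 0))
                .collect(Collectors.toList());
    }

    // Ranked retrieval: all patients are returned, ordered by aggregate
    // closeness of match, so partial matches surface for manual review.
    static List<Patient> rankedRetrieve(List<Patient> patients, Set<String> criteria) {
        return patients.stream()
                .sorted(Comparator.comparingDouble((Patient p) -> criteria.stream()
                        .mapToDouble(c -> p.criterionScores().getOrDefault(c, 0.0))
                        .sum()).reversed())
                .collect(Collectors.toList());
    }
}
```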

Clinical CDMs such as OMOP13 and PCORnet14 possess boolean retrieval capabilities. Ranked retrieval, however, is relatively less prevalent, and existing approaches focus on unstructured text. Examples of such efforts include the Electronic Medical Record Search Engine (EMERSE)16 and Cohort Retrieval Enhanced by the Analysis of TExt (CREATE)17 systems, as well as the adoption of various open-source frameworks such as Apache Lucene18, Solr19, and Elasticsearch9 for institution-specific implementations.

Desideratum III: Support multi-modal retrieval and result integration

Fully determining whether a patient matches a phenotype may not always be possible with the information contained within any single data source, requiring additional data sources, e.g., for information documented in clinical narratives20,21,22 as opposed to within structured EHR data records, or information from radiology images and associated reports23,24,25.

In addition, traditional EHR-based data sources are potentially biased in that underserved/underrepresented populations will be similarly underrepresented in the data, a significant concern for data-driven downstream applications26,27,28. Inclusion of additional data sources helps ameliorate this issue. For instance, if the site performing in-silico phenotyping is a tertiary medical institution, a substantial amount of patient history will not be available in structured form (e.g., it is only available via scanned images or clinical text). If only a structured data source is used for phenotyping, the results will be biased, as rural/underrepresented populations may have a substantial history captured only in text or images29 and thus inaccessible to the phenotyping algorithm.

Complex phenotype definitions requiring such multi-modal computation consequently complicate in-silico implementation. Manual overhead is introduced in identifying the additional data sources needed, refining queries to local data representations, scoring, and integrating results.

These processes should therefore be supported within the tool itself, rather than being left to manual effort. While solutions do exist for multi-server querying in the general domain (e.g., cross-server joins in SQL), such solutions tend to be difficult to set up, limited to a single data type, and scored on a per-data-source basis, such that retrieval is not truly multi-modal.

Desideratum IV: Support extensions such that textual phenotype definitions can be autonomously converted into local code sets for review

Many phenotype definitions are distributed as textual descriptions30. For in-silico phenotyping, these textual descriptors are typically translated manually into equivalent institutional data source-computable representations31,32. Similarly, even phenotypes distributed as computable representations33,34,35 will typically need further refinement prior to local use, particularly if NLP is involved36. Such conversions/refinements (e.g., disease names to International Classification of Diseases, Tenth Revision (ICD-10) codes, or appropriate textual variants for NLP-derived data) are typically done over multiple iterations3, bottlenecking new algorithm implementation.

Predefining valuesets that correspond to specific phenotype criteria before distribution of the phenotype definition has been proposed37. Using such valuesets, however, may not always be feasible for implementing institutions. For instance, while the Logical Observation Identifiers Names and Codes (LOINC) vocabulary is used to codify lab tests, some institutions may use an institution-local code set without a LOINC mapping. Incorporating standard vocabularies into CDMs such as the OMOP CDM13 partially addresses this issue, but requiring usage of the CDM violates Desideratum I, and implementations are non-uniform5. In addition, the information required for a phenotyping task may not always be fully representable in the CDM. Explicitly defined valuesets, while helpful as an initial reference point, will therefore often still require additional manual conversion.

To reduce manual burden, increase mapping reusability, and accelerate the implementation of new phenotype definitions, tools should therefore provide the capability to autonomously convert textual descriptions into local representations. An interface should be provided for abstractors to review/refine conversions. In addition, the capability for individual institutions to implement mappings to local datasets from textual descriptions should be provided. Existing examples of such autonomous mapping systems include Eligibility criteria Information Extraction (EliIE)30 and Criteria2Query38. General clinical NLP systems such as MedTagger39 and the Clinical Text Analysis Knowledge Extraction System (cTAKES)40 are also repurposable for this task.

Desideratum V: Maximize reusability and data reproducibility, minimize technical overhead, and enhance downstream generalizability

The domain expertise of typical users of phenotyping tools differs from that of the personnel who integrate such tools with local data sources and extract information from those sources. As this setup process tends to be the bottleneck for in-silico phenotyping algorithm implementation, toolsets should ideally be reusable across multiple phenotyping tasks.

Beyond toolset reusability, however, individual phenotyping projects should also be reusable, from both mono-institutional and multi-institutional perspectives. As cohort retrieval is typically only an intermediate, but bottlenecking, step for other downstream applications, the ability to easily reuse identified cohorts is highly desirable to reduce duplicate development/phenotyping efforts31,41,42,43.

In addition, given that reproducibility has been found to be critically lacking for many datasets44,45,46,47, there is substantial benefit in centralized storage of both in-silico phenotyping algorithms and retrieved cohorts within a common toolset for later re-use and/or re-execution.

Finally, while cross-institution sharing of retrieved cohorts is unlikely due to privacy concerns, a common framework with sharable definitions will dramatically simplify multi-institution phenotyping execution, facilitating the development and evaluation of cross-institutionally generalizable digital health applications8,32,48.

These considerations are among the motivations behind clinical CDMs such as OMOP13 and PCORnet14.

Desideratum VI: Reflect that in-silico phenotyping is an iterative, human-in-the-loop process

The human interpretation and translation process from textual definitions to local data source representations can be highly subjective, leading to inter-abstractor variation both within and across clinical institutions32,49,50.

Consequently, iterative definition refinement is required. This may involve manual review by multiple clinical abstractors to identify missing data elements and adjudicate disagreements in definition interpretations, repeating until adequate performance is achieved51.

To support such algorithm development, refinement, and implementation processes, tools must therefore support: (a) editing/refining phenotype definitions, (b) surfacing evidence supporting classification for review, and (c) identifying abstraction differences for adjudication.

Graphical frontends for querying the various clinical CDMs (e.g., OHDSI Atlas52) make editing phenotype definitions and reviewing returned results accessible. Such systems, however, typically lack support for presenting supporting evidence and recording relevance judgements, motivating the development of systems such as PRAI53 and CREATE17.

An example IMPACT implementation

Here, we present a full-stack in-silico phenotyping solution implementing these desiderata, consisting of:

  • A web-based frontend user interface (UI) for phenotyping criteria definition and execution, as well as result relevance judgement and adjudication.

  • A middleware component supporting cohort management, retention of phenotype definitions and abstractor judgements, patient evidence retrieval, translation of textual descriptions, and job scheduling.

  • A backend that performs data source information retrieval and scoring, maps data to FHIR, and writes match status, patient scores, and associated evidence to a database.

An overview of the system architecture using an example fully on-premises deployment is provided in Fig. 2. Additional example diagrams using other infrastructure setups can be found on our GitHub https://www.github.com/OHNLP/IMPACT. In the ensuing subsections, we will detail how IMPACT implements our listed desiderata.

Fig. 2: IMPACT System Architecture.

A diagram showing an on-premises deployment of IMPACT. Desideratum I is implemented via the Local Data Warehouse, Desideratum II via the Terms Scoring Module, Desideratum III via the Evidence Aggregation Module, Desideratum IV via the Query Translator, Desideratum V via the Middleware Application, and Desideratum VI via the Web Frontend.

Infrastructure-agnostic, scalable, ranking-based patient-phenotype matching

To address scalability while maintaining flexibility across differing infrastructure setups, we implemented the backend using Apache Beam54, which can run both on a wide variety of horizontally scaling frameworks and on a single machine. For more details on horizontal scaling and the specific frameworks supported by the example IMPACT implementation, please refer to the Supplementary Information.
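To illustrate, the sketch below shows the general shape of such a Beam pipeline in Java. The inline records and counting-based score are toy placeholders of our own, not IMPACT code; the point is that the same pipeline runs on a single machine or on a cluster purely based on the runner selected at launch.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.KV;

public class ScoringPipelineSketch {
    public static void main(String[] args) {
        // The runner (DirectRunner on a single machine; Flink, Spark, or
        // Dataflow runners for horizontal scaling) is selected via args,
        // so the pipeline itself is infrastructure-agnostic.
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply("ReadRecords", Create.of(               // toy inline records; a real
                KV.of("patient-1", "condition:C0011860"),   // deployment would read from
                KV.of("patient-1", "medication:metformin"), // a data warehouse instead
                KV.of("patient-2", "condition:C0020538")))
         .apply("GroupByPatient", GroupByKey.create())
         .apply("ScorePatient", MapElements.via(
                 new SimpleFunction<KV<String, Iterable<String>>, KV<String, Double>>() {
                     @Override
                     public KV<String, Double> apply(KV<String, Iterable<String>> patient) {
                         // Placeholder scoring: count records. The actual
                         // backend applies BM25+ against phenotype criteria.
                         double score = 0;
                         for (String ignored : patient.getValue()) score++;
                         return KV.of(patient.getKey(), score);
                     }
                 }));

        p.run().waitUntilFinish();
    }
}
```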

For ranked scoring, we leverage a modification of BM25+55,56 to score patients by how well they match the phenotype, where each patient is treated as a “document” and clinical entities such as a diagnosis or a lab test are “tokens” within said “document”. First, leaf criteria (i.e., criteria that are not combinatorial boolean conditions such as “must have all of”, “at least n of”, or “none of”, but rather descriptions of a condition, medication, etc.) are grouped by clinical entity type, and BM25+ scoring is run separately for each group. Specifically, the base BM25+ score for a given patient P and leaf criterion ci can be calculated as shown in eq. (1):

$$BM{25}^{+}({c}_{i},P)=\ln \left(\frac{N-n({c}_{i})+0.5}{n({c}_{i})+0.5}+1\right)\cdot \left(\frac{f({c}_{i},P)\cdot ({k}_{1}+1)}{f({c}_{i},P)+{k}_{1}\cdot \left(1-b+b\cdot \frac{|P|}{\mathit{avgplen}}\right)}+\delta \right)$$
(1)

where N is the number of patients in the data source, n(ci) is the number of patients matched by leaf criterion ci, f(ci, P) is the number of distinct records for which patient P matches criterion ci, |P| is the patient term length (i.e., the number of entities of the same clinical data type (condition, medication, etc.) as ci), and avgplen is the average |P| across all patients in the cohort. The BM25+ scores of leaf criteria are then combined following the boolean logic defined by the phenotype definition: for OR (“must have at least n of”), the mean of the top n child scores is used; for AND (“must have all of”), the mean of all child scores is used; for NOT (“must not have”), the maximum of all child scores is multiplied by −1. For more details on the BM25+ algorithm, its selection as our default scoring algorithm, and associated hyperparameters, please refer to the Supplementary Information. A Java application programming interface (API) is also provided for implementing custom scoring algorithms.
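To make the scoring concrete, the following is a minimal Java sketch of eq. (1) and the combination rules above. It is our illustration rather than IMPACT's actual scoring module; the hyperparameter values (k1 = 1.2, b = 0.75, δ = 1.0) are conventional BM25+ defaults assumed for the example (see the Supplementary Information for IMPACT's actual defaults).

```java
import java.util.Arrays;

// Sketch of eq. (1) and the combination rules described above. The
// hyperparameters below are conventional BM25+ defaults (an assumption),
// not necessarily those shipped with IMPACT.
public class Bm25PlusSketch {
    static final double K1 = 1.2, B = 0.75, DELTA = 1.0;

    /**
     * @param n       n(c_i): patients matched by leaf criterion c_i
     * @param N       total patients in the data source
     * @param f       f(c_i, P): distinct records of patient P matching c_i
     * @param pLen    |P|: entities of the same clinical type as c_i
     * @param avgPLen average |P| across all patients
     */
    static double leafScore(long n, long N, long f, long pLen, double avgPLen) {
        double idf = Math.log((N - n + 0.5) / (n + 0.5) + 1);
        double tf = f * (K1 + 1) / (f + K1 * (1 - B + B * pLen / avgPLen)) + DELTA;
        return idf * tf;
    }

    // AND ("must have all of"): mean of all child scores.
    static double andScore(double[] children) {
        return Arrays.stream(children).average().orElse(0);
    }

    // OR ("must have at least n of"): mean of the top n child scores
    // (negate-sort-negate yields descending order).
    static double orScore(double[] children, int n) {
        return Arrays.stream(children).map(x -> -x).sorted()
                .map(x -> -x).limit(n).average().orElse(0);
    }

    // NOT ("must not have"): negated maximum of all child scores.
    static double notScore(double[] children) {
        return -Arrays.stream(children).max().orElse(0);
    }
}
```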

Data source flexibility via FHIR conversions, CDM support, and JSON-based plug-and-play configuration

For IMPACT, we chose HL7 Fast Healthcare Interoperability Resources (FHIR) R412 data structures as our internal representation for clinical data. For more details on FHIR and why it was chosen, please refer to the Supplementary Information.

So long as a mapping function can be written to produce FHIR resources, any data source can be used in IMPACT. To facilitate adoption, we supply built-in functions for common use cases. For SQL/JDBC-compatible data sources, a configurable mapping function is provided that allows users to specify SQL queries and associated FHIR mappings via a JavaScript Object Notation (JSON) config. For on-demand clinical NLP (i.e., artifacts extracted at runtime), we build upon our previous work57 to provide a clinical information extraction mapping function that extracts clinical entities from text and converts them58,59 to appropriate FHIR resources. Built-in support and mapping functions for the OMOP13 (including NLP tables) and PCORnet14 CDMs are also provided, allowing immediate, out-of-the-box use with minimal additional configuration. Custom mapping functions can also be included via implementation of a Java API.
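As an illustration of what a custom mapping function might look like, the sketch below maps rows of a hypothetical local diagnosis table to FHIR Condition resources using the open-source HAPI FHIR R4 model classes. The DataSourceMapper interface and the column names are illustrative assumptions, not IMPACT's actual Java API.

```java
import java.sql.ResultSet;
import java.sql.SQLException;
import org.hl7.fhir.r4.model.Condition;
import org.hl7.fhir.r4.model.Reference;

// Hypothetical mapping-function interface for illustration only; the
// actual IMPACT Java API may differ.
interface DataSourceMapper<T> {
    T toFhir(ResultSet row) throws SQLException;
}

// Maps one row of a local diagnosis table (assumed columns: patient_id,
// icd10_code) to a FHIR R4 Condition resource via HAPI FHIR model classes.
class DiagnosisTableMapper implements DataSourceMapper<Condition> {
    @Override
    public Condition toFhir(ResultSet row) throws SQLException {
        Condition condition = new Condition();
        condition.setSubject(new Reference("Patient/" + row.getString("patient_id")));
        condition.getCode().addCoding()
                .setSystem("http://hl7.org/fhir/sid/icd-10-cm")
                .setCode(row.getString("icd10_code"));
        return condition;
    }
}
```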

IMPACT supports cross-server data integration by allowing an arbitrary number of data sources to be queried for any given phenotyping task, so long as common patient IDs are used (or can be mapped) and a FHIR mapping function is defined. The data sources and mappings used for scoring are specified as part of a JSON configuration and can be customized on a per-project basis via the frontend GUI. Individual patient scores are computed per data source and are then combined using a weighted summation (please refer to the Supplementary Information section on BM25+ scoring for more details).
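A minimal sketch of this final combination step (the source names and weights here are illustrative; in IMPACT they come from the per-project configuration):

```java
import java.util.Map;

// Sketch of cross-source score combination: each data source yields its
// own score for a patient; the final score is a weighted sum. Source
// names and weights below are illustrative only.
class CrossSourceScorer {
    static double combine(Map<String, Double> perSourceScores, Map<String, Double> weights) {
        return perSourceScores.entrySet().stream()
                .mapToDouble(e -> weights.getOrDefault(e.getKey(), 1.0) * e.getValue())
                .sum();
    }

    public static void main(String[] args) {
        // e.g., a structured-data score and an NLP-derived score
        double total = combine(Map.of("omop", 3.2, "nlp", 1.7),
                               Map.of("omop", 1.0, "nlp", 0.5));
        System.out.println(total); // 4.05
    }
}
```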

Autonomous NLP-based conversion of textual phenotype definitions

To generate data source-computable representations from textual definitions, the middleware component contains an integrated MedTagger39,57 pipeline that performs named entity recognition and entity linking to Unified Medical Language System (UMLS)60 concept codes (CUIs). For more information on the UMLS, coding systems, and the necessity of codeset mapping, please refer to the Supplementary Information. Each leaf criterion (i.e., a clinical entity that is part of the phenotype definition, as opposed to a non-leaf criterion, i.e., the boolean logic such as “must have all/one/none of …” that links multiple leaf criteria together) automatically goes through this pipeline to generate a UMLS CUI code set if no computable representation is provided. This process can also be manually triggered by the end user. The UMLS CUIs are then converted to local data source formats depending on data source configurations. IMPACT offers built-in mapping to any UMLS source vocabulary, to the OHDSI Athena Vocabulary61, and, for the on-demand NLP data source, to any UMLS subset. In addition, manual mappings from UMLS CUIs can be provided via configuration. End users may also extend our Java API to implement their own mapping function.

The generated representations are then grouped by data source and displayed in the frontend web interface for refinement by clinical abstractors.
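Schematically, this translation flow can be pictured as follows; the interfaces and method names are hypothetical illustrations of the steps described above, not the actual IMPACT or MedTagger APIs.

```java
import java.util.List;
import java.util.Set;

// Hypothetical illustration of the textual-criterion-to-local-codes flow;
// not the actual IMPACT or MedTagger APIs.
interface ConceptExtractor {
    // NER + entity linking: free text -> UMLS CUIs
    // (e.g., "type 2 diabetes" -> C0011860)
    Set<String> extractCuis(String criterionText);
}

interface VocabularyMapper {
    // UMLS CUI -> codes in a target local vocabulary (e.g., ICD-10-CM)
    List<String> toLocalCodes(String cui, String targetVocabulary);
}

class CriterionTranslator {
    private final ConceptExtractor extractor;
    private final VocabularyMapper mapper;

    CriterionTranslator(ConceptExtractor extractor, VocabularyMapper mapper) {
        this.extractor = extractor;
        this.mapper = mapper;
    }

    // Produces a candidate code set for one leaf criterion; the result is
    // then surfaced in the web UI for abstractor review and refinement.
    List<String> translate(String criterionText, String targetVocabulary) {
        return extractor.extractCuis(criterionText).stream()
                .flatMap(cui -> mapper.toLocalCodes(cui, targetVocabulary).stream())
                .distinct()
                .toList();
    }
}
```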

Re-usable infrastructure and phenotype representations, and associated implications for data reproducibility and downstream algorithm generalizability

Thus far, we have primarily discussed backend components that must be set up on initial deployment. Once this setup is complete, the system can be re-used across a large variety of phenotyping tasks without additional setup or technical expertise (unless the addition of more data sources is desired), thus greatly accelerating implementation of new phenotyping algorithms. In addition, common re-usable infrastructure greatly accelerates porting to multi-institutional settings, facilitating generalizable algorithm development.

The middleware component retains abstractor-curated representations of a phenotype, enabling later re-use. To maximize re-use, users may choose to publicize these collections of representations within the IMPACT platform and share them with other users at the same institution.

Central storage of the refined algorithms and datasets on the middleware server also greatly enhances data provenance/reproducibility. Should the algorithm need to be re-run (e.g., against temporally updated data), the original local representations and associated refinements are retained, as is a specific record of which datasets/data sources were queried in the original retrieval. Similarly, should the retrieved patient cohort itself need to be re-used, the cohort, along with human judgements and associated query metadata, is retained for immediate download.

Human in the loop evidence review and adjudication

The web frontend offers an interface for phenotype definition (Fig. 3) and displays a list of patients sorted by match score (Fig. 4), with the option to switch to boolean filtering. Upon patient selection, the user is presented with the phenotype definition. The abstractor can view the evidence for each definition criterion and judge its correctness (Fig. 5). Switching to adjudication mode lists judgement conflicts between all abstractors.

Fig. 3: IMPACT phenotype definition page.

On the left panel, the user-defined phenotype definition is shown. On the top right, textual definitions can be mapped to datasource-local representations. On the bottom-right, datasource representations for specific criteria that were previously manually curated and shared can be retrieved and reused.

Fig. 4: IMPACT patient accrual results page.

A display of accrued patients that have been found to match a query phenotype definition (Fig. 3) in ranked order by closeness of match, alongside match status, abstraction/relevance judgement, and abstractor-supplied tags. An additional button to view matching criteria in more detail (Fig. 5) is also provided.

Fig. 5: IMPACT evidence display page.

A display of matching evidence by specific criteria elements. On the left pane, the query phenotype definition as a whole and whether a patient has been determined to match a given criterion is displayed. In the center, a listing of specific facts/evidence supporting a match/not match determination for the actively selected criterion is listed, with details on each individual fact/evidence item displayed on the right (including highlighted sections of clinical text, for NLP-based facts).

These capabilities bring several benefits. Firstly, having the relevant evidence aggregated and presented to the adjudicator by matching phenotype criterion accelerates determination of whether a given patient matches the query phenotype. In addition, to perform iterative refinement and fine-tuning of phenotyping algorithms, algorithm errors (and the evidence associated with said errors) must first be identified; having disagreement/adjudication functions built into the interface greatly facilitates this process. Finally, this human-in-the-loop interface allows for the inclusion of external contextual information that may be absent from, or contradict, the clinical documentation itself, which may be helpful for certain use cases (e.g., “patient was contacted for a clinical trial and indicated that he had an undocumented positive/disqualifying smoking status”).

Discussion

The desiderata presented here are not comprehensive: they are the result of our observations while implementing in-silico phenotyping, and experiences will vary. As such, we anticipate the framework will evolve as part of our open science efforts as user feedback is incorporated. In addition, individual approaches to the various desiderata exist, but to our knowledge they are spread across disparate toolsets rather than integrated into a common solution. For example, while Atlas does offer phenotyping query execution, it is limited to the OMOP CDM and does not support text retrieval. Similarly, EMERSE offers querying on text but has limited flexibility for multi-modal queries. Our current implementation is therefore intended to serve as a baseline that works reasonably well and is easy to adopt/extend, but may not be state-of-the-art. To facilitate customization with other approaches, the application allows for modular component swapping.

A trade-off of infrastructure flexibility is runtime performance. Specifically, FHIR mapping is done on-demand to obviate instantiating a new data warehouse; per instrumentation, around 90% of runtime is spent on this mapping. For reference, our observed performance using 128 central processing unit cores was 6 h for 1.9 million patients (with structured data and NLP). While this is still a significant improvement over manual efforts, pre-mapping and storing FHIR resources in a data store such as MongoDB or Elasticsearch, obviating on-demand mapping, would be more efficient.

Finally, while evaluations have previously been done on individual component implementations, a full evaluation of the system in aggregate would be helpful. Due to the characteristics inherent to the phenotyping task, a meaningful system-level evaluation would require multi-institutional deployment of the application and development of gold standard corpora at each site across a variety of phenotyping tasks. For more details, please refer to the Supplementary Information. We have left such efforts to future work.

Conclusions

Rapid in-silico clinical phenotyping on large datasets is of critical importance to accelerate research and development in the digital health domain. In this article, we have outlined some underlying complications hindering implementation of in-silico phenotyping and presented a framework, accompanied by an example implementation, addressing them.