Background & Summary

Biomedical corpora, such as scientific articles and patient reports, contain a wealth of knowledge and information that can be used to enable high-quality research. However, extracting knowledge from these free-text sources is challenging, not only because it requires understanding the meaning of natural language and the idiosyncrasies of the biomedical domain, but also because of the sheer volume of the data1. Biomedical natural language processing (NLP) techniques have been used to analyze information from free-text sources at scale, enabling the extraction and synthesis of biomedical information, and transforming unstructured data into a structured format2,3.

Compared to general corpora, biomedical data poses three main challenges for the semantic representations built by NLP models4,5,6,7. First, the number of biomedical entities is extremely high. For example, the SNOMED-CT ontology8 defines more than 300’000 medical concepts, while the UniProt Knowledgebase (UniProtKB)9 contains more than 550’000 curated proteins. Combined, the number of concepts described in these two knowledge organization systems is higher than the number of terms defined in dictionaries for many natural languages. Second, biomedical concepts often have many synonyms and alternative expressions. For example, in Fig. 1 the concept “C0007134” defined in the Unified Medical Language System (UMLS) thesaurus can be represented with at least four terms: “Renal Cell Carcinoma”, “RCC”, “Nephroid Carcinoma”, and “Adenocarcinoma”. Third, biomedical corpora are notorious for their overabundance of abbreviations and acronyms10. These abbreviations and acronyms are often polysemous, e.g., the acronym “RCC” in Fig. 1 belongs to two concepts – “C2826323” and “C0007134” – making their semantic representation even more challenging.

Fig. 1
figure 1

Illustration of concept ambiguity in the biomedical domain. Left: Example of the UMLS 2021AB data structure, where one term may refer to different concepts and one concept may be represented by different mentions. Right: Example of a paragraph with numerous polysemous acronyms and abbreviations from a biomedical journal52. Acronyms and abbreviations are highlighted in bold.

Entity linking11 and word sense disambiguation (WSD)12 are two NLP tasks that address the challenge of semantic representation in the biomedical field. Entity linking systems aim to connect terms mentioned in a text with corresponding concepts in a knowledge organization system13,14. For instance, the abbreviation “CA” in biomedical contexts can stand for either “calcium”, an essential mineral in the human body, or “cancer”, a group of diseases characterized by abnormal cell growth. An ideal entity linking system would employ contextual cues to correctly map “CA” to its standardized form in a chosen knowledge base, e.g., UMLS. This proper alignment helps reduce ambiguity and enhances the understanding of biomedical corpora15,16. In the biomedical domain, a wide array of datasets exists for entity linking, each employing distinct text corpora as their primary contextual resource. For instance, MedMentions17 and BC5CDR18 focus on biomedical abstracts, N2C2 201919 on clinical notes, and COMETA20 on social media content. These datasets are also differentiated by their target ontologies. For instance, MedMentions17 aligns with UMLS, BC5CDR18 connects to MeSH, and SMM4H21 links with the MedDRA ontology. Each dataset serves a unique purpose within the biomedical entity linking landscape.

Given a word in context, the objective of WSD is to associate the word with its correct meaning in a sense inventory22,23. For example, in the sentence “The patient has been suffering from a cold.”, the word cold should be associated with its medical meaning as opposed to its temperature or literature (i.e., the James Bond novel by John Gardner) meanings. Two of the most prominent biomedical WSD datasets are MSH WSD24 and NLM WSD25. The MSH WSD dataset, created by the National Library of Medicine, comprises 37’888 instances across 203 ambiguous terms and abbreviations from the Medical Literature Analysis and Retrieval System Online (MEDLINE) 2010 baseline, each linked to the MeSH ontology. Similarly, the NLM WSD dataset, also developed by the National Library of Medicine, includes 5’000 instances for 50 ambiguous biomedical terms, with each instance linked to UMLS. Despite the steps forward in this promising research direction, the main limitation of the standard WSD formulation is that word and sense representations are restricted to those defined by a fixed, predefined sense inventory26,27.

To bridge this gap, the Word-in-Context (WiC) benchmark26 presented a novel perspective on WSD, dropping the traditional requirement of a fixed sense inventory27. WiC formulates WSD as a binary classification task, where a polysemous word appears in two different sentences, and the task is to infer whether the word holds the same meaning in both or not. WiC has been integrated as a component of SuperGLUE28, a comprehensive evaluation framework designed to assess the performance of natural language understanding systems. XL-WiC27 and TempoWiC29 are two recent extensions of WiC, adapting it to 12 different languages and targeting the detection of meaning shifts on Twitter, respectively. The WiC-TSV (Target Sense Verification of Words in Context) dataset30 is closely related to WiC and focuses on a binary disambiguation task, determining whether the contextually intended sense of a word aligns with a pre-defined target sense. This dataset comprises general-domain instances in its training and development sets, while its test set is composed of general-domain instances as well as three domain-specific subsets: cocktails, medicine, and computer science. For all instances, the primary context source is the Wikilinks dataset31. For the biomedical instances in WiC-TSV specifically, the target sense definitions are sourced from the MeSH ontology. The main limitation of this dataset is the small number of biomedical instances it offers — 205 instances representing 8 unique biomedical terms. Moreover, the dataset’s scope is limited as it only includes target terms and definitions from the MeSH ontology. These constraints could potentially limit the effectiveness of the dataset in the development and evaluation of comprehensive WSD systems in the biomedical domain.

Despite significant progress in both WSD and entity linking in the biomedical domain15,31,32,33,34, there exists no benchmark that specifically targets the semantic representation of biomedical terms as a WiC-style task. To bridge this gap, we present the BioWiC35 benchmark, a novel dataset that provides high-quality annotations for the evaluation of contextualized term representations in the biomedical domain. Inspired by WiC26, we formulate BioWiC as a binary classification task, whose aim is to identify whether two target terms in their respective contexts have the same meaning. In addition to its focus on biomedical concepts, BioWiC differs from WiC in several ways. First, in contrast to WiC, which focuses on single-token words as targets, BioWiC allows targets that can be single words, phrases, or multiword expressions. Second, the two target terms of a BioWiC instance need not be identical: they may be different surface forms that refer to the same concept (or not). The dataset is named “BioWiC”, reflecting its design for the biomedical domain while showcasing its relation to the WiC task.

A key attribute of BioWiC35 is its flexibility and scalability. Unlike WSD and entity linking, which are restricted to concepts covered by existing knowledge graphs, BioWiC can be expanded independently of such resources. This is because the dataset can be extended to a novel concept simply by annotating instances in which two sentences contain the target concept, regardless of whether it is included in any existing knowledge organization resource. This flexibility allows for continual evolution and improvement, independent of updates to standardized resources, providing a more comprehensive and up-to-date resource for research in the biomedical field.

Methods

In this section, we present BioWiC35 – a novel benchmark dataset for evaluating in-context biomedical concept representations. First, we explain the resources we used to create the corpus and the pre-processing steps. We then provide an overview of the methodology used to create the dataset and discuss the processes for instance generation, dataset splitting, and quality assessment.

BioWiC resources

As shown in Table 1, BioWiC35 instances were built using annotations from the following manually curated biomedical entity linking datasets:

Table 1 General statistics of BioWiC35 resources.

MedMentions17: this is the largest entity linking dataset in the biomedical domain. It includes 4’392 PubMed abstracts and over 350’000 mentions linked to UMLS. The full MedMentions version covers 128 UMLS semantic types. However, as noted by its authors17, some concepts are either too broad (e.g., “Group, South Asia”) or cover peripheral and supplementary topics (e.g., “Rural Area, No difference”). Thus, we follow36,37 and focus on the officially released subset of MedMentions called ST21pv (21 Semantic Types from Preferred Vocabularies), which contains 203’282 biomedical mentions from 21 UMLS semantic types.

BC5CDR18: introduced in the BioCreative challenge, this dataset comprises 1’500 PubMed abstracts and 13’343 mentions linked to Medical Subject Headings (MeSH) concepts. The dataset covers a wide range of biomedical entities, including 4’409 chemicals, 5’818 diseases, and 3’116 instances of chemical-disease interactions.

NCBI Disease38: developed by the National Center for Biotechnology Information (NCBI), this dataset includes biomedical information derived from 793 PubMed abstracts. It comprises 6’892 disease mentions, each associated with their relevant standardized forms in the MeSH or Online Mendelian Inheritance in Man (OMIM) terminologies.

Data pre-processing

To have homogeneous word-in-context instances from different resources, we unified their format using the following steps:

  • Sentence segmentation: Each BioWiC35 instance is composed of a pair of target terms together with their respective sentences. We used the PySBD library39, version 0.3.4, to determine sentence boundaries in the source texts (i.e., publication abstracts). We parsed the documents and kept only sentences that contain mapped mentions.

  • Label unification: The source datasets of BioWiC35 map mentions (i.e., terms) to different target knowledge organization resources, i.e., MeSH, OMIM, and UMLS. This results in concept codes, i.e., unique identifiers in the target ontology, that are not directly comparable. To address this issue, we used UMLS as the main reference and transferred the concept identifiers from MeSH and OMIM to UMLS using the ontology mappings available in UMLS 2021AB. To avoid ambiguity, MeSH and OMIM concepts with multiple mappings in UMLS 2021AB were removed. A minimal code sketch of these two pre-processing steps follows this list.
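The following is a minimal sketch of the pre-processing pipeline under stated assumptions: only the PySBD call reflects the library actually named above, while `mesh_omim_to_umls` (a dictionary built from the UMLS 2021AB mapping files) and the per-abstract annotations given as (mention, source code) pairs are hypothetical inputs.

```python
import pysbd  # PySBD v0.3.4, as used for sentence boundary detection

segmenter = pysbd.Segmenter(language="en", clean=False)

def unify_label(source_code, mesh_omim_to_umls):
    """Map a MeSH/OMIM code to a single UMLS CUI using the 2021AB mappings.

    `mesh_omim_to_umls` is a hypothetical dict {source_code: [CUI, ...]};
    codes with zero or multiple candidate CUIs are discarded to avoid ambiguity.
    """
    cuis = mesh_omim_to_umls.get(source_code, [])
    return cuis[0] if len(cuis) == 1 else None

def collect_sentences(abstract_text, annotations, mesh_omim_to_umls):
    """Segment an abstract and keep only sentences containing a mapped mention.

    `annotations` is a hypothetical list of (mention, source_code) pairs for
    the abstract; returns (sentence, mention, CUI) triples.
    """
    kept = []
    for sentence in segmenter.segment(abstract_text):
        for mention, code in annotations:
            cui = unify_label(code, mesh_omim_to_umls)
            if cui is not None and mention in sentence:
                kept.append((sentence, mention, cui))
    return kept
```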

BioWiC construction

BioWiC35 instances follow a similar format to WiC, where each instance involves a pair of biomedical terms (w1 and w2) and their corresponding sentences (s1 and s2). The task is to classify each instance as True if the target terms carry the same meaning across both sentences or False if they do not. We represent each instance as a tuple pair t = [(s1,w1),(s2,w2)]: y where w1 and w2 are the target terms, s1 and s2 are the corresponding sentences, and y is the associated binary label. Table 2 presents some examples of BioWiC instances. In contrast to WiC, where both target terms of each instance always share the same lemma, BioWiC allows for variations such as abbreviations, synonyms, identical terms, and terms with similar surface forms.

Table 2 BioWiC35 instances, drawn from the test split.

To evaluate challenging scenarios for semantic representation, such as synonymy, polysemy, and abbreviations, BioWiC35 is divided into four main groups of instances. Group A (term identity) contains instances where the target terms w1 and w2 are identical. In group B (abbreviations), either w1 or w2 may be the abbreviation of the other. Group C (synonyms) consists of instances where w1 and w2 could be synonyms (according to UMLS). Lastly, group D (label similarity) includes instances where w1 and w2 share similar surface forms. We employed the following five steps to generate the BioWiC instances:

  1. Sentence collection: We first gathered all the sentences from the source datasets manually annotated with terms M(W, C) = {(w1, c1), (w2, c2), …, (wn, cn)}, where w ∈ W is a term and c ∈ C is a concept defined in UMLS. Then, we created a set S = {s1, …, sn}, where each sentence s ∈ S has at least one mention w ∈ W linked to a concept c ∈ C.

  2. Tuple creation: For each sentence s ∈ S, we randomly chose one of the annotated mentions w and created a set of sentence-term tuples P = {(s1, w1), (s2, w2), …, (sn, wn)}, where for each (si, wi) ∈ P, si includes wi. We then paired the tuples of P and created a collection of tuple pairs:

    $$T=\left\{\left[({s}_{1},{w}_{1}),({s}_{2},{w}_{2})\right],\left[({s}_{1},{w}_{1}),({s}_{3},{w}_{3})\right],\ldots ,\left[({s}_{m},{w}_{m}),({s}_{n},{w}_{n})\right]\right\}.$$
  3. Instance definition and labeling: We considered each pair t = [(si, wi), (sj, wj)] ∈ T as a potential BioWiC35 instance, where wi and wj serve as target terms and si and sj are their corresponding sentences, respectively. Each instance is labeled as y = True if the target terms wi and wj were linked to the same or a synonymous UMLS concept, and as y = False if they were not. We then added the label y to each tuple pair to create the dataset of candidate BioWiC instances t = [(si, wi), (sj, wj)]: y.

  4. Tuple selection: We categorized each instance t: y into one of the main groups of BioWiC35. Group A included instances for which wi and wj are identical. Group B included instances where wi is the abbreviated form of wj or vice versa. Group C included instances where wi and wj could be synonyms. Group D included instances where wi and wj are not identical but share similar surface characteristics.

  5. Dataset splitting: We divided the instances into three parts: a training set, a development set, and a test set, providing a consistent and reliable framework for model training and evaluation.

For clarity, in Fig. 2 we provide an example of building BioWiC35 instances for the target term “delivery”. Initially, we preprocess the source data and extract all sentences in which “delivery” is linked to UMLS. We transform each sentence into the sentence-term tuple (si, w) format, where si represents a sentence containing the term w = “delivery”. Subsequently, we pair all possible combinations of the tuples (si, w) identified in the preceding step to generate BioWiC instances t = [(si, w), (sj, w)], where “delivery” serves as the target term in both sentences. Finally, we classify each instance as True when “delivery” is mapped to the same CUI code in both sentences and as False when it is not.
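As a minimal illustration of steps 2 and 3 above, the sketch below pairs the sentence-term tuples and labels each pair by comparing CUIs; the tuple format is an assumption, and the synonym-based True labels and group filtering described in the following sections are omitted.

```python
from itertools import combinations

def build_candidate_instances(sentence_term_tuples):
    """Pair (sentence, term, CUI) tuples and label each pair by CUI identity.

    `sentence_term_tuples` is a hypothetical list of (sentence, term, cui)
    triples produced by the pre-processing step.
    """
    instances = []
    for (s_i, w_i, cui_i), (s_j, w_j, cui_j) in combinations(sentence_term_tuples, 2):
        instances.append({
            "sentence1": s_i, "term1": w_i,
            "sentence2": s_j, "term2": w_j,
            "label": int(cui_i == cui_j),  # True iff both mentions share a CUI
        })
    return instances
```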

Fig. 2
figure 2

The overall pipeline of the BioWiC35 construction process. Step 1: Pre-process the source documents into a consistent format. Step 2: Identify and retrieve sentences including the term “delivery” linked to UMLS. Step 3: Pair the retrieved sentences to generate BioWiC instances. In Step 3, the green box shows an example of a BioWiC instance with the same target concept, while the red boxes show examples of different target concepts.

Instance generation

To build the BioWiC35 instances, we considered two main challenges of biomedical texts: semantic and lexical ambiguity. The presence of semantically ambiguous terms, that is, terms that can have multiple meanings in different contexts, is one of the most difficult aspects of biomedical text processing3. For instance, the term staph can denote a type of disease (usually followed by infection) or the bacterium itself in other contexts. In addition, the same term can carry different meanings when used in different subdomains. To assess the capability of language models to provide context-sensitive representations for a term across different contexts, we included a group of instances (group A) in BioWiC in which a target biomedical term appears in two different contexts. Another key challenge in the biomedical domain is that terms can be expressed in various forms or lexical formats, even if they refer to the same biomedical concepts. To account for this challenge, we developed three other groups of BioWiC instances to measure language models’ ability to use context and produce similar representations for synonymous terms with different surface strings. We categorize these into three groups: i) abbreviations, ii) synonyms, and iii) concepts with similar surface characteristics. Each instance in these groups contains two target terms with different surface forms, each occurring in a different context, and the models should identify whether these terms refer to the same biomedical concept or not.

Instance groups

In what follows, we discuss how we created the instances for each group:

  (A) Term identity: To create these instances, we use the tuple pair list built in step 3 of the construction pipeline and consider every pair t = [(si, wi), (sj, wj)] ∈ T as an instance of group A if wi and wj are identical. We classified each t as True if both terms were linked to the same UMLS CUI and False otherwise. Two instances of this type are shown in Table 2 (examples one and two). In the first example, both target terms refer to the same concept and have the same meaning (i.e., toxicity that impairs or damages the heart, UMLS CUI C0876994), so the instance label is True. In the second instance, however, the target terms are mapped to different CUIs (C0032914 and C0034065), and thus the instance label is False.

  (B) Abbreviations: In this group, one of the target terms is the abbreviated form of the other, e.g., heart rate and hr. From the tuple pair list, we select all the pairs t = [(si, wi), (sj, wj)] ∈ T in which wi is the abbreviated form of wj or vice versa. To verify this, we generate the abbreviated form of wi by concatenating the initial letters of its constituent words (e.g., “FEO” is considered the abbreviation of “familial expansile osteolysis”) and check whether wj matches it. We perform the same procedure for wj as well. If either wi or wj is the abbreviation of the other, we categorize the tuple pair into this group. Each tuple pair is then assigned the label True if wi and wj are mapped to the same UMLS concept and False otherwise. As shown in example 3 of Table 2, “FEO” in sentence 1 is used as the abbreviation of “familial expansile osteolysis”, so the instance is labeled as True. In example 4, however, the target term PD does not have the same meaning as “Periodontal disease” and thus the instance is labeled as False.

  (C) Synonyms: This group contains instances in which the target terms w1 and w2 appear together in the same UMLS synonym set. Each UMLS synonym set consists of a group of biomedical terms that express the same meaning. As shown in Fig. 1, due to semantic ambiguity, biomedical terms with several distinct meanings can appear in several distinct synonym sets. For instance, “Adenocarcinoma” could have the same meaning as either “Renal Cell Carcinoma” (CUI C0007134) or “Carcinoma in adenoma” (CUI C0001418). Consequently, we consider such terms as potential synonyms, which may or may not hold the same meaning depending on their context. To create the instances, we collect all the tuple pairs t = [(si, wi), (sj, wj)] from T in which wi and wj are both present in a UMLS synonym set. We then assigned the label True to each instance if wi and wj are linked to the same UMLS CUI code, and False if they are not. Two examples of this group of instances are shown in Table 2.

  (D) Label similarity: Despite their broad coverage of synonyms and semantic types, UMLS synonym sets still lack many reformulated term variants that are used in biomedical texts. For instance, the concept “chronic pseudomonas aeruginosa infection” can be reformulated as “chronic PA infection”, which is not covered by UMLS. To address this and to cover a wide range of target concepts with different formats in the dataset, we developed a fourth group of instances in which the corresponding terms have a high Levenshtein similarity ratio (see examples 7 and 8 in Table 2). To create such instances, we retrieve all tuple pairs t = [(si, wi), (sj, wj)] ∈ T in which the Levenshtein ratio between wi and wj exceeds the threshold of 0.75. Each tuple t is marked as True when wi and wj correspond to the same UMLS concept, and False otherwise. The main idea behind this strategy was to include instances where target terms have similar surface forms but may refer to different medical concepts. Two instances of this group are shown in Table 2: in example 7, both “piebald” and “piebaldism” refer to the same concept, whereas in example 8, “anemic” and “anaemia” refer to two different concepts. A minimal code sketch consolidating the grouping heuristics of groups A–D follows this list.
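The sketch below consolidates the grouping heuristics described above under two assumptions: terms are compared in lowercase, and difflib's ratio is used as a stand-in for the Levenshtein similarity ratio with the 0.75 threshold; `synonym_sets` is a hypothetical list of lowercased UMLS synonym string sets.

```python
from difflib import SequenceMatcher

def make_abbreviation(term):
    """Naive abbreviation: initial letters of each whitespace-separated part,
    e.g. 'familial expansile osteolysis' -> 'feo'."""
    return "".join(part[0] for part in term.lower().split())

def surface_similarity(a, b):
    """Approximate surface similarity; the paper uses a Levenshtein-based
    ratio with a 0.75 threshold, approximated here with difflib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def assign_group(w1, w2, synonym_sets):
    """Assign a candidate term pair to one of the BioWiC groups (or None).

    `synonym_sets` is a hypothetical list of sets of UMLS synonym strings.
    """
    t1, t2 = w1.lower(), w2.lower()
    if t1 == t2:
        return "term_identity"                              # group A
    if make_abbreviation(t1) == t2 or make_abbreviation(t2) == t1:
        return "abbreviations"                              # group B
    if any(t1 in s and t2 in s for s in synonym_sets):
        return "synonyms"                                   # group C
    if surface_similarity(t1, t2) > 0.75:
        return "label_similarity"                           # group D
    return None
```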

Data Records

The BioWiC35 dataset is available on Figshare (https://doi.org/10.6084/m9.figshare.25611591.v2), HuggingFace (https://huggingface.co/datasets/hrouhizadeh/BioWiC), and GitHub (https://github.com/hrouhizadeh/BioWiC). It comprises three distinct JSON files: the training set, development set, and test set. Each instance within a JSON file includes ten fields. The first two items, term1 and term2, followed by sentence1 and sentence2, correspond respectively to the two target terms and the two sentences of each instance. The character-level positions of the target terms are given by start1 and start2, indicating the starting positions, and end1 and end2, marking the end positions within their respective sentences. Furthermore, the cat attribute classifies each instance into one of the BioWiC groups, i.e., term_identity, abbreviations, synonyms, or label_similarity. Lastly, a binary label is attached to each instance, taking the value of either 1 (True) or 0 (False).
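As a usage sketch, the snippet below loads one split and accesses the ten fields described above; the file name and the assumption that each split is stored as a JSON array are illustrative (if the files use JSON Lines, read them line by line instead).

```python
import json

# Path is an assumption; point it at a split downloaded from Figshare,
# HuggingFace, or GitHub.
with open("train.json", encoding="utf-8") as f:
    train = json.load(f)

instance = train[0]
# Fields described in this section: term1, term2, sentence1, sentence2,
# start1, start2, end1, end2, cat, label
print(instance["cat"], instance["label"])
print(instance["term1"], "<->", instance["term2"])
# The start/end offsets should recover the target term from its sentence.
print(instance["sentence1"][instance["start1"]:instance["end1"]])
```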

Technical Validation

Dataset splits

We divided the BioWiC35 instances into three main parts, i.e., training, development, and test sets, thereby establishing a structured and robust framework for model development and evaluation. To do so, we first built the test set of 2’000 instances with three constraints: 1) only one instance for each unique pair of target terms, 2) no sentence repetition between instances, and 3) no overlap between the sentences and term pairs of the test set and those of the training or development sets. The primary objective of rules 1 and 2 was to ensure a diverse range of term pairs and sentences in the test set. Rule 3 was introduced to assess the generalization power of the language models, i.e., the model’s ability to adapt to new, previously unseen data. Taking into account these constraints, we randomly sampled 2’000 term pair instances from the groups defined in the Instance groups section (800, 200, 800, and 200 samples for the term identity, abbreviations, synonyms, and label similarity groups, respectively) to build the test set. Finally, we used the remaining instances to create the training set. General statistics of the different splits of BioWiC are reported in Table 3. In addition, following WiC, we balanced all the data splits in terms of labels, i.e., 50% True and 50% False.
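A greedy sampling sketch of this test-set construction, under the stated constraints, might look as follows; the instance representation and the group-wise quotas are illustrative, and the exact sampling procedure may differ from the one used to build the released splits.

```python
import random

def sample_test_set(candidates, quotas):
    """Greedy test-set sampling honouring the three constraints above.

    `candidates` is a hypothetical dict {group: [instance, ...]} and `quotas`
    maps groups to target sizes (e.g. 800/200/800/200); each instance has
    term1/term2/sentence1/sentence2 fields.
    """
    test_set, used_pairs, used_sentences = [], set(), set()
    for group, instances in candidates.items():
        random.shuffle(instances)
        taken = 0
        for inst in instances:
            pair = frozenset((inst["term1"].lower(), inst["term2"].lower()))
            sentences = {inst["sentence1"], inst["sentence2"]}
            if pair in used_pairs or sentences & used_sentences:
                continue  # constraint 1: unique term pairs; constraint 2: unique sentences
            test_set.append(inst)
            used_pairs.add(pair)
            used_sentences |= sentences
            taken += 1
            if taken == quotas[group]:
                break
    # Constraint 3: exclude used_pairs and used_sentences when building the
    # training and development sets.
    return test_set, used_pairs, used_sentences
```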

Table 3 General statistics of BioWiC35, divided by splits.

During the compilation of the training set, we adopted a simple approach in which a sentence could appear in only a limited number of examples, i.e., up to a fixed frequency threshold. We built the training set with various thresholds, ranging from 1 to 200, to determine the most appropriate limit. As illustrated in Fig. 3, the size of the training set, the number of unique concepts, and the number of semantic types in the training set varied with these thresholds. We observed that once the sentence-recurrence threshold exceeded 100, the incremental growth of the training set size as well as of the number of unique concepts was marginal, registering below 2%. Furthermore, with higher thresholds the number of unique semantic types included in the training set does not exceed 98. As a result, we chose 100 as our cut-off point.
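The following is a minimal sketch of how such a cap on sentence repetition could be applied when compiling the training set, assuming the instance representation used above; the threshold of 100 is the one selected in Fig. 3, and the exact filtering procedure is an assumption.

```python
from collections import Counter

def cap_sentence_repetition(instances, max_per_sentence=100):
    """Keep an instance only if neither of its sentences has already been
    used `max_per_sentence` times."""
    counts = Counter()
    kept = []
    for inst in instances:
        s1, s2 = inst["sentence1"], inst["sentence2"]
        if counts[s1] < max_per_sentence and counts[s2] < max_per_sentence:
            kept.append(inst)
            counts[s1] += 1
            counts[s2] += 1
    return kept
```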

Fig. 3
figure 3

Impact of different thresholds for max sentence repetition in the training set. Left: Impact on the training set size; Center: Impact on the frequency of unique concepts; Right: Impact on the frequency of unique UMLS semantic types.

Quality control

UMLS is a widely used resource in the biomedical domain, covering a wide range of biomedical concepts. A key feature of UMLS is its capability to connect concepts from different biomedical terminologies, such as SNOMED CT, LOINC, MeSH, and RxNorm. Through this mapping, a single code from a source terminology can be mapped to several UMLS CUI codes. For instance, MeSH code D020274, which represents “Autoimmune Diseases of the Nervous System”, is mapped to three distinct UMLS CUIs, C5671289, C0751871, and C0751872, for “Autoimmune Encephalitis”, “Autoimmune Diseases of the Nervous System” and “Immune Disorders, Nervous System”, respectively. In our dataset, there are instances where different CUI codes are assigned to the target terms, resulting in the False label, even though those CUIs correspond to the same code in an alternative ontology and the underlying concepts are therefore equivalent. To prevent any confusion and to ensure the dataset’s reliability, we employed a pruning strategy and removed the instances in which the target terms are mapped to different UMLS codes that correspond to the same code in another ontology. The process also involved eliminating any pairs whose CUIs are considered synonyms as per the MRREL.RRF file from UMLS. We also followed WiC26 and XL-WiC27 and filtered out all the pairs where one CUI is directly related to the other as a broader concept in the UMLS hierarchy.
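A minimal sketch of the relation-based part of this pruning is given below. MRREL.RRF is pipe-delimited with CUI1, REL, and CUI2 in columns 0, 3, and 4; treating SY as synonymy and RB/RN as the broader/narrower relations used for filtering is an assumption, as are the `cui1`/`cui2` fields attached to each candidate instance during construction.

```python
def load_cui_relations(mrrel_path, rels=("SY", "RB", "RN")):
    """Collect CUI pairs connected by the given UMLS relation types in MRREL.RRF."""
    related = set()
    with open(mrrel_path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("|")
            cui1, rel, cui2 = fields[0], fields[3], fields[4]
            if rel in rels:
                related.add(frozenset((cui1, cui2)))
    return related

def prune_instances(instances, related_cuis):
    """Drop candidate pairs whose two distinct CUIs are synonymous or
    hierarchically related; cui1/cui2 are hypothetical construction-time fields."""
    kept = []
    for inst in instances:
        c1, c2 = inst["cui1"], inst["cui2"]
        if c1 != c2 and frozenset((c1, c2)) in related_cuis:
            continue
        kept.append(inst)
    return kept
```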

Cross-mapping validation

To determine the quality of BioWiC35, we extracted two random subsets of 100 instances (with 50 shared instances) from the test set and asked two domain experts to label them. Both annotators were medical doctors with extensive experience in semantic annotation. They were provided with a set of instructions including a short description of the task as well as a few examples of labeled instances. During the annotation process, no external information from UMLS or any other resource was provided to the experts. The annotators reached a Cohen’s Kappa score of 0.84 on the shared instances, which is indicative of the high quality of the dataset. An average human accuracy of 0.80 (0.80 and 0.81 for annotator 1 and annotator 2, respectively) was obtained through the annotation process, which can be viewed as an upper bound for model performance.
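For reference, the reported agreement statistics can be computed as follows with scikit-learn, assuming the expert labels and the BioWiC gold labels are aligned lists of 0/1 values over the same instances.

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score

def agreement_report(gold_labels, annot1_labels, annot2_labels):
    """Inter-annotator Cohen's Kappa (on the shared instances) and each
    annotator's accuracy against the BioWiC gold labels."""
    return {
        "kappa_annot1_vs_annot2": cohen_kappa_score(annot1_labels, annot2_labels),
        "accuracy_annot1": accuracy_score(gold_labels, annot1_labels),
        "accuracy_annot2": accuracy_score(gold_labels, annot2_labels),
    }
```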

Dataset coverage

In this section, we focus on the scope of the dataset by studying the unique CUI codes and comparing them to the total number of CUIs present in UMLS. Additionally, we investigate the semantic types within the dataset, examining both the number included and the proportions among them. Table 3 shows that BioWiC35 covers over 5’000 unique CUI codes from UMLS. Additionally, BioWiC includes almost 80% of UMLS semantic types, i.e., 99 out of 127, across the different splits. This wide coverage is indicative of the dataset’s comprehensiveness and its potential as a valuable resource for biomedical research. In Fig. 4, we present the ratio of the top 10 semantic types and semantic groups included in BioWiC. Additionally, Table 4 shows the frequency and proportion of target terms across the different BioWiC splits, categorized by their token counts.

Fig. 4
figure 4

Distribution of UMLS semantic types and semantic groups in BioWiC35. Left: Top 10 semantic types; Right: Top 10 semantic groups.

Table 4 Distribution of terms based on token count across different BioWiC35 splits, presented in counts and corresponding proportion.

Compared to WSD datasets in the biomedical domain, BioWiC35 stands out as the most comprehensive in terms of the variety of unique biomedical terms it includes, covering a total of 7’413 distinct terms. This range far surpasses that of other datasets, such as MSH WSD24 with 203 terms, NLM WSD25 with 50 terms, and WiC-TSV30, which includes only 8 terms. Moreover, the extensive scope of BioWiC is emphasized by its incorporation of 99 different semantic types from UMLS, in contrast to the narrower range covered by other datasets, i.e., MSH WSD24, NLM WSD25, and WiC-TSV30, which include 81, 46, and 8 UMLS semantic types respectively.

Baseline experiments

We implemented several baseline models, covering all the baselines of the SuperGLUE28 benchmark suite. Considering that all divisions of BioWiC35 are balanced in terms of positive and negative instances, we take the same approach as WiC26 and use accuracy to measure the performance of the different models, i.e., the proportion of correctly predicted cases (true positives and true negatives) over the total number of samples. The baselines include:

Random: We provide a lower bound for the performance by randomly assigning a class to each instance.

GloVe: In this baseline, we used GloVe-840B40 pre-trained embeddings. We averaged token embeddings to represent each sentence and fed the resulting feature vector to an MLP classifier (with 128 neurons in the hidden layer and one neuron in the output layer).

Bi-LSTM: We also trained a BiLSTM model (with 128 hidden units) to capture both the forward and backward context information of the sentence. The BiLSTM model output was fed into a fully connected layer with one output neuron for binary classification.

BERT: We explored the performance of several BERT-based models to provide stronger baselines for the BioWiC35 task. To evaluate how well general-domain language models generalize to biomedical concepts, our baselines include three general transformer-based language models – BERT41, RoBERTa42, and ELECTRA43. In addition, to assess the effect of prior domain knowledge on biomedical concept representation, we evaluated three language models pre-trained on biomedical and clinical data – BioBERT44, Bio_ClinicalBERT45, and SciBERT46 – trained on PubMed abstracts and PubMed Central articles, the MIMIC-III database47, and papers from Semantic Scholar (mostly in the biomedical domain), respectively. To fine-tune each model, we used the Sentence-BERT48 framework, which incorporates siamese and triplet network architectures to generate semantically meaningful embeddings. We pre-processed each input sentence by enclosing the target terms within double quotes, emphasizing their significance, and fed the modified sentences into the BERT architecture for further processing. We also experimented with a different pre-processing technique for input sentences in our BERT models; Supplemental Table 1 in the Supplementary Information compares the results of both strategies.
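A minimal fine-tuning sketch with the Sentence-BERT framework (sentence-transformers) is shown below; the model identifier, hyperparameters, and the simple quote-insertion helper are illustrative assumptions rather than the authors' exact training script.

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

def mark_term(sentence, term):
    """Enclose the first occurrence of the target term in double quotes."""
    return sentence.replace(term, f'"{term}"', 1)

def fine_tune(model_name, train_instances, epochs=3):
    """Siamese fine-tuning sketch for the binary BioWiC task.

    `train_instances` is a hypothetical list of dicts with sentence1/term1,
    sentence2/term2 and a binary label.
    """
    model = SentenceTransformer(model_name)  # e.g. "dmis-lab/biobert-v1.1" (illustrative)
    examples = [
        InputExample(
            texts=[mark_term(x["sentence1"], x["term1"]),
                   mark_term(x["sentence2"], x["term2"])],
            label=x["label"],
        )
        for x in train_instances
    ]
    loader = DataLoader(examples, shuffle=True, batch_size=16)
    loss = losses.SoftmaxLoss(
        model=model,
        sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
        num_labels=2,
    )
    model.fit(train_objectives=[(loader, loss)], epochs=epochs)
    return model
```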

Llama-2: We also conducted experiments using three different versions of the Llama-2 language model, i.e., Llama-2-7b, Llama-2-13b, and Llama-2-70b49. Our experiments involve a few-shot approach, in which the language model receives a small number of labeled examples before making predictions, and a fine-tuning approach, in which we used the BioWiC35 instances to fine-tune the language models.
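Since the exact prompt used for the few-shot Llama-2 evaluation is not specified, the template below is only a hypothetical illustration of how the binary task can be posed to a generative model.

```python
# Hypothetical few-shot prompt template; wording and layout are assumptions.
FEW_SHOT_TEMPLATE = """\
Decide whether the quoted terms have the same meaning in the two sentences.
Answer with "True" or "False".

Sentence 1: {ex_s1}
Sentence 2: {ex_s2}
Term 1: "{ex_w1}"  Term 2: "{ex_w2}"
Answer: {ex_label}

Sentence 1: {s1}
Sentence 2: {s2}
Term 1: "{w1}"  Term 2: "{w2}"
Answer:"""

def build_prompt(example, query):
    """Fill the template with one labeled demonstration and the query instance."""
    return FEW_SHOT_TEMPLATE.format(
        ex_s1=example["sentence1"], ex_s2=example["sentence2"],
        ex_w1=example["term1"], ex_w2=example["term2"],
        ex_label="True" if example["label"] else "False",
        s1=query["sentence1"], s2=query["sentence2"],
        w1=query["term1"], w2=query["term2"],
    )
```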

BERT/Llama-2++: We conducted additional experiments in which we incorporated the general-domain data from the WiC dataset26 as additional training data for fine-tuning the transformer-based models. By expanding our training data with extra instances from the general domain, we aim to explore the potential benefits of leveraging diverse sources of information for the BioWiC35 task.

Results

The performance of the baseline models on the BioWiC35 benchmark is presented in Fig. 5. The results indicate that state-of-the-art language models fine-tuned on the BioWiC training set surpass the random baseline by a margin of 18% to 26% (p-value < 0.001). Both the GloVe and BiLSTM baselines are unable to compete with the fine-tuned large language models. Overall, Llama-2-70b outperforms all competing methods, achieving the highest accuracy. The closest to Llama-2-70b in terms of accuracy are BioBERT, BioBERT++, and SciBERT++, which Llama-2-70b outperforms by 2% (p-value = 0.04). It is worth noting that, in contrast to the different variants of the Llama-2 language model, which are pre-trained on general-domain corpora, BioBERT is pre-trained on large biomedical corpora, allowing it to understand complex biomedical texts effectively44. However, Llama-2-70b achieves state-of-the-art performance, illustrating its high capability for adapting to the task of representing biomedical terms in context.

Fig. 5
figure 5

Accuracy of the baseline models on the BioWiC35 test set. ++ indicates that data from WiC was added to the training set. Min, mean, median, and max statistics exclude the random performance.

In our analysis of the different Llama-2 models, we observe a significant difference in performance depending on the evaluation method, i.e., few-shot learning or fine-tuning. As shown in Fig. 5, Llama-2-7b surpassed the random baseline by only a slight margin in the few-shot setting; however, its performance increased by 17% upon fine-tuning (p-value < 0.001). This pattern of performance gains was consistent across the other Llama-2 variants. Specifically, after fine-tuning, the accuracy of Llama-2-13b improved from 0.61 to 0.73 (p-value < 0.001), while Llama-2-70b increased from 0.68 to 0.78 (p-value < 0.001). These observations emphasize the crucial role of the fine-tuning phase in enhancing the contextualized representation of biomedical terms. Additionally, our results are consistent with a prior study50, which demonstrated that the GPT-3 language model failed to surpass random baseline performance on the WiC dataset under a few-shot evaluation.

Comparing the performance of the different BERT-based models shows that BioBERT and SciBERT achieve the highest performance across the different groups of the test set. Overall, BioBERT outperforms SciBERT by a slight margin of 1% accuracy, i.e., 0.76 vs. 0.75 (p-value = 0.04). The superior performance of BioBERT and SciBERT can likely be attributed to their pre-training on large biomedical corpora. This provides them with in-depth knowledge of biomedical terminologies and concepts, leading to more accurate representations of terms and expressions compared to BERT-based models pre-trained on general-domain corpora44. Surprisingly, Bio_ClinicalBERT performs similarly to the general-domain BERT models and does not align with the other, superior biomedical BERT variants.

Further analysis of the results for the different groups indicates that the “term identity” and “synonyms” groups present a greater challenge than the other groups for all models. Regarding model performance on the “label similarity” group, it is plausible that minor changes in term structure carry meaningful distinctions in biomedical contexts. Models might exploit structural alterations, such as the addition of suffixes or prefixes, that influence the meanings of terms. This understanding of term structure can be particularly relevant and beneficial for performance on the “label similarity” group. As for the “abbreviations” group, it is important to note that abbreviations are commonly used in the biomedical domain. The models may have encountered these abbreviations (along with their full forms) in various contexts during both the pre-training and fine-tuning phases. This exposure to abbreviations in diverse settings helps the models effectively learn and capture their meanings. The “synonyms” group appears to be more difficult for models to handle. This might be because, in the biomedical field, a single term can have multiple synonyms with varied forms and each synonym can have multiple meanings (as shown in Fig. 1), which makes it hard for the models to recognize synonymous terms with different surface forms across different contexts. For the “term identity” group, since the target terms are identical, the models cannot rely on lexical cues and must instead rely on comprehension of the surrounding context, which makes the task more challenging.

In our study, we also conducted experiments in which we incorporated general-domain training data from WiC26 into our dataset (denoted by appending ++ to the name of the language model). We observe slight fluctuations in the performance of the models when merging the general- and biomedical-domain datasets. This could be explained by the fact that the model faces potential distribution shifts due to the distinct nature of each domain. Despite the increased volume of training data, this misalignment in data distributions can offset the advantages of the added samples. Thus, while the combined dataset is larger, it does not necessarily lead to improved model performance in the biomedical context.

Alternative evaluation scenarios

To gain a deeper understanding of how models perform on the BioWiC35 benchmark, we analyzed their performance in two alternative scenarios. First, we assessed how the data distribution impacts their results, considering seen and unseen data. Second, we assessed the influence of the training corpus on performance; in this scenario, we are interested in whether learning from general-domain examples enables models to generalize to the biomedical domain.

Seen vs unseen: In this analysis, the aim is to evaluate the variation in performance based on whether the target terms in the instances have been previously seen during training or not. For this purpose, we used the models fine-tuned on the BioWiC35 training set and divided the test set into two categories: “seen” and “unseen”. The first category includes instances where the model has been exposed to at least one of the target terms during training, while the second category involves instances where both target terms are new to the model. Table 5 reports the number and proportion of seen and unseen data across different groups within the BioWiC test set. Note that term pairs (the two target terms of each instance) and the sentences in the test set are unique and were not presented to the model during its training phase.
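A minimal sketch of this seen/unseen partition is given below, assuming the instance fields used earlier; lower-casing the terms for the comparison is an assumption of the sketch.

```python
def split_seen_unseen(test_instances, train_instances):
    """Divide test instances into 'seen' (at least one target term occurred
    in training) and 'unseen' (both target terms are new)."""
    train_terms = {inst["term1"].lower() for inst in train_instances}
    train_terms |= {inst["term2"].lower() for inst in train_instances}

    seen, unseen = [], []
    for inst in test_instances:
        if inst["term1"].lower() in train_terms or inst["term2"].lower() in train_terms:
            seen.append(inst)
        else:
            unseen.append(inst)
    return seen, unseen
```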

Table 5 Distribution of seen and unseen instances in different groups of BioWiC35 test set.

Table 6 shows the accuracy of the different models, fine-tuned on the BioWiC35 training set, when tested on the seen and unseen data sets. As we can see, the models exhibit a significant decline in performance, i.e., between 5% and 13%, when classifying unseen instances. Interestingly, models demonstrate improved performance on the unseen data in the “abbreviations” group, aligning with the notion that abbreviations are prevalent across contexts and models may possess prior knowledge in this regard. Overall, the findings suggest that there is substantial scope for improvement in this field, particularly as the performance of models decreases when encountering novel data.

Table 6 Comparative analysis of model accuracy on BioWiC35 test set.

Cross-domain analysis: We conducted additional experiments to assess the performance of language models when fine-tuned exclusively on data from the general domain, specifically WiC. The results indicate that all models experience a substantial decrease in performance when fine-tuned only with WiC data (Table 6). This highlights the importance of the training data provided by BioWiC35 in enhancing the ability of language models to represent different forms of concepts within the biomedical field. Furthermore, this suggests that the differences in terminology and linguistic patterns between the biomedical and general domains might be a reason why models fine-tuned on BioWiC exhibit superior performance.

Evaluating models’ upper bound: To assess whether state-of-the-art models have reached an upper bound on the BioWiC dataset, we leveraged two subsets of 100 instances from the BioWiC test set that were manually annotated by subject matter experts (see the cross-mapping validation section for more details). On the 50 instances annotated by both experts, we observed strong inter-annotator agreement (Cohen’s Kappa score = 0.84), confirming the quality of the dataset annotations. However, the best-performing model (Llama-2-70b) exhibited low agreement with the human annotators on this mutually annotated subset (Cohen’s Kappa scores of 0.35 and 0.36). The pattern of discrepancies between human and model annotations persisted across the two subsets of 100 instances (Cohen’s Kappa scores of 0.33 and 0.47 for annotators 1 and 2, respectively). These results highlight the substantial room for improvement of language models to represent contextualized biomedical terms.

Usage Notes

The primary objective of this study is to develop a novel biomedical dataset, BioWiC35, introducing unique challenges for biomedical concept representation. The complexity of biomedical language, with its abundance of polysemous terms, abbreviations, and acronyms, highlights the need for models to accurately disambiguate the intended meanings of terms based on the context in which they appear. We propose that BioWiC can serve as a robust benchmark, enabling NLP models to better understand the intended meaning of biomedical terms within their given textual context and to generate representations that precisely capture those intended meanings across different contexts. This enhanced contextual understanding is critical for several downstream NLP tasks in the biomedical domain, such as information retrieval, question answering, and machine translation, where accurately interpreting the meaning of terms within their specific context is essential for optimal model performance51.

The proposed benchmark has certain limitations that should be taken into consideration. The breadth of concept coverage is rather limited, as BioWiC35 only deals with a small subset of the concepts present in the biomedical domain, i.e., roughly 5’000 CUIs out of the 4.5 million CUI codes available in UMLS. Moreover, it may not be adequate for certain use cases that require specific coverage of concepts, e.g., genomics and proteomics. Additionally, our benchmark is currently designed to work with medical documents written in English only. Lastly, it is a static benchmark, in the sense that it does not currently provide a seamless platform (i.e., a web service) for users to contribute to it through crowd-sourcing. This limits the ability to keep the benchmark up-to-date and reflective of the latest developments in the biomedical domain. These limitations can be addressed in future versions of the benchmark.