Introduction

Gene mutations that confer drug resistance to pathogens are an emerging public health and medical threat with, additionally, potential consequences for current and future pandemic response. A step towards combating this scourge is to study the way in which particular mutations lead to viral drug-resistant phenotypes. Viral species hosted by an individual patient can contain multiple mutations that interact to produce specific phenotypes. To understand mutational interactions, the scientific literature can help in interpreting existing knowledge on the biological mechanisms associated with viral mutation-phenotype relationships. Here we explore the feasibility of a novel integrated approach that combines clinical and text mining data to gather existing knowledge and produce new insights on drug resistance based on data from viral species hosted by individual patients, and in particular for the case of HBV. Such approaches have the potential to be applied to other viral diseases, singularly due to the growing acknowledgement of the critical importance of viral mutation monitoring in patients.

Hepatitis B is an infectious disease that affects approximately 292 million people worldwide. Despite being a major global health concern, it has been estimated that only 10% of individuals who are chronically infected with HBV have been diagnosed, and only 5% of those individuals who are eligible for treatment receive an antiviral therapy1. Chronic HBV infection is known to progress to cirrhosis of the liver in up to 40% of untreated patients2; and in 2015, HBV led to an estimated 887,000 deaths, which were largely caused by cirrhosis of the liver and hepatocellular carcinoma (HCC)3. There are two classes of treatments that are known to be effective against suppressing HBV infections: interferons and nucleotide analogues. These treatments can reduce the patient’s viral load and the main impacts of the disease on the patient. However, although these treatments have been available for nearly two decades, they have not eliminated HBV4.

HBV is classified into ten HBV genotypes, A to J, each with a distinct geographic distribution5,6. The infectious HBV virion, also known as the Dane particle, is a spherical, double-shelled structure with a diameter of 42 nm7. This particle consists of an outer lipoprotein envelope and an inner nucleocapsid core, which encapsulates the viral genome6. The genome of HBV is approximately 3.2 kilobase pairs long and has a partially double-stranded circular DNA7. The viral genome codes for all five viral proteins required for HBV replication: the HBV surface antigen; the HBV core antigen; the HBV envelope antigen; the X protein; and the HBV reverse transcriptase/polymerase.

There are four known genes in the viral genome: C, X, P, and S. Gene C encodes the core protein, and it also produces the pre-core protein by promoting an upstream AUG start codon as its start codon8. Genes P and S encode for DNA polymerase and HBsAg, respectively. The S gene is an open long reading frame that is divided into three sections based on the start codons: pre-S1, pre-S2, and S9. The mechanism and function of the protein coded by gene X, HBV X antigen, remains largely unknown, but it is known to be associated with the development of liver cancer10.

Due to its high mutation rate, with commonly accepted rates of ~ 2.0 × 10–5 nucleotide substitutions per site per year11, HBV can become resistant to HBV drugs relatively quickly. With the emergence of drug resistance, patients may experience a notable compromise in the efficacy of anti-viral therapy and deterioration of the disease condition12,13. Thus, investigation of resistance mutations is crucial for the successful development of robust HBV drugs. Lamivudine (phosphorylated lamivudine) and adefovir dipivoxil were the first nucleotide analogues that were developed but had limited success because of the development of resistant variants. Resistance mutations pertaining to nucleotide analogues are usually reported in the polymerase gene of HBV DNA (i.e., gene P)14. The resistance phenotype is typically a consequence of nucleotide analogues failing to bind to DNA polymerase15.

Lamivudine and adefovir diphosphate are widely known for treating HBV16. After conducting randomized clinical trials of lamivudine, it was found that, compared with placebo, HBV mutant variants were associated with reduced sensitivity to lamivudine in approximately 30% of patients after only a year of treatment17,18. In addition, mutations causing resistance to adefovir dipivoxil were detected in up to 20 to 29% of patients after five years of treatment19,20. However, more-recently developed nucleotide analogues, such as entecavir and tenofovir disoproxil, have been reported to dramatically decrease the rate of development of resistance.

In the case of entecavir, only 1.2% of patients developed a resistant strain after five years of treatment if they had never been treated with nucleotide analogues previously21. For tenofovir disoproxil, there were no clinically significant resistant variants identified during up to seven years of follow-up22. Cross-resistance is also known to occur and is another concern for the development of drug resistance. For example, between lamivudine- and entecavir-resistant HBV strains, the cumulative probability of developing entecavir resistant variants is more than 50% for those individuals with lack of resistance toward lamivudine21.

In order to create new therapies that can circumvent or negate mechanisms of drug resistance, we need to understand resistance mutations better with the aid of analytics tools such as text mining. Text mining has been applied to extract evidence for generic drug resistance23,24, impact of mutations in disease25 and to harvest viral mutation data26. It has also been leveraged to create comprehensive databases of the viral mutation literature27,28. There is no report, however, that it has been used to produce new insights that support clinical data analysis of viral mutations with integration of literature data. This study aimed to build on a clinical study that ultra-deep sequenced HBV quasispecies from a European cohort29 by comparing the data from this clinical study against genetic mutation information automatically extracted from the scientific literature regarding all known launched anti-HBV drugs. Such data are typically scattered across many publications; therefore, we built a pipeline to mine literature sources systematically. In order to extract mutations from the literature we applied a rule-based text mining approach30,31,32. With our approach, we were able to produce a landscape of disease-specific viral mutations associated to drug resistance, which allowed us to derive new insights into mechanisms associated to drug resistance.

There exist manually curated databases33,34,35,36 that aim to gather HBV resistance mutations from patients' data. However, it is hard to evaluate their coverage since they do not give a complete list of resistance mutations and instead provide a mean to analyze the genetic sequences that the user has obtained. In addition, most of these databases are not open source. Hence, the use of text mining approaches is able to mine through the literature efficiently are essential to tackle the challenge of identifying known resistance mutations.

Methods

Code and supplementary information

The analysis was performed using standard Python packages (e.g., ElementTree37 for parsing XML-formatted literature, re for identifying information about mutations, and pandas for data analysis). The resulting code and supplementary information (Supplementary Figures S1S15, Tables S1S5, and Note S1) is available at: https://github.com/angoto/HBV_Code.

Source of data

Mining of the scientific literature was conducted by using all 2,472,725 XML files from PubMed Central (commercial and non-commercial use), which consisted of information about each publication such as title, authors, DOI, abstract, full body content, and references38. Clinical data for genetic mutations of HBV was collected from a total of 186 plasma samples from a Western European cohort of chronic HBV patients during the period from 1985 to 201229. HBV DNA was extracted from 200 μL of plasma using the QIAamp MinElute Virus Kit (Qiagen) or the Roche MagNA Pure LC instrument according to the manufacturer’s instructions. These samples are stored at the Erasmus University Medical Center in Rotterdam, the Netherlands. The guidelines followed in the study were in accordance with the Declaration of Helsinki39 and the principles of Good Clinical Practice40. The study was approved by the ethical review board of the Erasmus Medical Center, Rotterdam, the Netherlands. The clinical data used in this study was anonymized and permission was not required to access the raw data of the clinical study. The dataset used to build the anti-HBV drugs’ vocabulary included 69 launched anti-HBV drugs obtained from the Cortellis Drug Discovery Intelligence database (CDDI), formerly known as Integrity41. We extracted each drug’s code name, generic name, brand name, and drug name from the CDDI database to enrich this vocabulary, which gave 312 unique drug expressions.

Scientific literature mining

The workflow in Fig. 1 was used to search through the sentences from journal articles in PubMed Central (PMC) that mentioned both an HBV drug and a mutation. In Step 2, we have used GNU parallel in order to select relevant literature that refer to ‘hepatitis b’ and/or ‘hbv’ (both keywords are case insensitive). In Step 4, a regular expression was used to split sentences from PMC and mutations were identified by using regular expressions for HBV mutations, as described in the next section. We avoid any problems with abbreviations that could lead to false positives, such as confusing the viral gene “C” with the element “C”, by focusing our search for specific mutations found in the clinical study, and ensuring co-occurrence of HBV drugs in the same sentence. By basing our rules on co-occurrence relationships between drug and mutation, our approach is generalizable to new data, and new diseases. The output data from this workflow was later used to conduct the data analysis (i.e., Step 5 in Fig. 1). Further details of Steps 1–5 are available at: https://github.com/angoto/HBV_Code.

Figure 1
figure 1

Workflow for mining the scientific literature.

Regular expressions for HBV mutations

Mutations relevant to HBV were found from articles in PubMed Central by combining regular expressions.

Use of clinical study data

Six different pieces of data from the Western Europe cohorts described in Mueller-Breckenridge, et al.29 were used to conduct the analysis: sample, reference genome, gene, effect, amino acid variant, and nucleotide variant. “Sample” identifies the unique patients’ number used during the clinical study. Reference genome refers to a genotype-specific reference sequence (namely, GenBank accession numbers AF090842, AB033554, AB033556, AF121240, and AB032431, for genotypes A to E, respectively) that was used to map the outputs generated from quality-trimmed demultiplexed FASTQ reads29. “Gene” indicates four known genes encoded by the genome: C, P, S, and X. “Effect” refers to whether a particular mutation was either an intragenic or a missense variant. For this report, we have only considered missense variants because intragenic variants from the literature were not associated with the clinical study’s phenotypes. “Amino acid variant” (n = 7285) and “nucleotide variant” (n = 10,658) are two different ways to describe the same mutation, i.e. mutation in the amino acid (e.g. p.Pro156Ser) and nucleotide (e.g. c.314C > T) sequences, respectively. Although it is relatively straightforward to obtain HBV-related genetic mutations from the literature, it is difficult to map the position number of nucleotides to the position number of amino acids, and vice-versa. This is because we are unsure about the reference sequence that was used to report these mutations for each journal article in PubMed Central. In order to do this type of mapping with our current approach, we would need to manually go over each journal article that mentioned both an HBV drug and a mutation to conduct the translation of DNA sequences to amino acid sequences because an ambiguous inverse mapping would be impossible (except for Tryptophan). Hypothetical clinical study data is available on the GitHub repository for this study42 to provide a better idea of the data structure used in the clinical study.

Comparison of scientific literature and clinical study

The workflow in Fig. 2 was used to compare mutations from the clinical study with the genetic mutations found in the scientific literature downloaded from PubMed Central after translating the mutations from the literature to the format used in the clinical study: amino acid (category vi) and nucleotide (category ii) variants (n = 4214).

Figure 2
figure 2

Normalization of DNA and amino acid mutations found from text mining the literature, to enable the direct comparison with clinical study mutations.

Mutation hotspot for entecavir

We independently built our own models of entecavir bound with HBV RT and DNA, using the X-ray crystal structures of HIV RT also bound to entecavir and DNA, from PDB entries 5XN143 and 6IKA44. Homology modelling was performed using SWISS-MODEL45, and UniProt ID Q9WRJ946 residues 349–692 for the sequence of HBV RT genotype A. The resulting model underwent energy minimization using the open-source build of PyMOL version 2.3.047, relaxing the entecavir ligand and all residues and bases within 4 Å of entecavir.

Results

Comparison between the clinical study and the literature

The number of publications found to have a sentence co-occurrence of an approved HBV drug and a mutation in the same sentence was 30,686. There were 4,214 unique mutations mentioned in those sentences and, 7.5% of those, a total of 316 mutations (i.e., 254 amino acid and 62 nucleotide variants) were also found in the clinical data from Western Europe cohorts (n = 182 patients, 31,977 mutations). After analyzing the clinical data using solely those unique mutations that were common between the clinical study and the literature, we found that 180 out of 182 patients (corresponding to 1750 amino acid and 302 nucleotide variants) had mutations that also appeared in the literature.

Prevalence of genetic mutations in the clinical study

Figure 3 shows the ten most frequent amino acid (left) and nucleotide variants (right) in the clinical study that were also found in the literature. Their prevalence ranged between 1 and 52 patients for amino acid variants and between 1 and 78 patients for nucleotide variants. Supplementary Figures S1S2 show the frequency of mutations in the clinical data (i.e., number of patients reported to have each mutation type).

Figure 3
figure 3

Top 10 most frequent amino acid mutations (left) and nucleotide mutations (right) in the clinical data that were also found in the literature.

Mutation likelihood of patients in the clinical study

The 180 patients presented between 16 and 753 HBV mutations. Out of these mutations, between 1 and 35 appeared in the literature (Supplementary Figure S3-1), with an average of 11.2 mutations and a median of 10 mutations. Supplementary Figure S3-2 shows the patients in the clinical study who had the most mutations matching those found in the literature.

Mutation hotspots for HBV genotypes in the clinical study

We identified amino acid mutations in the four genes X, P, C, and S, across five HBV genotypes, A, B, C, D, and E, that emerged during the clinical study. There were 182 patients in the clinical study, of which there were 56 patients with genotype A, 19 with genotype B, 43 with genotype C, 62 with genotype D, and 2 with genotype E. In order to create a mutation hotspot map for each of these genotypes (Supplementary Figures S4S8) where there was at least one report in the literature of that mutation, overlaps between the mutations from the literature and the clinical study mutations were identified. It should be noted that there are mutations in the patients that were present in the clinical study but were not found in the analysis of the literature, and hence are not shown in these heatmaps.

We defined a “mutation hotspot” as a mutation with a count above the average for that particular gene and genotypes. For example, for HBV RT genotype A, mutation V214A had a count of 1 patient, while the average of the counts was 5.18 patients, so this was excluded from the map; while mutation L217R had a count of 30 patients, which was above the average for HBV RT genotype A, and thus was included as a mutation hotspot. For nucleotide variants, similar plots were made based on the genotypes (A-E), with mutations sorted in order of position number (Supplementary Figures S9S13). Figures 4, 5, 6 represent heatmaps of the mutational hotspots that we identified for eight different gene products (i.e., polymerase, reverse transcriptase, X, precore, core, PreS1, PreS2, and HBsAg) and nucleotide variants, with darker, more saturated colors indicating more mutations at that position in that genotype. The summary tables of hotspots for each HBV genotype for amino acid and nucleotide variants are shown in Supplementary Tables S1S5.

Figure 4
figure 4

Heatmap of mutation hotspots for gene P in genotypes A-E: (a) HBV polymerase, and (b) HBV reverse transcriptase (RT) and heatmap of mutation hotspots for gene C in genotypes A-E: (c) HBV precore, and (d) HBV core. The white regions represent counts of mutations that are null, while dark blue (in a and b) or dark green (in c or d) indicates the largest number of mutations.

Figure 5
figure 5

Heatmap of mutation hotspots for gene S in genotypes A-E: (a) HBV PreS1, and (b) HBV PreS2; heatmap of mutation hotspots for: (c) HBV HBsAg (gene S), and (d) HBV gene X in genotypes A-E. The white regions represent counts of mutations that are null, while dark red indicates the largest number of mutations.

Figure 6
figure 6

Heatmap of mutation hotspots for HBV nucleotide variants in genotypes A-E. The white regions represent counts of mutations that are null, while dark blue indicates the largest number of mutations.

Our analysis revealed there were “genotype-common” mutation hotspots, i.e., mutations with the same position number, in two or more genotypes (Figs. 4, 5, 6), above the average count of patient. These were: polymerase at position K293; reverse transcriptase at positions S213, S219, D263, and Q267; PreS1 at positions S5, F141, and N214; PreS2 at position R16; HBsAg at positions E2, L77, K122, P127, M133, Y161, W182, A184, S204, S210, and F220; X at position P33; precore at positions G29, L66, and S78; core at positions P5, T12, L60, and T147; and nucleotide variants at positions 34A, 120A, and 389A.

Thus, although we were unable to discriminate HBV drug-resistant mutations unambiguously from other mutations, we were able to identify common mutations in the clinical study and the literature, across genotypes A-E. This can further our understanding of HBV mutations by highlighting mutations that may be potentially relevant to resistance.

Mutations occurring in disulfide bonds

We investigated the number of common mutations between the clinical study and the literature that arose from mutating wild type cysteine residues. Cysteine residues can be responsible for the formation of disulfide bonds, which play an important role in folding and stability of the proteins48. The HBV envelope proteins, which corresponds to gene S, are known to form an intermolecular disulfide network through cysteine residues in the cysteine-rich antigenic loop that are in positions 102 to 16149,50,51. Hence, mutations for cysteine residues reported in Table 1 for positions C125 and C149 are in fact in regions of disulfide bonds. For gene C, the cysteine residue in position C48 is expected to form a disulfide bond with C14952. We were unable to find any evidence of a disulfide bond involving residue C1 in gene S.

Table 1 Cysteine residues in genotypes A-E for common mutations between the clinical study and the literature with corresponding genes labelled in parentheses (i.e., gene S, and gene C).

Therefore, by comparing the literature and the clinical study, we were able to identify which gene, and in what particular position, mutations occurred at disulfide bond. We found mutations located at C124 and C149 in gene S to occur in regions of disulfide bonds, which are likely to have an effect towards destabilizing the HBV proteins. Table 1 shows the list of cysteine residues in each genotype, A-E, for mutations that were common between the clinical data and the literature.

Mutation hotspot for entecavir

Based on a search conducted using DrugBank16, one out of 69 approved drugs listed in the CDDI database had a modelling study published with a known target and binding site53. By using our own models of entecavir bound with HBV RT and DNA, we identified which of these binding pocket residues were common between the clinical data and the scientific literature. For entecavir, we found common mutations located at I169, M204, and N238 for genotype A (within 12 Å of entecavir in our model).

Figure 7 shows the binding pocket of entecavir, together with the binding pocket residues within 5 Å of entecavir. The original modelling study by Langley, et al.53 used a sequence alignment between HIV and HBV reverse transcriptase to determine the most conserved domains, and they used the HIV reverse transcriptase DNA X-ray structure (PDB ID 1RTD)54 to build the HBV RT model.

Figure 7
figure 7

Model of HBV RT (genotype A) based on the HIV RT-DNA-entecavir complex X-ray crystal structure, PDB ID 5XN143. (a) Entecavir-triphosphate is shown as spheres with teal-colored carbon atoms, red oxygen, blue nitrogen, and orange phosphorus atoms. Entecavir is at the 3′ end of the DNA strand (orange cylinders). The secondary structure of the HBV RT model is shown as white coils (\(\alpha \)-helices), white arrows (\(\beta \)-strands), and white loops. (b) Close-up view of the binding pocket, showing entecavir-triphosphate with teal-colored carbons, and a Mg2+ cation as a green sphere. Note that hydrogen bonds and metal bonds are shown as light blue dashed lines. The deoxyguanine (dG)-moiety of entecavir can be seen “base-pairing” with deoxycytosine (dC) in a second strand of DNA.

We also investigated the location of the mutational hotspots we identified for HBV RT in genotype A and mapped these to our model. It can be seen from Fig. 8 that the mutations occur throughout the structure.

Figure 8
figure 8

Locations of the mutations in HBV RT found both in patients in the clinical study, and in our analysis of the literature, shown as spheres placed at the α-carbon atom of the amino acid. The pink spheres are mutations with above-average counts of patients, and are at positions 129, 163, 213, 217, 219, 263, and 267; while the light blue spheres indicate positions of mutations below the average threshold. The secondary structure of our model of the HBV RT is shown as a white cartoon, while the DNA backbone is in orange. The location of the DNA, entecavir-triphosphate (shown with teal carbons), and Mg2+ (green sphere) was modelled on PDB ID 5XN143.

Monitoring of drug resistance driven by viral mutations is relevant for antiviral drugs beyond those used for HBV, such as human cytomegalovirus (HMCV)55, HIV56, hepatitis C virus57, influenza virus58, SARS-CoV-259 and more. Hence, the method proposed could be useful for a range of antiviral drugs and associated diseases.

Discussion and conclusions

This study showed the feasibility of integrating clinical and text mined viral mutation data and its potential to produce new insights on disease-relevant viral mutations. The focus of the study was on HBV, but the methods are applicable to other viruses. While mutational hotspots can be identified by considering only the clinical study’s mutations, the difficulty associated with this process is that, even when new mutational hotspots are identified through the study, they are often disregarded because there are no literature reports available or there are not enough resources to comb the literature to confirm its role in resistance mutations. Thus, it is typical for clinical studies to report only those mutational hotspots that have been reported in the past. Hence, the methodology presented in this study serves as a bridge to fill in that gap by leveraging the full text of 2.47 million articles from PubMed Central to produce a landscape of disease-specific viral mutations. Although it is not possible for this method to confirm a direct correlation between the mutational hotspots and resistance mutations, it is able to provide new hypotheses that are potentially relevant to the resistance phenotype.

Our study identified known hotspots that were found in four genes, P, C, S, and X, in genotypes A to E, as shown in Figs. 4, 5, 6. There are other HBV mutational hotspots that are known such as L180, which is a compensatory mutation for resistance to entecavir, lamivudine, and telbivudine60; rt202 which causes resistance to entecavir in Asian population61; and rt236 which is responsible for resistance to adefovir dipivoxil in Caucasian62. Hence, the list of mutational hotspots that were identified in this study was not necessarily conclusive. In addition, two amino acid positions (i.e., C124 and C149) exhibited mutations in disulfide bonds, which are likely to impact the structure of HBV proteins. This finding is confirmatory since these regions of the disulfide bonds have already been identified in the past49,50,51. Moreover, we identified a mutation in a position that was relevant to the binding site for the anti-HBV drug, entecavir. However, the I169 mutation, which is known to be the primary mutation responsible for entecavir’s resistance was found to be reported below the average threshold when comparing mutational hotspots between the clinical study and the literature63. Therefore, further refinement of the definition of mutational hotspots used in this study may be necessary in the future.

One limitation of our approach lies in terms of the amount of data we collected from the literature. This resulted in an overall coverage of the number of genetic mutations reported in the literature against the common mutations of the literature and the clinical study of 7.5%. In order to further increase the coverage of genetic mutations data, literature sources could be increased to include repositories from publishers such as ScienceDirect64 and Springer Nature65. We could also obtain additional data from the PubMed Central literature by mining the information included in tables and figures. It is also important to be aware of errors that may arise during DNA sequencing and patient data collection during the clinical study.

In addition, we considered the rate of false positive identifications of mutations. An example of a false positive would be the following sentence: “Some non-resistance-associated mutations of rtD134N (ranging from 20.33 to 74.63%), rtL145M (ranging from 2.83 to 78.82%), rtF151Y (ranging from 2.92 to 75.51%) and rtS223A (ranging from 5.77 to 18.44%) increased significantly with ADV monotherapy, then declined with the addition of LdT”66. In this sentence, it satisfies the criterion for co-occurrence of an HBV drug and a mutation, but it is in fact not referring to resistance mutations. False positives could also be considered hypothetical mutations, such as the one described in the sentence: “The lamivudine resistance-linked G529A (rtD134N) site in HBV was found to be associated with HCC outcomes, which implied potential correlation between resistance to the anti-HBV nucleoside analog lamivudine and HCC prognosis”67. Furthermore, it is possible that some drug-resistant mutants involve multiple locations, so in the future we must consider such combinatorial possibilities as well. To estimate the false positive rate, we selected 50 random sentences with co-occurrence of drug and mutation and inspected them manually. It was found that two of them were false positives. Hence, we estimate the rate of false positives to be approximately 4% of the overall data.

To improve the methods used here, additional text mining strategies could be used to improve the extraction, such as using additional synonyms from UMLS and SNOMED for both the drugs and the disease, which would decrease the number of false negatives. By using machine learning algorithms, thus we could also increase our confidence that sentences describe HBV-related information. For example, the following sentence refers to two mutations, M204I and M204V, which are not captured by simple regular expressions: “LVDr is well characterized and arises through replacement of M204 within the YMDD motif of the HBV RT with isoleucine or valine, with or without the adaptive change L180M”46. By performing the literature mining in this manner, we could be able to develop a more powerful text mining tool that would allow us to identify a more comprehensive set of resistance mutations. However, it is important to note that, while more-recently published mutation-extraction approaches are machine-learning based68,69, these require costly gold standard corpora, which hinders their re-application to different viral species, each with its own notation particularities. Thus, the development of machine-learning algorithms may lead to improved performance but could result in more restricted application to a particular disease.