Introduction

Over the last decade, next-generation sequencing has revolutionized the field of rare diseases. After an extensive effort by researchers to identify new genes responsible for human diseases, clinical whole-exome sequencing (cES) is now used in the clinical setting for heterogeneous and rare genetic disorders. It leads to a positive diagnosis for about 30% of individuals with intellectual disability (ID) and/or multiple congenital anomalies (MCA) [1,2,3,4,5]. Around two-thirds of these patients therefore remain without a molecular diagnosis after cES. This diagnostic yield is limited by the stringent criteria of the ACMG (American College of Medical Genetics and Genomics), which recommends restricting variant interpretation to the genes responsible for human diseases [6], most of which are reported in the OMIM database; this restriction is routine practice in a large number of diagnostic laboratories.

The exponential increase of scientific and genomic knowledge means that new genes are regularly linked to rare diseases with ID and/or MCA, which has resulted in ~30–60 new entries and 300–900 updated entries per month in the OMIM database (https://www.omim.org/statistics/update, 20th February 2019) [7]. The need for periodic reanalysis of ES data in undiagnosed patients therefore appears obvious. A recent reanalysis study, which diagnosed 10–15% more cases, endorsed systematic diagnostic reanalysis or reanalysis upon patient request to increase diagnostic yield [8]. However, in the vast majority of cases, successive cES analyses of OMIM disease-causing genes are inadequate. Such analyses could be enriched by the significant number of disease-causing genes published in the scientific literature and not referenced in the OMIM database. Indeed, with more than 250 genes associated with human diseases and more than 9200 novel pathogenic/likely pathogenic variants reported annually, knowledge is moving faster than the OMIM database can be updated, and a number of diagnoses are consequently missed [3, 8,9,10,11,12]. Moreover, information about variants has been updated in databases, such as OMIM or ClinVAR, because the reanalysis of ES data results in the reclassification of previously described variants. These changes mainly concern variants of uncertain significance reclassified as affecting function or as benign [3].

The reanalysis and reinterpretation of ES data in a research setting are made possible by a wide variety of tools and databases, and by the abundant knowledge available in the scientific literature. Some resources can easily be integrated into the data workflow to help with the interpretation of ES data, including in silico predictive scores (PolyPhen, CADD, Grantham, GERP, http://mendel.stanford.edu/SidowLab/downloads/gerp/, SIFT, http://sift.jcvi.org/, and pLI), allele frequency in the population (EVS, http://evs.gs.washington.edu/EVS/, ExAC, and GnomAD), or in silico tools for splicing defects (HSF, http://www.umd.be/HSF/) [13,14,15].

A few studies that have used a range of currently available tools to reanalyze cES data in the research environment have successfully increased diagnostic yield [3, 8, 16]. The diagnostic yield was dependent on the strategy and varied widely from study to study. Nambot et al. performed a systematic annual reanalysis of singleton ES and obtained 15% more diagnoses (24 individuals) from new and recently published genes. The combined results of these studies indicated that the OMIM database is slow to be updated and clearly demonstrated the drawbacks of restricting analyses to OMIM disease-causing genes. Another effective strategy for accelerating the identification of new disease-causing genes is the use of trio-based ES. Eldomery et al. had a significantly higher diagnostic yield using trio-ES data in research analysis, with more than 50% of likely contributory genes, including candidate genes without recurrence [16]. These studies show the clear advantage of extending exploration to non-OMIM disease-causing genes, though further analyses will be essential to confirm the preliminary results [17].

In this study, singleton-ES data from 313 individuals with ID/MCA and negative cES were reanalyzed in a research setting. Our findings supplement a previously published study in the diagnostic setting [8]. We discuss the potential gains and consequences of reanalyzing ES data in individual care and for scientific knowledge.

Patients and methods

Patients

Singleton-ES data were obtained from a cohort of 313 unrelated patients, from 313 families, referred to the Reference Center for Congenital Anomalies and Malformative Syndromes in Dijon (France), or to the Orphanomix units for genetic testing located in several hospitals in France. These data were reanalyzed in a research laboratory between July 2013 and December 2017 (Fig. 1a). The local ethics committee approved this study.

Fig. 1

Strategy of ES data analysis, databases, and tools used for variant interpretation in a research environment, and the global results of this study. First, variants were filtered on the suspected mode of inheritance. Then, an extensive search of the literature and public databases, combined with in silico scores that help with interpretation or indicate a possible link with the disease, allowed candidate genes to be selected. Finally, depending on the parental segregation, the variant can be shared on international platforms to identify additional affected cases and confirm or rule out the candidate gene

Whole-exome sequencing research analysis and interpretation

Each research analysis was done immediately after the cES results were obtained. Rapid deployment is facilitated by our local translational integrative organization that combines a unit focused on diagnostic innovation and a research team. ES data are typically analyzed in two steps: (1) diagnostic interpretation restricted to disease-causing genes reported in the OMIM database, and (2) reanalysis in a research environment, including all detected variants. In each step, a multidisciplinary team is involved in the interpretation of ES data, and some experts take part in both steps.

We used all of the ES data initially produced for the singleton cES analysis [8]. Reads had been aligned to a human genome reference sequence (GRCh37/hg19) using BWA (Burrows–Wheeler Aligner; v0.7.15). All aligned reads underwent the same procedure: (i) duplicate paired-end reads were removed with Picard 2.4.1, and (ii) indel realignment and (iii) base quality score recalibration were performed with the Genome Analysis Toolkit (GATK; v3.7). Variants with a quality score >30 and an alignment quality score >20 were annotated with SeattleSeq SNP Annotation (see Web resources). The annotation includes data from several public databases, such as HGMD (http://www.hgmd.cf.ac.uk/), ClinVAR (https://www.ncbi.nlm.nih.gov/clinvar/), and COSMIC (https://cancer.sanger.ac.uk/cosmic). CNVs were detected with XHMM software (https://www.atgu.mgh.harvard.edu/) and annotated using the chromosomal coordinates of coding exonic sequences on the human genome (https://www.ncbi.nlm.nih.gov/refseq/). Variants present at a frequency above 1% in DGV, GnomAD Browser, ExAC Browser, and the NHLBI GO Exome Sequencing Project, or present in 100 local exomes of unaffected individuals, were excluded (see URL).
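The filtering thresholds described above can be illustrated with a minimal sketch. It is not the pipeline used in this study: the `Variant` record, its field names, and the example values are hypothetical, and only the thresholds (quality score >30, alignment quality score >20, exclusion above 1% population frequency, exclusion when seen in local control exomes) are taken from the text.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    """Hypothetical record; field names are illustrative, not SeattleSeq output."""
    gene: str
    quality: float            # variant quality score
    mapping_quality: float    # alignment quality score
    max_population_af: float  # highest allele frequency in public databases (0-1)
    in_local_controls: bool   # seen in the local exomes of unaffected individuals


def passes_filters(v: Variant) -> bool:
    """Apply the thresholds stated in the text."""
    # Keep only variants with quality > 30 and alignment quality > 20.
    if v.quality <= 30 or v.mapping_quality <= 20:
        return False
    # Exclude variants present at a frequency above 1%.
    if v.max_population_af > 0.01:
        return False
    # Exclude variants observed in local unaffected exomes.
    return not v.in_local_controls


# Made-up example variants, one per filtering outcome.
variants = [
    Variant("GENE_A", 55.0, 60.0, 0.0001, False),
    Variant("GENE_B", 25.0, 60.0, 0.0, False),   # low variant quality
    Variant("GENE_C", 90.0, 60.0, 0.05, False),  # too common in the population
    Variant("GENE_D", 90.0, 60.0, 0.0, True),    # present in local controls
]
kept = [v.gene for v in variants if passes_filters(v)]
print(kept)
```

In this sketch a single maximum frequency stands in for the separate lookups in DGV, GnomAD, ExAC, and the Exome Sequencing Project; a real workflow would query each resource.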

In the reanalysis for research purposes, we extended variant interpretation to genes not associated with human disease in the OMIM database. We also extended the analysis to atypical phenotypes unrelated to well-known genes in the OMIM database, prioritizing (i) biallelic variants, (ii) truncating and splice site variants, (iii) homozygous missense variants and in-frame deletion/insertion, and (iv) heterozygous missense variants and in-frame deletion/insertion. We first looked at genes recently recognized in the literature as disease-causing but not listed in OMIM, and then turned our attention to genes not yet known to cause human disease. To help with the interpretation, we used public databases listing variants or genes previously reported in human diseases, animal models (such as mouse, zebrafish, or rat), gene expression data, and the impact of the variant on protein structure or function, protein interactions, or signaling pathways. We also used bioinformatics tools, including predictive scores of pathogenicity, conservation, or impact of missense and truncating variants (Fig. 1). We performed a systematic review of the literature to identify isolated cases, recent descriptions of new genes, or functional data. In this study, cES data were analyzed by two experts, and all of the results and candidate genes were presented and discussed in a multidisciplinary meeting.

Variant validation and parental segregation

Candidate variants and parental segregation were confirmed by Sanger sequencing for SNV and quantitative PCR for CNV.

Genomic DNA was amplified by polymerase chain reaction (PCR) for Sanger sequencing, using the HotStarTaq PCR kit (Qiagen) according to the manufacturer's protocol. PCR products were purified with the Agencourt CleanSEQ system (Beckman Coulter) and sequenced with the BigDye Terminator Cycle Sequencing kit, v3.1 (Applied Biosystems) in an ABI 3730 sequencer (Applied Biosystems). Sequence data were analyzed with Mutation Surveyor v4.0.9 (SoftGenetics).

Genomic DNA was amplified for quantitative PCR with SYBR Green Master Mix kit according to the manufacturer's protocol.

Data sharing

For each candidate variant in a candidate gene, whether unknown in rare diseases or known but with a new phenotypic presentation, we actively searched for additional similar patients to confirm the genotype–phenotype correlation (Fig. 1). Multiple strategies were used to improve data sharing: exchanging information by e-mail, presenting case-report results at international congresses, scrutinizing large series of ES studies in the literature or in public databases (i.e., DECIPHER, https://decipher.sanger.ac.uk/, denovo-db, http://denovo-db.gs.washington.edu/denovo-db/, and ClinVAR, https://www-ncbi-nlm-nih-gov.gate2.inist.fr/clinvar/), and using international data-sharing platforms like GeneMatcher (https://www.genematcher.org/). All variants identified in new disease-causing genes have been submitted to the ClinVAR database (SUB2871008, SUB2871014, SUB3604471, and SUB3731210).

Results

The cohort included 241/313 children (77%), with an average age of 9 years, and 72/313 adults (23%); 185/313 individuals were male (59%) and 128/313 were female (41%). Eleven patients were born to reported consanguineous parents. The ethnic origin was European (93.2%), African (6.5%), or Asian (0.3%). The individuals presented with isolated ID/epileptic encephalopathy (EE) (49/313 individuals), ID with MCA (93/313 individuals), or MCA without ID (171/313 individuals) (Fig. 1). MCA included abnormality of brain morphology (HP:0012443) (11%, 29/264 individuals), abnormality of calvarial morphology (HP:0002648) and abnormality of the face (HP:0000271) (57%, 150/264 individuals), abnormality of skeletal muscles (HP:0040290) (49%, 129/264 individuals), abnormality of the skin (HP:0000951) (20%, 53/264 individuals), abnormality of the gastrointestinal tract (HP:0011024) (17%, 45/264 individuals), abnormality of the genitourinary system (HP:0000119) (15%, 40/264 individuals), abnormality of the cardiovascular system (HP:0001626) (19%, 50/264 individuals), and abnormality of the respiratory system (HP:0002086) (13%, 34/264 individuals). Neurological features (142/313 individuals) included seizures (HP:0001250) (33%), global developmental delay (HP:0001263) (59%), ID (HP:0001249) (72%), and autism spectrum disorder (HP:0000729) (13%). Patients also had some abnormality of the endocrine system (HP:0000818) (6%), immune system (HP:0002715) (3%), or unspecific metabolic anomalies (3%). Detailed anonymized clinical data are available in the PhenomeCentral database using the standardized HPO (Human Phenotype Ontology) terms (https://www.phenomecentral.org/).

All individuals had received a negative or non-conclusive result after array-CGH and singleton cES analysis in the diagnostic setting. Of the 313 individuals, 133 were cases without a molecular diagnosis who had been the focus of a previous study, and 14 were individuals with a candidate gene previously reported through an additional research strategy [8].

After Sanger validation and parental segregation, the singleton-ES research analysis identified 84 candidate variants in 66 different genes in 73/313 individuals (23%) (Supplementary Table 1 and Table 1). These 84 variants included 55 missense variants (66%), 22 truncating variants (27%), 5 variants predicted to affect a splice site (6%), and 1 CNV (1%). The 84 variants were mainly de novo autosomal dominant (40%) or recessive (29%), and included 16 homozygous and 9 compound heterozygous variants. Of the 7.3% of X-linked variants, three occurred de novo and three were inherited from an unaffected mother. Three autosomal-dominant variants were confirmed by Sanger sequencing, but their parental segregation is unknown because parental DNA was not available (Fig. 2, Supplementary Table 1 and Table 1).

Table 1 Summary of the 26 candidate genes identified by research ES reanalysis in this study
Fig. 2

Distribution of the mode of inheritance and type of variant in validated disease-causing genes (a) or in candidate genes (b)

All 66 candidate genes were shared with the national and/or international scientific and medical communities. We submitted the individual genotypes, mainly new candidate genes or genes with low recurrence and/or associated with an atypical phenotype, to the GeneMatcher platform. A match was found for 16 submitted genes. Our data-sharing strategies confirmed 40/66 genes as responsible for disease (Supplementary Table 1), and the remaining 26/66 are currently considered candidate genes (Table 1) because there was no or insufficient recurrence (Fig. 2). The recurrence needed to confirm the involvement of a gene in a disease is estimated at three unrelated cases with homozygous or compound heterozygous variants for autosomal-recessive phenotypes, and five affected cases with heterozygous variants for autosomal-dominant phenotypes [18]. In the absence of sufficient recurrence, the gene remains a candidate with a variant of uncertain significance. Many variants affecting function were missense variants (57%), most frequently identified in the ID/EE cohort (75%) (Figs. 2a and 3a). The autosomal or X-linked sporadic variants appear to affect function, while half of autosomal-recessive or X-linked inherited variants remain variants of uncertain significance (Figs. 2a and 3d). The vast majority of nonsense and truncating variants (about 75%) were confirmed as responsible for disease, and around 40% of missense and splice site variants remained candidates (Fig. 3e). Data sharing resulted in national and international collaborations for 21 genes (27 patients), 15 of which were used for functional studies. Our results have led to 15 scientific publications, and 12 papers are currently in progress, 6 of which are being led by our team [19,20,21,22,23,24,25,26,27,28,29,30,31,32,33].
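The recurrence rule quoted above from [18] can be written as a small decision function. This is only a sketch of that rule, not the classification procedure used in this study: the function name, the inheritance labels, and the returned strings are hypothetical, and X-linked inheritance is left out because the text gives no threshold for it.

```python
# Minimum number of unrelated affected cases quoted in the text [18].
REQUIRED_UNRELATED_CASES = {
    "autosomal_recessive": 3,  # homozygous or compound heterozygous variants
    "autosomal_dominant": 5,   # heterozygous variants
}


def classify_gene(inheritance: str, unrelated_cases: int) -> str:
    """Return 'disease-causing' when recurrence reaches the quoted threshold."""
    if inheritance not in REQUIRED_UNRELATED_CASES:
        raise ValueError(f"no recurrence threshold stated for {inheritance!r}")
    if unrelated_cases < 0:
        raise ValueError("unrelated_cases must be non-negative")
    threshold = REQUIRED_UNRELATED_CASES[inheritance]
    return "disease-causing" if unrelated_cases >= threshold else "candidate"


print(classify_gene("autosomal_recessive", 3))
print(classify_gene("autosomal_dominant", 4))
```

With these inputs, the recessive gene reaches its threshold of three cases, while the dominant gene with four cases stays a candidate.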

Fig. 3

Distribution of the type of variants (a), classification of identified genes (b), and conclusion (c) in MCA and ID/EE cohorts. Percentage of inconclusive results with variants of uncertain significance or positive diagnoses by mode of inheritance (d) or type of variant (e)

Overall, we confirmed the involvement of 40 different genes in human disease in 48/313 individuals (15%). These 48 cases provided a molecular diagnosis for 30% of our MCA and ID cases, 19% of our isolated ID/EE cases, and 6% of our MCA without ID cases. The 40 disease-causing genes can be classified as (1) 17 new genes unknown in human disease; (2) 22 previously published genes with low recurrence and not yet referenced in the OMIM database (https://www.omim.org/); (3) 1 established OMIM gene with a new previously unrelated phenotype (Fig. 2a).

The results were of uncertain significance in 26/313 individuals (8%), who made up 7% of our MCA and ID cases, 10% of isolated ID/EE cases, and 8% of the MCA without ID cohort (Fig. 3c). The 26 genes remain candidates because of the lack of recurrence (Table 1). The variants in these genes were mostly missense (81%), illustrating how difficult it is to interpret the functional impact of missense variants from parental segregation and in silico scores alone, in the absence of recurrence after data sharing (Fig. 2b).

Discussion

Research reanalysis after negative singleton cES has demonstrated its ability to rapidly improve diagnostic yield and scientific knowledge. Using this approach, we confirmed the involvement of 40 different disease-causing genes in 15% of individuals with negative cES. We were thus able to increase the positive yield from 30% with diagnostic cES analysis alone to 40% with extended singleton-ES reanalysis in a research setting.
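The step from 30% to about 40% follows from simple arithmetic, sketched below. The sketch assumes that the 30% initial yield applies to the cohort from which the 313 cES-negative individuals were drawn; the function and variable names are illustrative.

```python
def combined_yield(initial_yield: float, yield_in_negatives: float) -> float:
    """Overall yield when reanalysis is applied only to initially negative cases."""
    return initial_yield + (1 - initial_yield) * yield_in_negatives


initial = 0.30         # positive diagnoses with diagnostic cES alone
reanalysis = 48 / 313  # diagnoses among cES-negative individuals (about 15%)

added = (1 - initial) * reanalysis  # extra diagnoses as a share of the initial cohort
total = combined_yield(initial, reanalysis)

print(f"additional diagnoses in the initial cohort: {added:.1%}")
print(f"combined yield: {total:.1%}")
```

Under this assumption the reanalysis adds close to 11% of diagnoses in the initial cohort, which brings the overall yield to roughly 40%.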

Identification of new disease-causing genes improves scientific knowledge

Analyzing singleton-ES data for research purposes provides a valuable opportunity to identify new disease-causing genes. We identified 17 new genes in 21 individuals (Supplementary Table 1). Among these 17 genes, 9 presented autosomal-recessive truncating and/or missense variants and 7 presented de novo missense variants (Fig. 2a). Five genes were selected because of de novo truncating variants in a gene highly intolerant to loss of function (pLI = 1 in the ExAC database; http://exac.broadinstitute.org/; http://gnomad.broadinstitute.org/). Of the 17 new genes, 11 were confirmed as involved after significant recurrence through data sharing, via GeneMatcher (6/11) or other data-sharing strategies (5/11), including congresses and national or personal networks. Strategies combining research reanalysis and international data sharing foster national and international collaborations (21 collaborative projects) and have improved scientific knowledge of rare disorders. Our results have led to 15 scientific publications [19, 22,23,24,25,26,27,28,29,30,31,32,33], and 12 papers are currently in progress.

Confirmation of previously published disease-causing genes

cES analysis is often restricted to disease-causing genes reported in the OMIM database, in which 5102 disease-causing genes are associated with a genetic disease (updated 19th February 2019). When next-generation sequencing-based technologies appeared, the number of known genes responsible for human diseases increased exponentially, and new genes are published every day in the scientific literature, in particular for ID/MCA [7]. Despite regular updates, the OMIM database remains incomplete and provides only limited resources for the diagnosis of rare diseases.

Access to all available scientific literature is essential, because it allows researchers to identify genes previously published but not yet recognized in the OMIM database [11]. Nineteen of our 40 involved genes, from 7% of the 313 reanalyzed individuals, had previously been published but were not referenced in the OMIM database. TBR1 (MIM 604616) variants were identified in two patients with ID and autism. The TBR1 gene was first reported in patients with autism in 2012, and more than ten unrelated patients were later reported in different large cohorts of autistic individuals. Functional data and animal models also underlined the involvement of TBR1 in autism/ID, though it was not recognized in the OMIM database [34,35,36,37,38,39]. Our data sharing identified 20 additional individuals with ID and TBR1 variants, definitively establishing causality (Nambot et al., in press).

Regular updates to the OMIM database encourage prospective diagnostic reanalysis. This can lead to new diagnoses from recently identified genes that are progressively reported in the OMIM database, but the molecular diagnosis is generally delayed by at least 1 year [8]. Three of our 40 disease-causing genes (1% of the 313 reanalyzed individuals) had recently been described in the literature (Fig. 2, Supplementary Table 1). Frequent literature reviews and prospective updates of bioinformatics pipelines would ensure the diagnosis of rare diseases linked to recently identified genes.

Extension of the phenotype–genotype spectrum of well-known genes

Another fundamental challenge for variant interpretation is to provide a positive diagnosis for well-known genes when the phenotype or genotype is atypical. Clinical heterogeneity is common in rare diseases, ranging from vast, overlapping clinical spectrums to completely different phenotypes [40,41,42]. In the OMIM database, 729/2664 disease-causing genes are associated with two distinct or overlapping phenotypes, and 235/2664 genes have more than four phenotypes, highlighting the complexity of genotype–phenotype correlation (updated 7th April, 2018). Phenotypic variability can be explained by the impact of variants (hypomorphic or null alleles [41], gain or loss of function [40]) and by the genetic and/or environmental background [43, 44]. In ultrarare diseases, knowledge is limited by the lack of individuals with causal variants in the same gene. Research analysis could therefore identify new phenotypes linked to well-known genes that are so far recognized for only one phenotype with different clinical presentations in the OMIM database. If there is no correlation with the OMIM phenotypic reference, reverse phenotyping limits the chances of delivering an initial diagnosis. In an individual affected with a newly identified severe Bohring–Opitz-like syndrome, ES revealed a homozygous variant in the KLHL7 gene (MIM 611119). KLHL7 was previously reported in cold-induced sweating syndrome 3, which is characterized by clinical features that partly overlap with our patient’s presentation (MIM 617055) [23]. We submitted the case to the GeneMatcher platform and presented it at an international congress in the hope of identifying additional patients, and thus recruited five additional patients with a Bohring–Opitz-like presentation. We finally expanded the clinical spectrum of KLHL7 autosomal-recessive variants by describing a syndrome with features overlapping cold-induced sweating syndrome 3 and Bohring–Opitz syndrome [23].

Ongoing challenges in variant interpretation and candidate genes

Variant interpretation in ES research analysis is mainly based on suspected inheritance, phenotypic, variant, and functional databases, variant prediction scores, animal models, accessible literature, and data sharing (Fig. 1). However, even with access to multiple data sources and the many tools designed to exploit them, many candidates remain (Fig. 2b).

The use of a singleton strategy for diagnosis has an economic advantage [8], but shows its limits when the analysis is extended to research. The absence of parental segregation reduces the ability to select candidate variants on the mode of inheritance, since de novo variants are confirmed to affect function in almost 90% of sporadic individuals. In contrast, autosomal dominant or X-linked inherited variants remain variants of uncertain significance in 60–70% of cases, and recessive variants in 50% of cases. Most inherited variants are variants of uncertain significance because it is difficult to distinguish a causal variant among all the inherited variants (~50% of variants), and because incomplete penetrance and intra-familial variability have been described in some rare genetic diseases. Access to parental segregation data and detailed phenotyping would certainly be an advantage in this highly complex context.

The interpretation of missense variants also continues to be a challenge, even more so than that of truncating variants (40% of missense variants remained variants of uncertain significance). Missense variants can be located in functional or structural protein domains (CSNK2A1) or not (SAMD9, CLTC), clustered (PACS2, NACC1), or dispersed around the gene (NR2F1), with gain or loss-of-function impacts (Supplementary Table 1 and Table 1). The absence of functional studies makes predicting the impact of missense variants difficult. Since missense variants make up 75–80% of all rare variants in the exome, the chance of selecting a missense variant as a candidate variant is statistically increased. Algorithms have been developed to help with variant interpretation, but they are not sufficient without biological validation (Fig. 1). This highlights the importance of the international effort to complete public variant databases, such as ClinVAR or DECIPHER [45, 46]. In addition to the missense variants, a large portion of splice site variants remained candidates (Fig. 3e).

Our work shows that data sharing is essential for establishing human genotype–phenotype relationships and conclusively classifying variants. The Matchmaker Exchange Initiative offers a data-sharing platform that can be used to match patients all over the world, according to their phenotypic or genotypic features [47]. Data-sharing tools increase diagnostic yield and should be actively used by the scientific and medical communities. In this study, data sharing was used to determine the involvement of 40/66 genes in rare disease, mainly for de novo missense variants. Low or no recurrence, however, did not allow us to establish the involvement of 26/66 genes in human disease. Because these genes are candidates for ultrarare phenotypes and the number of clinicians or scientists sharing genotypic data is relatively small, additional cases have not yet been identified. In time, the increasing amount of data shared on international platforms should provide additional opportunities to reach a conclusion [25, 30].

Numerous cES analyses remain negative because the cohort comprises diverse and heterogeneous disorders of unknown etiology, which may include non-Mendelian diseases or diseases with environmental/external causes [48]. In addition, the detection of variants remains dependent mainly on the sequencing technique, the exome capture kit, the exome coverage, and the bioinformatics pipeline used [49,50,51]. A reanalysis of cES data with an updated pipeline that reflects the latest knowledge and the progress of bioinformatics software appears to be a determining factor in the identification of new variants during reanalysis, notably in poorly covered genes, exon/intron boundaries, and noncoding sequences [48].

An effective strategy for increasing positive yield

This study demonstrates how singleton-ES research reanalysis can efficiently and rapidly increase scientific knowledge in rare diseases by identifying new disease-causing genes, implicating recently described genes not reported in the OMIM database, or extending the phenotype or genotype spectrum of well-known genes. This strategy is also a rapid means of obtaining a diagnosis, with positive results in 15% of individuals with negative cES, leading to nearly 11% of additional diagnoses in the initial cohort. The limits of the singleton strategy could be overcome with a second-step strategy based on trio ES. Complementary ES strategies increase the diagnostic ability of ES and should be explored further; it would be worthwhile to integrate them promptly after an initial diagnostic analysis as diagnostic laboratories adapt their exome analysis strategy in the future.