Nearly a generation (~24 years) has elapsed since the identification of the breast cancer susceptibility genes, BRCA1 (ref. 1) and BRCA2 (ref. 2). Over that time the norms and policies surrounding the sharing of human genetic data have evolved. In this commentary, we examine the lessons learned about how data sharing can facilitate an understanding of the scope and consequences of genetic variation, and we explore how these lessons apply to understanding human genomic variation more broadly.

The sharing of data among geneticists has waxed and waned through time. A notable nadir was reached during the race to identify the genes responsible for familial breast and ovarian cancer. The search for the BRCA1 gene was characterized by intense competition and shifting alliances.3 During the “gene hunt” phase, data sharing between (and even within) groups was minimal. After the BRCA1 gene was identified in 1994 (ref. 1), several of us called for a new, more open era to guide BRCA research in the future.4 A tangible outcome of this call was the creation of an open access database, the Breast Cancer Information Core (BIC), in 1995 (ref. 5). The mission of the BIC was to accelerate research by gathering and freely sharing information related to breast cancer genes. In particular, the BIC was established as a repository of germline variants in BRCA1 and BRCA2 (collectively, BRCA) in an effort to record all sequence variants and ensure that this information was freely available to the research community. The BIC has been in continuous operation for over two decades and has been cited in more than 2700 publications (https://research.nhgri.nih.gov/bic/).

Sharing human variant data: the early days

From its inception, the BIC used the then-new World Wide Web to share data with anyone with an Internet connection. The inspiration for using the web to distribute human genetic variant data came from the cystic fibrosis gene pathogenic variant database established by Lap Chi Tsui in Toronto.6 Perhaps the most well-known single-gene database at the time, this list of CFTR variants was distributed by Dr. Tsui to subscribers each month via fax. One of us (L.C.B.) sat near the fax machine and collected page after page as the CFTR “database” streamed onto the floor. In addition to saving paper, we thought that sharing information digitally would allow investigators to import and analyze the data directly.

The BIC website debuted in 1995. To place this event in context, the first widely used web browser, NCSA Mosaic, was introduced in the fall of 1993; Amazon, Inc. was established in 1994; and Google would not debut for another three years. The BIC was sharing data a year before the Human Genome Project proposed the Bermuda Principles, the plan that called for the prepublication release of genomic sequences (https://web.ornl.gov/sci/techresources/Human_Genome/research/bermuda.shtml).

The earliest BRCA data deposits were provided by researchers conducting sequence analyses of research participants. BIC was one of the first databases that provided free access to individual level, unpublished data, enabling the community to advance research and clinical studies.4 Later, as testing moved from research to clinical labs throughout the world, the latter became the main sources of data. For more than a decade, the main US testing lab, Myriad Genetics, freely shared their BRCA pathogenic variant data via the BIC. Myriad Genetics ceased contributing data to the BIC in 2006, and without Myriad, the volume of data being deposited decreased greatly and the main depositors were academic labs and non-US-based clinical labs. Data volume changed again in 2013 ("Shifting Landscapes" section below). In the last four years, more than 50 clinical testing laboratories have embraced an open access model and deposited tens of thousands of variants to public databases.7

The collaborative relationship between the BIC, testing laboratories, and researchers demonstrated the importance of capturing unpublished data directly from clinical labs: doing so facilitates and expedites the classification of variants. For example, even in the absence of data on formal control samples, it quickly became clear that some missense variants, originally thought to be pathogenic, were actually benign population variants.8,9 This practice of data sharing, pioneered by the BIC, has expanded to other loci as well, as clinical genetic testing laboratories recognize the value of data sharing in moving the field forward.

Classification of variants of uncertain significance

During its first decade, the BIC’s main user base was scientists who found value in having easy access to BRCA variant data. Importantly, scientists were comfortable classifying variants as clinically significant, benign, or unknown. The BIC operating principles were to share data and have the scientific community determine the functional significance of each allele. This approach worked well until large numbers of clinicians, diagnostic laboratory staff, and even patients themselves registered to use BIC data. Of particular interest were variants of unknown significance (VUS), i.e., variants whose functional consequences were unknown. Such a clinical test result can be difficult to explain to patients, and many clinicians are inexperienced in understanding the inherent uncertainty in genetic testing. The BIC Steering Committee recognized the VUS problem created by declaring a variant “uncertain” and developed a more consistent classification process that the committee itself managed. Classifications of clinical significance were made following discussions that weighed all available data and relied on member expertise and experience. This process was successful but resource-limited; therefore, a more robust and scalable approach was required.10,11,12

The Evidence-based Network for the Interpretation of Germline Mutant Alleles (ENIGMA)13 (https://enigmaconsortium.org) grew out of the BIC Steering Committee in 2009 to promote large-scale collaborative studies and standardized approaches to assess the clinical significance of BRCA1 and BRCA2 variants and other breast cancer susceptibility genes. The defining feature of the ENIGMA approach is the integration of multiple types of data.14 ENIGMA developed a set of likelihood-based rules for BRCA variant classification. These rules derive quantitative and qualitative measures by comparing the behavior of known pathogenic and nonpathogenic alleles with regard to multiple phenotypes, e.g., segregation in families, tumor pathology, associated cancers, and phylogenetic analysis. Conceptually, these are similar to the classification criteria for mismatch repair genes developed for inherited colon cancer15 and formalized by the International Society for Gastrointestinal Hereditary Tumors (InSIGHT)16 (http://www.insight-database.org/classifications/). Uniform, structured classification criteria should result in objective variant classification. In this way, the hereditary breast and ovarian cancer and hereditary colon cancer research communities have been able to move beyond “expert opinion” as the main mode of variant classification. Open and transparent classification methods also create a community of professionals who initiate interlaboratory discussions when discordant classifications are reported. National organizations, such as the American College of Medical Genetics and Genomics (ACMG) and the Association for Molecular Pathology (AMP), have developed their own guidelines to serve as a more generic framework for variant classification of Mendelian diseases. These recommendations are based on a structured review of different types of qualitative evidence with preassigned weights.17,18
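To make the idea of likelihood-based integration concrete, the following is a minimal sketch of how independent lines of evidence might be combined, assuming each line (for example, segregation or tumor pathology) has already been summarized as a likelihood ratio in favor of pathogenicity. The function names, the prior, and the probability cut-offs are hypothetical illustrations and are not taken from the ENIGMA rules themselves.

```python
from math import prod
from typing import Sequence


def posterior_probability(prior: float, likelihood_ratios: Sequence[float]) -> float:
    """Combine a prior probability of pathogenicity with independent
    likelihood ratios using Bayes' rule in odds form."""
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    if any(lr <= 0.0 for lr in likelihood_ratios):
        raise ValueError("likelihood ratios must be positive")
    prior_odds = prior / (1.0 - prior)
    # Independence is assumed, so the likelihood ratios simply multiply.
    posterior_odds = prior_odds * prod(likelihood_ratios)
    return posterior_odds / (1.0 + posterior_odds)


def classify(posterior: float) -> str:
    """Map a posterior probability to a five-tier label.
    The cut-offs below are illustrative assumptions only."""
    if posterior > 0.99:
        return "pathogenic"
    if posterior >= 0.95:
        return "likely pathogenic"
    if posterior >= 0.05:
        return "uncertain significance"
    if posterior >= 0.001:
        return "likely benign"
    return "benign"


# Hypothetical variant: a neutral prior plus two supporting lines of evidence.
p = posterior_probability(prior=0.5, likelihood_ratios=[20.0, 10.0])
print(classify(p))
```

In a sketch like this, a variant remains of uncertain significance until enough independent evidence accumulates to move the posterior past a threshold, which is one reason shared data matter for resolving VUSs.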

Shifting landscapes

In the late spring of 2013, one technological advance and one judicial ruling irreversibly changed the landscape of genetic testing for susceptibility to inherited cancer. Technical progress came in the form of massively parallel sequencing technologies, which led to multiplexed DNA sequence-based testing. Tests could now easily include 5 to 50 putative cancer susceptibility genes for a lower cost than single-gene tests. The second event occurred in June 2013 when the US Supreme Court unanimously invalidated Myriad Genetics’ patents on the BRCA genes. In the United States, immediately after this ruling, new clinical labs entered the BRCA1 and BRCA2 test market. In this competitive environment, the cost of a combined BRCA1 and BRCA2 test dropped from ~US$4000 to less than US$400.

These changes in the testing landscape greatly increased the amount of BRCA sequence data being generated.19 Multiple commercial laboratories began sharing BRCA1 and BRCA2 variants from all patients with the BIC. The BIC curation pipeline could not process this volume. In response, the BIC began processing these new data in conjunction with the National Center for Biotechnology Information (NCBI). This represented a break from the past, when locus-specific databases (LSDBs) were curated by small groups of collaborators. Using the BIC as a model, NCBI created a new aggregation of LSDBs, dubbed ClinVar (https://www.ncbi.nlm.nih.gov/clinvar/). ClinVar now contains variant data for many clinically relevant genes, and includes all historical BIC data as well as newly sequenced variants for BRCA1 and BRCA2. Transferring the data acquisition, archiving, and display from the BIC to ClinVar has two advantages. ClinVar employs dedicated staff to process, curate, and display large data sets. In addition, as an integral part of the NCBI, ClinVar has a commitment to archive data permanently.

The need for expert panels

For patients undergoing clinical BRCA testing, the VUS rate ranges from 2% to 15% depending on the testing laboratory and patients’ ethnic background.20,21,22 While the proportion of VUS results has substantially decreased since the early 2000s (due to research and classification efforts), a significant number of individuals are informed that they carry a VUS. Widespread data sharing can help to decrease the rate of VUS test results because increased knowledge about both phenotypes and allele frequencies contributes to variant classification.

ClinVar is now the largest source of directly deposited BRCA variant data. ClinVar staff do not evaluate the biological or clinical impact of variants. Instead, ClinVar compiles and shares variant classifications performed both by labs submitting variants and by “expert panels” that evaluate variants deposited by others using as many resources as possible. ENIGMA serves as an expert panel for the BRCA1 and BRCA2 genes in ClinVar. Even for well-curated genes such as BRCA1 and BRCA2, the interpretation of variants is one of the largest hurdles in dealing with the massive amounts of data generated through gene panels as well as exome and genome sequencing. Successful VUS classification relies heavily on open access, transparent data. Open access data also allows other groups to download and redistribute data with significant enhancements. An example of this is the newly created BRCA Exchange (http://brcaexchange.org), which is striving to facilitate collection of variants and associated clinical data from around the world and display this information using a clinician- and patient-accessible interface.

BRCA testing evolves and expands

Twenty years ago, genetic testing for BRCA was offered in a limited number of academic clinical centers, and only to those who had a high prior probability of carrying a clinically significant variant. Today, hundreds of thousands of genetic tests are ordered annually in a variety of settings. Exome and genome sequencing are used clinically, particularly for undiagnosed pediatric patients and rare Mendelian disorders. Exome sequencing and gene panel testing are being used to find somatic pathogenic variants in tumors. Genetic testing of BRCA to guide treatment options such as poly ADP ribose polymerase (PARP) inhibitors is currently recommended for ovarian cancer and metastatic breast cancer and may become the standard of care for other cancers.23 There have also been calls for population-based screening of BRCA,24,25 but testing of unselected individuals is controversial. Undoubtedly, the increased screening for BRCA variants, both directly and as a secondary finding, will increase the number of VUSs reported. Ongoing deposition of these new variants and associated clinical data into public databases will be vital if expert panels are to continue their classification and resolve VUSs.26 While great progress has been made in this area, the sharing of variant data is not yet universal. Complete ascertainment of data will require changes in culture, policies, and business models, some of which treat the patient data that laboratories generate as proprietary information.

The path forward

For the last two decades, LSDBs were the main way gene-specific data were collected, stored, curated, and distributed to the community. There are several reasons for this: historically, individual scientists were experts on single genes or gene families; in the early days of sequence data acquisition there was no standardization of database architecture; and sequencing of large numbers of genes across individuals was not yet feasible. Computationally, LSDBs represented a “Tower of Babel” as each database custodian collected data in an organic way and developed their own data fields, codes, and methods of presenting data. This heterogeneity inhibited centralization. In 2013, it was estimated that there were more than 2000 databases on genes and diseases worldwide.27 Because of these issues, national centers such as NCBI, the European Molecular Biology Laboratory–European Bioinformatics Institute (EMBL-EBI), and other groups operating central databases were not interested in absorbing LSDBs. The separation of LSDBs from central sequence data narrowed with the widespread acceptance of the Leiden Open Variation Database (LOVD). The goal of LOVD is to provide a “flexible, freely available tool for gene-centered collection and display of DNA variations” (http://www.lovd.nl/). As a large number of LSDBs adopted this format, it became easier for centralized databases, such as ClinVar, to import the locus-specific information. It also enabled functional and other data to be integrated according to standardized guidelines applicable to any gene or genomic locus.
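The benefit of a shared format can be sketched in a few lines. The example below assumes two hypothetical LSDBs that store the same variant under different field names and shows how a per-database mapping lets a central resource import both into one record type. The database names, field names, and the variant itself are invented for illustration and do not reflect the actual LOVD or ClinVar schemas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantRecord:
    """Common record that a central database might import (hypothetical schema)."""
    gene: str
    hgvs_cdna: str
    classification: str


# Hypothetical mapping from each database's own column names to the common fields.
FIELD_MAPS = {
    "db_a": {"gene": "Gene", "hgvs_cdna": "cDNA_change", "classification": "ClinSig"},
    "db_b": {"gene": "locus", "hgvs_cdna": "variant", "classification": "effect"},
}


def normalize(source: str, raw: dict) -> VariantRecord:
    """Translate one raw record from a named source into the common schema."""
    mapping = FIELD_MAPS[source]
    fields = {name: str(raw[column]).strip() for name, column in mapping.items()}
    # Lower-case the classification so that labels compare equal across sources.
    fields["classification"] = fields["classification"].lower()
    return VariantRecord(**fields)


# The same (invented) variant as two custodians might have recorded it.
record_a = normalize("db_a", {"Gene": "BRCA1", "cDNA_change": "c.100A>G", "ClinSig": "Unknown"})
record_b = normalize("db_b", {"locus": "BRCA1", "variant": " c.100A>G", "effect": "unknown"})
print(record_a == record_b)
```

Without an agreed format, a mapping of this kind has to be written and maintained for every database separately, which is the heterogeneity that inhibited centralization.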

Difficult issues relating to clinical data collection on a genome-wide scale remain. One of the largest is securing sufficient and stable funding to cover the personnel and computational infrastructure required to coordinate data collection and distribution and variant curation and classification. Those depositing data also require resources to collect and prepare the data for submission. It is difficult for academics to secure grant funding for these activities, and commercial entities must use their own funds to support data sharing. When financial support for submission is no longer available, data flow stops. Curtailing either submission or curation leads to a database quickly becoming outdated. In theory, computational methods could make the entire process less labor intensive. However, the availability of large amounts of clinical sequencing data has revealed that “one size fits all” in silico–based variant classification tools perform very poorly unless they are used in conjunction with additional data such as functional assays or multifactorial models. For genes associated with very rare diseases, there may only be a small number of individuals with the expertise to appropriately assess the data. Gene-specific knowledge of elements such as key functional domains, disease-associated functions, and types of variants that are causal of phenotype remains important and is the basis for the ACMG/AMP classification scheme. Thus, the long-term need for locus-specific experts will continue.

As we move from single genes to genome sequences, we will need to determine what features of variant classification can apply to many genes and what needs to be considered on a gene-by-gene basis. The newly enacted regulations covering, and the emerging awareness of, data privacy may further complicate the sharing of individual multilocus data. Finally, even with these frameworks in place and extant expert panels for all genes, there is a need to acknowledge the importance of quality control, analytical validity, and data interpretation. Higher-throughput sequencing technology has its own weak spots in terms of analytical validity, read depth, coverage of specific regions, pseudogenes, and large rearrangements. The use of national oversight on clinical sequencing data from organizations such as the College of American Pathologists, CLIA, the Euro QC network (and others) is essential.

Conclusions

One of the critical questions moving forward is how to scale variant curation and interpretation to cover the thousands of genes associated with Mendelian disorders. Errors in classification or annotation can have clinical consequences. For example, several BRCA variants have been downgraded from pathogenic to VUS, a situation particularly likely when such variants have been identified in understudied populations, where control data might not have been available at the time of original classification.28,29 For individuals who had prophylactic mastectomies based on inaccurate classification or misinterpretation, this impact is real.30 This underscores the importance of obtaining genetic variation data from populations of diverse ancestry. This can be achieved by infusing the culture of data sharing into genetic testing labs across the globe and ensuring broad access to genetic testing services to underrepresented populations. The large numbers of clinical tests being performed, the increasing willingness of academic and commercial interests to share data, and the existence of expert panels to provide ongoing classification create a virtuous cycle. The actions of the inherited cancer susceptibility research community can serve as a model for scaling of variant curation.

One lesson we can take from the classification of variants in BRCA1 and BRCA2 and other cancer-predisposition genes is that there is not a universal approach to variant classification. For each gene/syndrome, classification of variants using integrated multifactorial models may require creating gene-specific tools and collecting disease-specific phenotypic data. It is critical not to lower our standards on what evidence is required for variant classification. Over 20 years of BRCA research and extensive testing data were required to arrive at our current depth of knowledge. Moving forward, we expect that the pace of variant classification and integration of genetic data into clinical settings will increase, led not only by technological innovations but also by our evolving understanding of the data required for each gene.

The history of variant classification for inherited breast and ovarian cancer has produced a set of best practices for the BRCA genes. This history can inform the field as we endeavor to understand variation in other genes. Generating and disseminating such knowledge takes energy, time, and funding. In the short term, we need to be honest, comfortable, and transparent with the elements of uncertainty currently present when evaluating the clinical impact of genetic variation. The sharing of sequence and phenotypic data by researchers and clinical testing labs from around the world, serving multiple diverse populations, is essential to the classification process. We need to be aware of what has been done before so as not to “reinvent the wheel” but rather to leverage the strides that have been made in understanding the phenotypic implications of genetic variation.