Large-scale studies derived from genomic and health data repositories are at the forefront of personalized medicine. The US NIH Cancer Genome Atlas Project, for example, has generated maps of genomic changes in a large variety of cancer types and cases to help improve diagnosis and treatment based on the large-scale genome sequencing of patients. Another recent example is polygenic risk scoring (PRS), which, despite criticism that it should be carefully evaluated, will be used by the NHS for risk assessment as part of clinical decision-making, including access to screening. PRS, too, has become more widely available due to the development of population-level genetic studies, particularly genome-wide association studies (GWAS). Both cancer genomics and PRS demonstrate the need for and challenges involved in comparable and viable data sharing in population-scale genomic research. Here, we focus on genomic repositories sharing data among themselves, with researchers, and with donors. Despite the rapid expansion of large-scale national genomic repositories and efforts to create diverse datasets, genomic diversity, which is crucial in comparative research, is not easily attainable. Many barriers to sharing genomic data have already been identified, including the interrelated challenges of comparability, confidentiality, and viability.

First, let us consider comparability. While the research community agrees that many more samples and genomic data are required, the diversity and limited availability of genomic data often create challenges in terms of comparability. Previous studies comparing biobank data from the UK, Japan, and Taiwan have shown that the prediction accuracy of PRS based on UK data was far lower in non-European populations: on average, it was 2.5-fold lower in East Asians and 4.9-fold lower in Africans [1]. Similarly, differences in accessible genomic data between populations have led, for example, to the development of Japanese oncogenetic panels for the Japanese population [2]. Transforming genomic data into scientific insights and knowledge requires appropriate quantifications and annotations from medical and clinical perspectives. Particularly when researchers use decentralized datasets internationally, the processing of genomic data must be standardized, yet direct access to raw data is often avoided due to privacy concerns. However, these data repositories are still thought of as local and separate datasets [3]. In this regard, one of the key tasks is harmonizing the interpretation and reporting of disease-associated markers detected during sequencing, for example, cancer germline and somatic mutations and their various categories of variants [4]. Moreover, there is an ongoing struggle to integrate health data from separate labs or hospitals with genomic data.

Second, to ensure the privacy and confidentiality of participants, data from national healthcare systems are rarely accessible outside of institutional boundaries. Even in digital form, individual genomic data are sensitive and carefully protected because, with effort, they can be re-associated with specific individuals. Furthermore, biomedical researchers require access not only to genomic data but also to additional medical and lifestyle information about individuals, creating further challenges regarding confidentiality and privacy.

Third, the continued viability and usefulness of repositories require significant resources [5]. While some biobanks and data repositories are solely supported by governmental funding, others depend on partnerships with commercial companies or have a “self-sustaining” business model. Because DNA donation is built on public trust and solidarity, when biobanks are connected with pharmaceutical companies as part of a joint venture, the impact of such public–private sector partnerships on public trust and the solidaristic ethos of “sharing while caring” should become a matter of concern [6].

The above-mentioned challenges involved in sharing genomic data can be illustrated by the biobanks of the UK, Japan, and Israel. In these three population-based initiatives, DNA samples were collected from donors, and together, they could lead to increased comparability, data sharing, and viability. The UK Biobank Project sequenced the genomes of over half a million individuals. Although the UK Biobank was designed to be representative of the general population of the UK, 94% of the individuals whose data it holds are “White” Europeans; in addition, compared with the general population, donors are older, better educated, wealthier, and generally healthier [7]. Japan’s biobank is not diverse either. Given its home country’s ethnic diversity, the Israeli “drop for research” biobank could be relatively more diverse than the other two biobanks, but no data are available regarding its actual ethnic composition. Together, these repositories are diverse, yet we are not aware of any plans for collaboration among the three biobanks to increase comparability.

While all three biobanks maintain participant privacy and obtain broad consent, they differ in terms of their policies on data sharing with donors. For instance, the Tohoku Medical Megabank (TMM) Project in Japan, with more than 150,000 participant samples, carefully returns individual genomic results alongside genetic counseling [8]. The Israeli biobank, with 150,000 donor samples, has a policy of returning only actionable incidental genetic findings to donors [9]. The UK Biobank will return only nongenetic findings measured at enrollment [10]. Finally, in terms of viability, while the UK and Japanese biobanks have relatively secure funding through independent non-commercial funding bodies and governmental funding agencies, the Israeli biobank, which is operated by an HMO (Maccabi), has adopted a “self-sustaining” business model that depends on cost recovery through user fees.

The challenges we have described are expected to intensify in the near future as biobanks and cancer sequencing merge. Comparability and data sharing are becoming even more crucial as AI algorithms are increasingly used for GWAS. The variety of biobanks’ consent models may lead to the addition of dynamic re-consent to the original broad consent, perhaps as a result of future coordination [11]. Federated Health Data Networks (FHDNs) have recently been proposed to facilitate the sharing of sensitive health data across healthcare institutions as well as regional and national borders [12]. In this model, a series of decentralized, interconnected nodes allows data to be queried by other nodes in the network without the data leaving the node where they are located. As opposed to data sharing, transfer, or pooling, FHDNs facilitate data access or data visiting, meaning that queries and algorithms, which are increasingly used to analyze large genomic datasets, can be sent and applied to the pseudonymized data. Genomic repositories must explore these new technological options, along with appropriate public engagement and genomic promotion programs, to increase their comparability, data sharing, and viability.
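
To make the "data visiting" idea behind FHDNs more concrete, the following is a minimal sketch in Python, not a description of any existing FHDN implementation. It assumes a simplified setting in which each node holds pseudonymized records and answers a query with aggregate counts only. The names `Node`, `run_count_query`, `federated_frequency`, and `is_carrier`, the record fields, and the toy data are all hypothetical and introduced purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical schema: one pseudonymized participant record per dict.
Record = Dict[str, object]


@dataclass
class Node:
    """One repository in the network; raw records stay inside this object."""
    name: str
    records: List[Record] = field(default_factory=list, repr=False)

    def run_count_query(self, predicate: Callable[[Record], bool]) -> Dict[str, int]:
        """Apply a visiting query locally and return only aggregate counts."""
        matches = sum(1 for record in self.records if predicate(record))
        return {"matches": matches, "total": len(self.records)}


def federated_frequency(nodes: List[Node], predicate: Callable[[Record], bool]) -> float:
    """Send the same query to every node and pool the returned counts."""
    results = [node.run_count_query(predicate) for node in nodes]
    matches = sum(r["matches"] for r in results)
    total = sum(r["total"] for r in results)
    return matches / total if total else 0.0


def is_carrier(record: Record) -> bool:
    """Toy query: is the (made-up) carrier flag set for this participant?"""
    return bool(record.get("carrier"))


# Toy data invented for illustration; real nodes would sit behind
# institutional access controls rather than in one process.
nodes = [
    Node("node_a", [{"pid": "a1", "carrier": True},
                    {"pid": "a2", "carrier": False},
                    {"pid": "a3", "carrier": False},
                    {"pid": "a4", "carrier": False}]),
    Node("node_b", [{"pid": "b1", "carrier": True},
                    {"pid": "b2", "carrier": True},
                    {"pid": "b3", "carrier": False}]),
    Node("node_c", [{"pid": "c1", "carrier": False},
                    {"pid": "c2", "carrier": False},
                    {"pid": "c3", "carrier": False}]),
]

freq = federated_frequency(nodes, is_carrier)
print(f"Pooled carrier frequency across {len(nodes)} nodes: {freq:.2f}")
```

In this sketch, the coordinator only ever sees per-node counts, which is the sense in which the query travels to the data rather than the data to the query. A real network would additionally need authentication, query vetting, and safeguards against re-identification through small counts, none of which are modeled here.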