INTRODUCTION

The limb-girdle muscular dystrophies (LGMDs) are a heterogeneous group of diseases, causing pelvic and shoulder girdle muscle weakness and wasting. There are currently 32 characterized subtypes1 with a diverse range of clinical phenotypes, which show variability in age of onset, rate of progression, specific muscle wasting patterns, and involvement of respiratory and cardiac muscles. The subtypes are broadly categorized by their pattern of inheritance as either dominant (LGMD1A-I) or recessive (LGMD2A-X), with the majority being recessive, and can harbor either loss-of-function or missense pathogenic variants. The proteins encoded by LGMD disease genes have diverse cellular functions, including glycosylation and muscle membrane integrity, maintenance, and repair; disruption of any of these mechanisms results in muscle damage and degeneration.

Currently, an effective treatment does not exist for any LGMD subtype; however, promising gene therapy clinical trials have commenced for LGMD2E and additional subtypes are set to commence in 2019–2020.2 Disease prevalence information is critical to the planning and prioritization of these clinical trials. Historically, the prevalence of rare diseases has largely been estimated from epidemiological surveys and patient registries.3,4,5 However, it can be difficult to achieve an accurate and meaningful prevalence estimate for rare genetic disorders through these traditional approaches. Many patients with rare disease experience a delayed or incorrect diagnosis, which can be more pronounced for late onset, slowly progressing diseases such as LGMD,6 leading to underestimation of prevalence. Differences in the diagnostic criteria used between studies, as well as changes to these over time, can make it difficult to directly compare estimates across studies. The specific population studied can also bias the prevalence estimate, and indeed the current published prevalence of LGMD subtypes can vary greatly between countries and even regions within countries.7 The factors contributing to these regional differences include small sample size, founder variants, and consanguinity rates, all of which can lead to increased incidences of LGMD in those populations.8,9,10 In addition, the resources and training available to each health-care system can contribute to regional variability. Improved methods for quantifying the prevalence of rare genetic disorders such as LGMD are thus needed.

Using variants identified by large human exome and genome research studies as population references has greatly aided the filtering and interpretation of variants found in individuals with rare disease, and the study of known disease variants in the general population.11 The growth of these population genetic databases has enabled allele frequency data to be more widely used for estimating disease prevalence. However, there have been two main challenges with using allele frequencies from population reference databases to estimate prevalence. Firstly, the sample sizes can be insufficient to robustly estimate allele frequencies associated with rare diseases for which the majority of pathogenic variants are observed rarely in the general population. In addition, many databases have been inadequate for the estimation of disease prevalence in non-European populations. Although Bayesian methods for estimating disease prevalence have been developed and applied to allele frequency from large databases,12 they currently do not incorporate separate prior distributions for each functional annotation (e.g., nonsense, missense, etc.).

In this study, we used publicly available population references to obtain a more robust disease prevalence estimation for recessive LGMD (LGMD2). Previous epidemiology studies (Table 1) are subject to the biases described above, and estimates derived from population reference panels can vary considerably across reference databases when allele frequencies are based on a single observation. Although variants shared between reference databases have similar allele frequencies when common (>0.5%), their frequencies may differ greatly when rare (<0.5%). In fact, 69% of European singletons within the Exome Sequencing Project (ESP) are not observed in the much larger ExAC data set.13 To overcome this bias, we developed a Bayesian framework that re-estimates allele frequencies by taking advantage of prior knowledge of the overall distributions of allele frequencies for different functional annotations (e.g., missense, frameshift, etc.) and yields robust prevalence estimates with a confidence interval. Using population reference panels from ExAC and gnomAD, we re-estimated allele frequencies for the various functional annotations simultaneously and then estimated disease prevalence assuming Hardy–Weinberg equilibrium. Overall, we provide a generalizable and robust framework to estimate disease prevalence for LGMD2 subtypes that can be easily adapted for other autosomal recessive diseases.

Table 1 Estimated prevalence (per million) in nine LGMD2s and published epidemiology estimates.

MATERIALS AND METHODS

Identification of pathogenic variants

For each disease gene, variants were downloaded from the gnomAD database. The Emory Genetics Laboratory (EGL) and ClinVar databases30 were used to annotate known pathogenic variants. Retrieved variants were first filtered based on their allele frequencies (AF). Only variants whose minor allele frequencies are less than 0.05% in the gnomAD database were kept, unless they have been annotated as pathogenic or pathogenic/likely pathogenic in either of these two databases (EGL and ClinVar). Using the American College of Medical Genetics and Genomics (ACMG) guidelines for defining pathogenic variants,11,30 we classified loss-of-function type variants as pathogenic (e.g., frameshift, stop gain, splicing donor, splicing acceptor) whether or not they were listed as pathogenic in the EGL or ClinVar databases. For the other types of variants, as long as they were annotated as pathogenic in either the EGL or ClinVar database, they were classified as pathogenic.
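The filtering and classification rules above can be sketched as follows. This is an illustrative reading of the text, not the released scripts: the `Variant` record, its field names, and the normalized consequence and assertion labels are hypothetical, and the exact handling of a variant sitting precisely at the 0.05% cutoff is an assumption.

```python
from dataclasses import dataclass

# Consequence labels treated as loss-of-function (hypothetical normalized names).
LOF_CONSEQUENCES = {"frameshift", "stop_gained", "splice_donor", "splice_acceptor"}
# Clinical assertions accepted from EGL or ClinVar (hypothetical normalized names).
PATHOGENIC_LABELS = {"pathogenic", "pathogenic/likely pathogenic"}
MAX_AF = 0.0005  # 0.05% allele-frequency cutoff, expressed as a fraction


@dataclass
class Variant:
    consequence: str         # functional annotation of the variant
    allele_frequency: float  # gnomAD allele frequency as a fraction
    clinvar: str = ""        # ClinVar assertion, empty if absent
    egl: str = ""            # EGL assertion, empty if absent


def is_reported_pathogenic(v: Variant) -> bool:
    """True if either database lists the variant as pathogenic."""
    return (v.clinvar.lower() in PATHOGENIC_LABELS
            or v.egl.lower() in PATHOGENIC_LABELS)


def is_pathogenic(v: Variant) -> bool:
    """Apply the frequency filter, then the classification rule."""
    reported = is_reported_pathogenic(v)
    # Frequency filter, waived for variants reported as pathogenic.
    if v.allele_frequency >= MAX_AF and not reported:
        return False
    # Loss-of-function variants count regardless of database status;
    # every other variant type needs a pathogenic assertion.
    return v.consequence in LOF_CONSEQUENCES or reported
```

Under this reading, a rare unreported missense variant is excluded, whereas a reported pathogenic missense variant is retained even when its frequency exceeds the cutoff.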

The above analysis is limited to known pathogenic variants and loss-of-function variants. We used the Combined Annotation Dependent Depletion (CADD) score31 cutoffs to include more variants as potentially pathogenic. We applied two CADD Phred-scaled score cutoffs at 20 and 30. For further comparison, we also included all rare (AF < 0.05%) missense variants to get the upper bound of estimated disease prevalence.

Bayesian estimation of allele frequencies and disease prevalence

The development of the disease prevalence estimator builds upon a previously published method and is detailed below.

Problem setting and prior assumptions

For a single variant, we assumed that the observed allele count follows a binomial distribution \(Binomial(2n_i, q_i)\), where \(n_i\) is the number of individuals genotyped at this position in the database and \(q_i\) is the true allele frequency of the variant.

Since the distribution of the observed allele count for a variant, conditioned on the allele frequency \(q_i\), is binomial, we introduced a conjugate prior for \(q_i\): \(q_i \sim Beta(v_{c:i \in c}, w_{c:i \in c})\), where \(v_{c:i \in c}\) and \(w_{c:i \in c}\) denote the prior parameters for variants belonging to category c, which are estimated using the method of moments based on all variant data provided in the ExAC database.13 We grouped all variants into eight categories: frameshift, splice acceptor, splice donor, stop gained, missense, untranslated region (UTR) (including 3′ and 5′ UTR), other exonic, and other variants. The allele frequencies for variants of a functional annotation are assumed to follow the same prior distribution across all genes.

In an additional analysis exploring possibly more informative priors, the CADD score was incorporated in the prior as score ranges in four groups: <5, 5–30, >30, and those without a score. In combination with the eight functional categories (mentioned above), we created a total of 32 categories with allele frequency priors, then calculated similarly across all genes.
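As a sketch of how the 32 prior categories might be keyed, the snippet below pairs each of the eight functional categories with one of the four CADD groups. The category labels and function names are hypothetical, and assigning scores of exactly 5 and exactly 30 to the middle group is an assumption, since the text gives only the ranges.

```python
from typing import List, Optional, Tuple

# Eight functional categories from the text (hypothetical label spellings).
FUNCTIONAL_CATEGORIES = (
    "frameshift", "splice_acceptor", "splice_donor", "stop_gained",
    "missense", "utr", "other_exonic", "other",
)
CADD_GROUPS = ("<5", "5-30", ">30", "no_score")


def cadd_group(score: Optional[float]) -> str:
    """Map a CADD Phred-scaled score (or None if unscored) to its group."""
    if score is None:
        return "no_score"
    if score < 5:
        return "<5"
    if score <= 30:  # assumption: both boundaries fall in the middle group
        return "5-30"
    return ">30"


def prior_category(functional: str, score: Optional[float]) -> Tuple[str, str]:
    """Return the (functional, CADD group) key that selects a variant's prior."""
    if functional not in FUNCTIONAL_CATEGORIES:
        functional = "other"
    return (functional, cadd_group(score))


# Every combination of the two groupings: 8 x 4 = 32 categories.
ALL_CATEGORIES: List[Tuple[str, str]] = [
    (f, g) for f in FUNCTIONAL_CATEGORIES for g in CADD_GROUPS
]
```

Each key would then carry its own pair of beta hyperparameters, estimated in the same way as for the eight functional categories alone.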

We used the method of moments to estimate the two hyperparameters \(v_{c:i \in c}\) and \(w_{c:i \in c}\) of the beta prior for the allele frequency \(q_i\). Specifically, we obtained these two parameters by solving the following system of equations:

$$\left\{ \begin{array}{lcl} \frac{v_{c:i \in c}}{v_{c:i \in c} + w_{c:i \in c}} & = & \widehat{\mu_c} \\ \widehat{\mu_c} & = & \frac{\sum_{j = 1}^p \frac{x_j}{2n_j}{\mathbb{1}}\left\{ j \in c \right\}}{\sum_{j = 1}^p {\mathbb{1}}\left\{ j \in c \right\}} \\ \frac{v_{c:i \in c}w_{c:i \in c}}{(v_{c:i \in c} + w_{c:i \in c})^2(v_{c:i \in c} + w_{c:i \in c} + 1)} & = & \frac{\sum_{j = 1}^p {\mathbb{1}}\left\{ j \in c \right\}\left( \frac{x_j}{2n_j} - \widehat{\mu_c} \right)^2}{\sum_{j = 1}^p {\mathbb{1}}\left\{ j \in c \right\}} \end{array} \right.$$

where p is the total number of variants in the reference panel, including both pathogenic and nonpathogenic ones, and \({\mathbb{1}}\left\{ j \in c \right\}\) is the indicator function, which equals 1 if variant j belongs to category c and 0 otherwise.
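The method-of-moments step has a closed-form solution, which the following minimal sketch illustrates for the variants of a single category. It matches the sample mean and variance of the observed allele frequencies to the mean and variance of a beta distribution; the function name and argument layout are hypothetical and are not taken from the released scripts.

```python
from typing import Sequence, Tuple


def beta_method_of_moments(
    allele_counts: Sequence[int],
    allele_numbers: Sequence[int],
) -> Tuple[float, float]:
    """Estimate beta hyperparameters (v, w) for one variant category.

    allele_counts  -- observed allele count x_j of each variant in the category
    allele_numbers -- number of genotyped alleles 2 * n_j at each variant
    """
    freqs = [x / an for x, an in zip(allele_counts, allele_numbers)]
    if len(freqs) < 2:
        raise ValueError("need at least two variants to estimate a variance")
    mean = sum(freqs) / len(freqs)
    var = sum((f - mean) ** 2 for f in freqs) / len(freqs)
    # A beta distribution requires 0 < var < mean * (1 - mean).
    if var <= 0 or var >= mean * (1 - mean):
        raise ValueError("sample moments are not compatible with a beta prior")
    # Solve mean = v / (v + w) and var = v * w / ((v + w)^2 * (v + w + 1)).
    total = mean * (1 - mean) / var - 1  # this is v + w
    return mean * total, (1 - mean) * total
```

For example, `beta_method_of_moments([1, 2, 3, 10], [1000] * 4)` returns a pair whose implied mean `v / (v + w)` equals the sample mean of 0.004.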

Posterior distribution of allele frequencies

The posterior distribution of the allele frequency \(q_i\), given the observed allele count \(x_i\) and the prior assumption on the allele frequency, is

$$\pi \left( q_i \mid x_i,2n_i \right) = \frac{\pi \left( x_i,2n_i \mid q_i \right)\pi (q_i)}{\int_0^1 \pi \left( x_i,2n_i \mid q_i^\prime \right)\pi \left( q_i^\prime \right)dq_i^\prime }$$
$$\pi \left( q_i \mid x_i,2n_i \right) = \frac{\binom{2n_i}{x_i}B^{ - 1}\left( v_{c:i \in c},w_{c:i \in c} \right)q_i^{x_i + v_{c:i \in c} - 1}\left( 1 - q_i \right)^{2n_i - x_i + w_{c:i \in c} - 1}}{\int_0^1 \binom{2n_i}{x_i}B^{ - 1}\left( v_{c:i \in c},w_{c:i \in c} \right)(q_i^\prime )^{x_i + v_{c:i \in c} - 1}\left( 1 - q_i^\prime \right)^{2n_i - x_i + w_{c:i \in c} - 1}dq_i^\prime },$$

where \(B^{ - 1}\left( v_{c:i \in c},w_{c:i \in c} \right)\) is the reciprocal of the beta function \(B(v_{c:i \in c},w_{c:i \in c})\), the normalizing constant that makes the density of \(Beta(v_{c:i \in c},w_{c:i \in c})\) integrate to 1. From the equation above, the posterior distribution of \(q_i\) is a beta distribution: \(Beta(x_i + v_{c:i \in c},2n_i - x_i + w_{c:i \in c})\). For pathogenic variants (from EGL and ClinVar) unseen in the population reference panel, we set \(x_i = 0\) and took \(n_i\) to be the corresponding sample size in the mixed population or the specific subpopulation.
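Because the beta prior is conjugate to the binomial likelihood, the posterior update reduces to adding counts to the prior parameters. The sketch below illustrates this; the function names are hypothetical, and the prior values in the usage note are chosen only for easy arithmetic.

```python
from typing import Tuple


def posterior_beta(x: int, n_individuals: int, v: float, w: float) -> Tuple[float, float]:
    """Posterior Beta(a, b) for an allele frequency.

    x             -- observed allele count (0 for an unseen pathogenic variant)
    n_individuals -- number of genotyped individuals, so 2 * n alleles
    v, w          -- beta prior hyperparameters for the variant's category
    """
    a = x + v
    b = 2 * n_individuals - x + w
    return a, b


def beta_mean(a: float, b: float) -> float:
    """Mean of Beta(a, b), the re-estimated allele frequency."""
    return a / (a + b)


def beta_variance(a: float, b: float) -> float:
    """Variance of Beta(a, b)."""
    return a * b / ((a + b) ** 2 * (a + b + 1))
```

With an illustrative prior of `v = 2.0`, `w = 98.0` and 50 genotyped individuals, a singleton (`x = 1`) gives a posterior of Beta(3, 197), whereas an unseen variant (`x = 0`) gives Beta(2, 198), whose mean is still greater than zero.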

Posterior estimation of disease prevalence

For monogenic rare diseases, the disease prevalence is \(D = [ {1 - \mathop {\prod }_i \left( {1 - q_i} \right)}]^2\), the probability that both copies of the disease gene carry at least one pathogenic variant. Because each \(q_i\) is small, we approximated the prevalence by \(D \approx ( {\mathop {\sum }_i q_i})^2\). Treating \(\mathop {\sum }_i q_i\) as approximately normal with posterior mean \(\mu\) and variance \(\sigma ^2\), the approximated posterior of the prevalence is a scaled noncentral chi-square distribution with one degree of freedom: using \(\hat D\) to denote the approximation \(( {\mathop {\sum }_i q_i})^2\), we get \(\frac{{\hat D}}{{\sigma ^2}} \sim \chi _1^2(\lambda )\) with \(\lambda = \frac{{\mu ^2}}{{\sigma ^2}}\). We used the expectation of this distribution, \((\lambda + 1)\sigma ^2\), as the prevalence estimator. The lower bound of the interval at confidence level \(1 - \alpha\) is \(F^{ - 1}\left( {\frac{\alpha }{2}} \right) \times \sigma ^2\), where \(F( \cdot )\) is the cumulative distribution function of this chi-square distribution; the upper bound is obtained similarly from \(F^{ - 1}\left( {1 - \frac{\alpha }{2}} \right)\). We used \(\alpha = 0.05\) to obtain the 95% confidence interval. Detailed derivation of equations can be found in Supplementary Methods.
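The estimator and its interval can be sketched with the standard library alone. The sketch rests on two labeled assumptions that are not spelled out in this section: the sum of allele frequencies is treated as normal, and its mean and variance are obtained by summing the moments of independent beta posteriors. Under the normal assumption, the chi-square quantiles can be computed directly on the prevalence scale from the normal distribution function. All function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist
from typing import Sequence, Tuple

_STD_NORMAL = NormalDist()


def sum_moments(posteriors: Sequence[Tuple[float, float]]) -> Tuple[float, float]:
    """Mean and variance of sum(q_i), assuming independent Beta(a, b) posteriors."""
    mu = sum(a / (a + b) for a, b in posteriors)
    var = sum(a * b / ((a + b) ** 2 * (a + b + 1)) for a, b in posteriors)
    return mu, var


def prevalence_cdf(d: float, mu: float, sigma: float) -> float:
    """P(S^2 <= d) for S ~ Normal(mu, sigma^2), i.e. S^2 / sigma^2 ~ ncx2(1, lambda)."""
    if d <= 0:
        return 0.0
    r = sqrt(d)
    return _STD_NORMAL.cdf((r - mu) / sigma) - _STD_NORMAL.cdf((-r - mu) / sigma)


def prevalence_quantile(p: float, mu: float, sigma: float) -> float:
    """Invert prevalence_cdf by bisection."""
    lo, hi = 0.0, (abs(mu) + 10 * sigma) ** 2  # the CDF is ~1 at the upper end
    for _ in range(200):
        mid = (lo + hi) / 2
        if prevalence_cdf(mid, mu, sigma) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def prevalence_estimate(
    posteriors: Sequence[Tuple[float, float]], alpha: float = 0.05
) -> Tuple[float, float, float]:
    """Return (point estimate, lower bound, upper bound) of the prevalence."""
    mu, var = sum_moments(posteriors)
    sigma = sqrt(var)
    point = mu ** 2 + var  # equals (lambda + 1) * sigma^2
    lower = prevalence_quantile(alpha / 2, mu, sigma)
    upper = prevalence_quantile(1 - alpha / 2, mu, sigma)
    return point, lower, upper
```

Multiplying the three returned values by one million expresses them per million, the unit used in the tables.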

Direct estimation of disease prevalence in genetic databases

For comparison, we also estimated disease prevalence by using the observed allele frequency of a pathogenic variant in genetic databases as the direct estimator for \(q_i\) (without the beta prior). More specifically, the disease prevalence can be estimated by

$$D_{direct} = \left[ {1 - \mathop {\prod }\limits_i \left(1 - \frac{{AC_i}}{{AN_i}} \right)} \right]^2$$

where \(AC_i\) is the allele count for variant i and \(AN_i\) is the corresponding allele number at that position. As above, for a given disease or a subtype, the product is taken over all identified pathogenic variants in the disease gene, where i is the index of those identified pathogenic variants.
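A direct translation of the formula above is given below as a minimal sketch; the function name and argument layout are hypothetical and do not reproduce the released scripts.

```python
from math import prod
from typing import Sequence


def direct_prevalence(allele_counts: Sequence[int], allele_numbers: Sequence[int]) -> float:
    """Direct prevalence estimate from observed allele frequencies AC_i / AN_i."""
    # Probability that one copy of the gene carries none of the pathogenic variants.
    carrier_free = prod(1 - ac / an for ac, an in zip(allele_counts, allele_numbers))
    # Both copies must carry at least one pathogenic variant.
    return (1 - carrier_free) ** 2
```

A variant with an allele count of zero contributes nothing to this estimate, so a gene with no observed pathogenic variants yields a prevalence of exactly zero, in contrast to the Bayesian estimator, where the prior keeps each posterior allele frequency above zero.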

The scripts for estimating recessive disease prevalence based on our Bayesian framework and also direct calculation are available at https://github.com/leklab/prevalence_estimation.

RESULTS

Prevalence estimates in LGMD2 subtypes are comparable with published values

The recessive LGMDs (LGMD2) are autosomal recessive diseases that can be caused by pathogenic variants in at least 24 genes.1 We applied our Bayesian method to nine subtypes of LGMD2 from 2A to 2L (Table 1). The gnomAD data set was used to identify putative and reported pathogenic variants in each disease gene. The disease prevalence estimates calculated by our Bayesian method were generally consistent with published prevalence estimates from epidemiological studies (Table 1), in particular for LGMD2A, LGMD2E, and LGMD2I. For other subtypes, our method produced a higher estimated prevalence, including LGMD2B, LGMD2D, and LGMD2L. These differences can be partly explained by the underdiagnosis of these late-onset or slowly progressive LGMD subtypes.14,15 In contrast, our disease prevalence estimation for subtype LGMD2C (0.12 per million) was notably lower than the lowest published value (1.3 per million). Genetic differences across regions would also contribute to discrepancies between our results and published estimators, since most epidemiology studies have been conducted in small regions, while the databases we used include individuals with diverse genetic backgrounds. Lastly, no comparison could be made for LGMD2F and LGMD2G because there are no published prevalence estimates.

Next, we applied our method to another genetic database, BRAVO, to estimate prevalence for the same nine LGMD2 subtypes. When applied to a different database, our method provided more robust results compared with direct prevalence estimation (see “Materials and Methods”) using genetic data. Prevalence estimates for six of nine subtypes estimated in BRAVO fell in the 95% confidence intervals (CIs) estimated from the gnomAD data. The other three subtypes (2A, 2D, and 2I) had an estimated prevalence close to the lower bounds of the corresponding 95% CI (Table 2). Applying the same method (either our Bayesian method or the direct way, see “Materials and Methods”) in two different databases yields much larger differences than applying two different methods in the same data set, indicating the database used is the greater influence, as opposed to the method. The large differences in results from different databases can be partly explained by the sampling biases and limited sample size in each genetic data set.

Table 2 Estimated disease prevalence (per million) in gnomAD and BRAVO for nine LGMD2s.

Including rare missense variants currently not reported as pathogenic increases prevalence estimates

The above prevalence estimates are limited to reported pathogenic and rare loss-of-function variants found in gnomAD and do not account for other unreported missense pathogenic variants that may be in gnomAD. When we included all rare missense variants (AF < 0.05%), not surprisingly the prevalence estimates increased dramatically (Table 3) compared with the results indicated above. This increased prevalence was proportional to the coding length of the gene, as larger genes will accumulate more rare variants by random chance.

Table 3 Prevalence estimated (per million) including more predicted pathogenic variants.

This analysis assumes all rare missense variants are pathogenic, which is likely not the case. We then applied the CADD16 method to classify the pathogenicity of rare missense variants. The CADD Phred-scaled cutoff scores of 20 and 30 were used to define pathogenicity, which respectively represent the top 1% and 0.1% of most deleterious substitutions predicted by the CADD method, i.e., the higher the score, the more likely a variant will be pathogenic. The published prevalence estimates still fell outside of the 95% CI calculated when missense variants with a cutoff score of 20 were included, while the more stringent cutoff score of 30 produced closer estimates (see Table 3). For example, with LGMD2E, the estimated prevalence using a cutoff score of 30 is 1.1 per million, similar to the published 0.7 or 0.86 per million, and is within the 95% CI (0.4 to 1.3) estimated when only considering rare loss-of-function variants and variants annotated as pathogenic in ClinVar or EGL. These results show that improved pathogenicity prediction methods are required to improve disease prevalence estimates.
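The correspondence between the two cutoffs and the quoted percentages follows from the Phred scaling of CADD scores, in which a scaled score of s corresponds to the top 10^(-s/10) fraction of ranked substitutions. A short sketch (the function name is hypothetical):

```python
def cadd_phred_to_top_fraction(phred: float) -> float:
    """Fraction of substitutions ranked at least as deleterious as this Phred-scaled score."""
    return 10 ** (-phred / 10)
```

Here `cadd_phred_to_top_fraction(20)` is approximately 0.01 (the top 1%) and `cadd_phred_to_top_fraction(30)` is approximately 0.001 (the top 0.1%).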

Comparison with epidemiological results in population stratified analysis

The majority of epidemiological studies estimating disease prevalence have been conducted in small regions, leading to varying results across publications. LGMD2A serves as an example, where the estimates vary greatly in two small regions of Italy (6.1 and 16.5 per million).17 Due to the majority of the published estimates being from European populations, we limited our analysis to the subpopulations of European (EUR), Finnish (FIN), and non-Finnish European (NFE) here; results of additional subpopulations are shown in Table 1.

After applying population stratification, estimated prevalence is more comparable with previously published results (see Table 1). For LGMD2A, the prevalence was estimated at 9.4 per million (95% CI: 7.1–11.8 per million) in the NFE population, matching the published value of 9.4 per million in northeastern Italy. However, after population stratification, the prevalence estimations for some subtypes diverged further from the published values. For LGMD2L, compared with the estimator (17.6 per million) in a mixed population, the estimator (27.3 per million) in the NFE population is even higher than the published prevalence (2.6 per million) in northern England.18 The much higher result could be caused by the elevated allele frequency of a founder variant, ANO5 NM_213599.2:c.191dupA,8 in the NFE population (0.21%) compared with the allele frequency in the mixed population (0.11%) in gnomAD. For subtypes only common in certain populations, the stratification can provide a more precise prevalence estimate. For example, the prevalence of 2G is estimated in East Asians (EAS) to be about 1.2 per million, while it is less than 0.05 per million in other populations (Table 1). This result suggests that varied genetic backgrounds can lead to population differences in disease prevalence estimates, which can be shown in results from both epidemiological studies and genetic databases.
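To illustrate how strongly a single founder variant can move the estimate, the sketch below computes only the homozygous contribution of one variant under Hardy–Weinberg equilibrium. This is a simplification for illustration: it ignores compound heterozygosity with other pathogenic variants and does not use the posterior allele frequencies, so it does not reproduce the estimates in Table 1. The function name is hypothetical.

```python
def homozygote_frequency_per_million(allele_frequency: float) -> float:
    """Expected homozygotes per million for one variant under Hardy-Weinberg equilibrium."""
    return allele_frequency ** 2 * 1_000_000
```

At the allele frequencies quoted above for ANO5 c.191dupA, this gives about 4.4 per million in the NFE population (0.21%) and about 1.2 per million in the mixed population (0.11%). Because prevalence scales with the square of allele frequency, roughly doubling the frequency more than triples this contribution.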

Inclusion of CADD in the prior and unseen pathogenic variants

We next performed two additional analyses specifically facilitated by using a Bayesian framework. CADD scores were used to categorize variants in combination with functional categories for updating allele frequencies of pathogenic variants (see “Materials and Methods”). After categorizing variants into smaller specific groups, disease prevalence was re-estimated for LGMD2 subtypes (Supplementary Table 2). Compared with results in Table 1, prevalence estimates were overall very similar with only small changes in subpopulations.

There are a number of reported pathogenic variants that were not observed in gnomAD (Supplementary Table 1) due to sampling and being ultrarare variants. Using a Bayesian framework these variants can be included in the prevalence estimates resulting in slightly higher estimates in the subpopulations (Supplementary Table 3). Furthermore, this can provide a nonzero estimate and confidence intervals in instances where no pathogenic variants are observed in the subpopulation (e.g., LGMD2G prevalence in the Ashkenazi Jewish subpopulation).

Estimating prevalence in well-studied diseases

To further confirm the reliability of our results, we also applied our method to three non-neuromuscular diseases: sickle cell disease,19,20 cystic fibrosis,21 and Tay–Sachs disease,22,23 and estimated their prevalence in the subpopulation where they were sourced. Known pathogenic and putative loss-of-function variants for the corresponding disease genes HBB, CFTR, and HEXA were extracted from gnomAD (see “Materials and Methods”) and our Bayesian method was used to calculate the posterior allele frequency distributions and an estimate of disease prevalence. The published estimates were within the confidence intervals for Tay–Sachs. For sickle cell disease, differences can be explained by HBB alleles associated with other β-hemoglobinopathies such as β-thalassemia.20 In addition, prevalence adjustment21 was required for cystic fibrosis as an early-onset life-shortening disease (Supplementary Material). Overall, taking this extra information into consideration, our prevalence estimates for these three diseases are similar to published figures, indicating that our method is robust across multiple autosomal recessive diseases.

DISCUSSION

Through the application of a Bayesian method to large publicly available genetic databases, we have determined robust prevalence estimations for LGMD2 subtypes that are consistent with published figures from epidemiological studies. Applying our method to another genetic database, BRAVO, confirmed its robustness, since most prevalence estimates from BRAVO fell within the confidence intervals estimated from gnomAD. For further evaluation, we estimated prevalence for three non-neuromuscular diseases using the method and generated similar values to published results.

Building upon a previous Bayesian prevalence estimation method,12 we estimated LGMD2 prevalence by simultaneously considering more than one variant using much larger databases, which mitigates underestimation of disease prevalence. We also considered functional annotation when updating allele frequency for each variant. Utilization of the largest genetic databases available also made our estimation more robust, since databases with insufficient sample size would lead to increased absence of rare pathogenic variants. Although we have extended the previous Bayesian method, similar challenges still remain. First, there is an assumption of Hardy–Weinberg equilibrium, which can deviate when using aggregated population data. Specifically, the gnomAD data set contains consanguineous populations inferred by their higher inbreeding coefficients and also population stratification due to aggregating subpopulations into large continental groups.13 Also, the classification and reporting of rare variants as pathogenic has been challenging despite established guidelines.11 Although promising, our results, presented in Table 3, suggest that further improvement in computational pathogenicity prediction methods for rare missense variants is required and overall the false positives are still too high for these to be used for prevalence estimates.24,25 Lastly, pathogenic variants are assumed to be independent of each other and therefore this method does not account for rare variants that occur on the same haplotype (i.e., linkage disequilibrium).

In addition, there are assumptions that may affect our results in the context of LGMD prevalence estimates. First, we assumed pathogenic variants observed in compound heterozygous and homozygous states have the same severity, which may result in differences compared with published values. For example, the c.191dupA founder variant in ANO5 is observed as homozygous in one individual in gnomAD, suggesting later onset and/or a much milder muscle phenotype associated with being homozygous for this variant. Second, the analysis is limited to single-nucleotide variants (SNVs) and small insertions and deletions. Large duplications and deletions account for some of the pathogenic variants discovered in neuromuscular disease genes, with some having higher frequencies due to founder effects, such as the exon 55 deletion in NEB26 associated with autosomal recessive nemaline myopathy. Furthermore, we assume that all pathogenic variants for a subtype have been identified in the database we used here, which is likely not true (Supplementary Table 1), and may lead to an underestimate of disease prevalence. Conversely, the current analysis does not take into account the situation where multiple disorders are caused by variants in the same genes. For example, in the case of FKRP, the prevalence estimate includes both LGMD and Walker–Warburg syndrome variants,27 and thus can overestimate the LGMD prevalence. As variant databases become more comprehensive, this information can be accurately extracted to mitigate this overestimation. Lastly, we have only estimated prevalence in this study for recessive LGMD2 disorders where compound heterozygous or homozygous variants cause disease. Recently, several heterozygous variants in genes associated with LGMD2 subtypes have been identified that can act dominantly, such as a 21-bp deletion in CAPN3.1 The method we developed here is limited, however, for the estimation of dominant LGMD prevalence since dominant variants are expected to be largely absent from the population databases ExAC and gnomAD, while any present may be further complicated by reduced penetrance. Taken together, the general and LGMD-specific assumptions explain some of the discrepancies between published epidemiology reports and the results presented in our study.

In contrast to published prevalence estimates from epidemiology studies, our results based on allele frequencies obtained from population reference databases are not impacted by public policy and are not health system–specific to countries or regions. However, our results are also affected by different genetic backgrounds across regions (LGMD2L: 17.63 per million in the global population and 27.33 per million in the non-Finnish European population). Additionally, differences in sample sizes of various subpopulations in the genetic database used would also affect the identification of causal variants. Although the sample size of the database used here is the largest available, some rare pathogenic variants are likely to still be missing due to an insufficient sample size, which further leads to underestimation of prevalence, especially for rarer subtypes. The underestimated prevalence of LGMD2C (0.12 per million compared with 1.3 per million) may be caused in particular by the absence of various pathogenic variants in the database used. The Bayesian framework allowed for reported pathogenic variants unseen in gnomAD to be included in the prevalence estimates; however, it did not result in much difference except in certain subpopulations (Supplementary Table 3). Future work may include estimating allele frequencies for the absent pathogenic variants by incorporating the UnseenEst method, which was successfully applied to estimate unseen variants in ExAC.28

Overall, our method provides a generalizable and robust framework to estimate disease prevalence for recessive forms of LGMD and can be adapted to estimate prevalence for other recessive diseases. By utilizing a Bayesian framework on data from the largest population reference panels (gnomAD and ExAC), this method can obtain more refined allele frequencies for rare pathogenic variants and include additional pathogenic variants from other disease databases to achieve improved disease prevalence estimates. This includes a framework for estimating the allele frequency priors, where functional annotation and CADD score groupings were used as an example. Future work will involve exploring more informative priors to improve estimates. Lastly, we have made our scripts and data available (see “Materials and Methods”), which can be easily adapted to other recessive disease genes of interest to calculate reproducible and robust estimates.

Published prevalence estimates for recessive LGMD are generally from epidemiological research studies, which are vulnerable to inaccuracies associated with delays in diagnosis or misdiagnosis, variation in diagnostic criteria used, and biases introduced by the specific population sampled.29 By applying a Bayesian method to a genetic database, our method provides robust disease prevalence estimates for recessive LGMD from the genetics perspective.

URLs

gnomAD: http://gnomad.broadinstitute.org/downloads

Emory Genetics Laboratory database: http://www.egl-eurofins.com/emvclass/emvclass.php

ClinVar database (the version used here is 20180429): ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/

ExAC database: ftp://ftp.broadinstitute.org/pub/ExAC_release/release1/manuscript_data/

BRAVO database: https://bravo.sph.umich.edu/freeze5/hg38/