Introduction

Globally, alcohol use disorders (AUD) are among the top causes of morbidity and mortality [1]. Numerous factors predispose to the risk of developing AUD. Genetic factors contribute substantial risk to the etiology of AUD [2], and the heritability has been estimated to be ~0.5 in twin studies [3]. Genome-wide association studies (GWAS) of AUD have been completed in multiple populations including European (EUR), African, Latin American and Asians [4,5,6,7,8,9,10,11,12,13]. To date, the largest GWAS of problematic alcohol use (PAU, a proxy for AUD) in 435,563 EUR subjects identified 29 independent risk variants [14]. In contrast, the largest GWAS of AUD in East Asians included less than 1% of this number: 3381 subjects (533 cases) [12]. Genetic architecture often differs between populations; polygenic risk prediction between populations, though sometimes useful, often is not transferable [15]. Thus, it is critically important that non-EUR populations be investigated to permit inferences to be made about these ancestral populations, which represent the majority of the world’s people [16, 17].

Because of the limited sample available and consequent lack of power for GWAS, little is known about the genetic architecture of AUD in East Asians. The most consistent loci identified are ADH1B (alcohol dehydrogenase 1b) and ALDH2 (aldehyde dehydrogenase). Candidate studies of ADH1B*rs1229984 and ALDH2*rs671 in East Asians showed strong associations between these functional variants and alcohol dependence (AD) [18, 19]. The first GWAS of AD in a Chinese sample was conducted in 102 male cases and 212 male controls; rs3782886 in the ALDH2 region was genome-wide significant [5] despite the very small sample size. The first GWAS of AD in Thai samples included 1045 subjects and identified rs149212747 in the ALDH2 region as the lead variant [6]. The latest GWAS of AUD in a Chinese cohort identified both ADH1B and ALDH2 genes as risk loci [12]. However, only a small proportion of the variance was explicable by variants in these genes. Larger samples are required to identify more risk variants to provide a better understanding of the genetic architecture in Asian populations.

Here we conducted a GWAS that combined five datasets from previously published cohorts and newly genotyped subjects from Thai and MVP cohorts. In total, 13,551 subjects of East Asian ancestry were analyzed, including 2254 AUD cases. We then analyzed the resulting AUD PRS in an independent East Asian sample for associations with 26 phenotypes from surveys or ICD diagnoses. This GWAS of AUD is the largest to date in East Asians.

Methods and materials

Datasets

Thai METH–GSA

As described previously [6], subjects were recruited in two stages for studies of the genetics of methamphetamine dependence (Thai METH). For both stages, subjects were recruited in Bangkok and assessed using the Thai version of the Semi-Structured Assessment for Drug Dependence and Alcoholism [20]. The IRB protocols were approved by both the Chulalongkorn University (Thailand) IRB and the Yale University Human Research Protection Program. All subjects provided written informed consent prior to their research participation.

The first stage included methamphetamine users hospitalized between 2007 and 2011 for 4 months of residential drug treatment (Thai METH–GSA, Table 1) [21]. DNA samples were genotyped on the Illumina (San Diego, CA) Global Screening Array (GSA) which includes ~640 K SNPs. Among the 863 genotyped subjects, we removed those with sample genotype call rate <0.9, mismatched genotypic and phenotypic sex, or excess heterozygosity rate [6]. Unlike for our prior report, here we retained related subjects and applied linear mixed models (LMM) to correct for relatedness (see below). SNPs with genotype call rate ≥0.95, minor allele frequency (MAF) ≥0.01, and Hardy–Weinberg equilibrium (HWE) p value >10−6 were kept for imputation. Imputation was done by IMPUTE2 [22] with 1000 Genome project phase 3 (1KG) data [23] as reference. SNPs with imputation INFO score ≥0.8, best-guess genotype call rate ≥0.95, MAF ≥0.01and HWE p value >10−6 were retained for association analyses. Principal component analysis (PCA) was performed for the remaining subjects using EIGENSOFT [24, 25]. In contrast with our previous study, here we used DSM-IV AD to define case status, rather than the DSM-IV AD criterion count to match the design in other cohorts. This yielded 127 cases and 405 were exposed controls. LMM implemented in GEMMA [26] were used to test association, with age, sex, and the first ten PCs as covariates.

Table 1 Sample characteristics.

Thai METH–MEGA

Second-stage subjects (N = 3,161; the Thai METH-MEGA sample, Table 1) were recruited from 2015 to 2020 [6]. DNA samples were genotyped using the Illumina Multi-Ethnic Global Array (MEGA) which includes ~1.78 M SNPs. We removed subjects with sample genotype call rate <0.95, sex mismatch, excess heterozygosity rate, or that were duplicates. SNPs with genotype call rate ≥0.95, or MAF ≥0.01, or HWE p value >10−6 were retained for imputation as with the Thai METH–GSA sample. The same imputation processes and post-imputation quality controls (QC) were applied. We included 794 cases and 1576 alcohol-exposed controls in the association analysis, which used GEMMA and age, sex and the first ten PCs as covariates.

MVP–EAA

The Million Veteran Program (MVP) is an ongoing observational cohort study and mega-biobank supported by the U.S. Department of Veterans Affairs [27, 28]. In October 2020, MVP released the latest genotype data (Release 4), which included 658,582 subjects. MVP subjects were genotyped using an Affymetrix Axiom Biobank Array with ~687 K markers. QC was first done by the MVP Release 4 Data Team and included the removal of duplicate DNA samples and those with sex mismatch, excessive heterozygosity, or a genotype call rate <0.985. We ran PCA for the MVP subjects with 1KG as the reference, Euclidean distances between each participant and the centers of the five reference populations were calculated using the first ten PCs, with each participant assigned to the nearest reference population. For subjects grouped as East Asian Americans (EAA), we ran a second PCA and removed outliers with PC scores >6 standard deviations from the mean on any of the ten PCs (as we did before [8]), yielding in 7364 EAAs. Imputation [22] was performed specifically for the EAAs using the 1KG as reference. SNPs with genotype call rate ≥0.95, MAF ≥ 0.01, HWE > 1 × 10−6, and imputation INFO ≥ 0.8 were retained for analysis. As for our prior study in EUR [14], subjects with ≥2 outpatient or ≥1 inpatient International Classification of Diseases (ICD) codes for AUD were defined as cases (N = 701, Table 1) and subjects with no AUD ICD code as controls (N = 6254). BOLT-LMM [29] was used to correct for relatedness, with age, sex, and the first ten PCs as covariates.

Han Chinese–Cyto

This first GWAS of AD, flushing response, and maximum daily drinks consumed in a Han Chinese family sample [5] used the Illumina Cyto12 array containing ~300 K SNPs (Table 1). Whereas the cohort was not imputed in the original report, we re-analyzed the data and imputed the SNPs for an AD GWAS. Subjects with genotype call rate <0.95, duplicated DNA samples, mismatched sex, or excessive heterozygosity were removed, resulting in 511 subjects for imputation. Imputation used IMPUTE2 and 1KG reference, SNPs with MAF <0.01, genotype call rate <0.95, HWE p value <1 × 10−6, or imputation INFO < 0.8 were removed from further analyses. Due to the drinking practices and characteristic of this particular population that result in few AD cases in females, only males were included in this analysis. After QC, 99 DSM-IV-diagnosed male AD cases and 214 male alcohol-exposed controls were analyzed using GEMMA to correct for relatedness, with age, sex and ten PCs as covariates.

Han Chinese–GSA

The second case-control AD GWAS in Han Chinese [12] included 533 cases and 2848 alcohol-exposed controls who were genotyped using the GSA array (Table 1). Here we used the summary statistics from previous study.

Meta-analysis

Using association analyses or summary statistics for each of the five cohorts, effective sample-size-weighted meta-analysis was performed using METAL [30]. SNPs present in only one cohort or in less than 15% of the total samples were removed (6.8 million SNPs remained). To define lead variants, the meta-analysis summary data were clumped by LD with r2 < 0.1 in a 2500-kb window, using 1KG East Asians as the LD reference. For the two lead SNPs in the ADH1B gene region (rs1229984 and rs1814125), we performed conditional analysis [31] for rs1814125 conditioning on rs1229984 to test if its association is independent from rs1229984. Regional association plots were generated using LocusZoom v1.4 [32] with reference LD calculated from corresponding 1KG populations. We converted the effect sizes of lead SNPs from the LMM to odds ratios (OR) for comparison and further investigation of cohort heterogeneity [33]. This method takes sample prevalence, effect size from LMM, and allele frequency as input. We also did meta-analyses for the lead SNPs using inverse variance-weighted meta-analysis using METAL and the converted log(OR) as input, for comparison.

Polygenic risk scores

Target dataset

We requested and downloaded dbGaP (phs000788.v2.p3) data from the Kaiser Permanente Research Program on Genes, Environment, and Health Genetic Epidemiology Research on Adult Health and Aging (GERA) cohort. This large and ethnically diverse cohort contains genotype data from 5182 self-reported Asians using a custom Affymetrix Axiom array [34]. All subjects completed a broad written consent.

Imputation

Subjects with mismatched sex or genotype call rate <0.95 were removed. The genomic build was transferred from 36 to 37 using LiftOver [35]. As we did for MVP, we ran PCA for the 5182 Asian subjects using the 1KG as reference, clustering them into different groups. A second PCA among Asians was used to remove outliers, resulting in 4464 genetically classified East Asians for imputation. Imputation was performed using IMPUTE2 and 1KG reference, SNPs with MAF <0.01, genotype call rate <0.95, HWE p value <1 × 10−6, or imputation INFO < 0.8 were removed from further analysis.

Target phenotypes

Two sources of phenotypes are included in this study. The first is survey data on physical observations, lifestyle and environment, including phenotypes such as BMI, general health, physical activity, alcohol use, smoking status and pack years. The second is ICD-9-CM disease and conditions measures. Participant were coded as cases if there were at least two diagnoses in a disease category. Binary phenotypes with less than 100 cases were removed from analyses. See Table 2 for details of the target phenotypes.

Table 2 Tested phenotypes in GERA and association results with AUD PRS.

Polygenic risk scoring and association

PRS-CS [36] was used to infer posterior effect sizes of SNPs using GWAS summary statistics for AUD from this study, and an external East Asian LD reference panel (generated by the authors of PRS-CS using the 1KG East Asian reference). We used PLINK v1.9 [37] for polygenic risk scoring in the GERA East Asian samples. GEMMA was used to analyze associations between the PRS and target phenotypes, accounting for relatedness and correcting for age (in 5-year categories), sex and the first ten PCs. Bonferroni correction was applied such that associations with p value <0.05/26 = 1.92 × 10−3 are considered significant.

Additional downstream analyses

We used LD score regression [38] to estimate the SNP-based observed scale heritability of AUD using 1KG East Asians as the LD reference. We also investigated the trans-ancestry genetic correlation between this study sample and PAU in EUR populations using Popcorn, a method that uses only summary-level data from GWAS while accounting for LD [39]. Trans-population meta-analysis between this study and PAU in EUR was conducted using METAL. Multi-trait analysis [40] was performed, which combined data from this study with excessive alcohol consumption defined as weekly intake >150 ml of alcohol for ≥6 months from the Taiwan Biobank [41].

Results

Genome-wide association and meta-analyses

As in our previous study of AUD in East Asians [12], in a meta-analysis here of 2254 cases and 11,297 controls, we confirmed two loci that were significantly associated with AUD (Table 1 and Fig. 1). One locus is on chromosome 4q23 and includes multiple alcohol dehydrogenase genes. After LD clumping, there are two lead SNPs in this locus. The first is rs1229984 (Arg48His, p = 3.35 × 10−17, Fig. 2a) in ADH1B (Alcohol Dehydrogenase 1B (Class I), Beta Polypeptide), the second is rs1814125 (p = 2.14 × 10−10) near ADH1C. Conditional analysis indicated that rs1814125 is not independent from rs1229984. For comparison, we also looked up the association of rs1229984 in other populations. Rs1229984 is also associated with PAU in European populations [14] (Fig. 2b). In African Americans from MVP, rs122994 is nominally significantly associated with AUD while another coding variant, rs2066702, is the lead SNP [8] at ADH1B (Fig. 2c). Another locus is a long region with high LD on chromosome 12 for which there is positive selection in East Asians [42], which includes ALDH2 (Aldehyde Dehydrogenase 2) and BRAP (BRCA1 associated protein) genes. The lead SNP is rs3782886 (p = 1.68 × 10−29), a coding variant in the BRAP gene. The previously reported functional coding variant rs671 in ALDH2 is the second most significant SNP (p = 2.70 × 10−28); it is not independent from rs3782886. No other independent associations were detected in this study. There are allele frequency differences among cohorts for these two lead SNPs, and moderate heterogeneity of effect sizes (the converted ORs) detected by the IVW meta-analyses (Fig. S1). The heterogeneity p values are 6.17 × 10−13 for rs1229984 and is 2.63 × 10−5 for rs3782886 by IVW meta-analysis, justifying the use of effective sample size-weighted meta-analysis.

Fig. 1: Association results for AUD meta-analyses.
figure 1

a Manhattan plot for AUD, ncase = 2254, ncontrol = 11,297. Effective sample size-weighted meta-analyses were performed using METAL. Red line indicates genome-wide significant after correction for multiple testing (p < 5 × 10–8), blue line indicates suggestive significant (p < 1 × 10–5). b QQ plot for AUD.

Fig. 2: Regional Manhattan plots for the top SNPs.
figure 2

a Regional plot for rs1229984 in East Asians. b Regional plot for rs1229984 in European populations in a previous study (Zhou et al. [14]). c Regional plot for rs1229984 in African Americans from a previous MVP study (Kranzler et al. [8]) where rs1229984 is nominally significant, rs2066702 is the lead variant. In total, 500 kb in the upstream and downstream of rs1229984 were presented in ac. d Regional plot for rs3782886 in East Asians. Given the high LD in this region, 1 Mb from both sides were extended.

Polygenic risk score for AUD

We calculated PRS for AUD in an independent East Asian cohort from the GERA cohort. We tested 26 phenotypes from survey and ICD-9-CM diagnosed conditions for association with the AUD PRS (Table 2). As expected, AUD PRS is significantly associated with alcohol consumption as measured in days per week of drinking (beta = 0.43, SE = 0.067, p = 2.47 × 10−10). Also, AUD PRS is nominally significantly associated with pack years of smoking (beta = 0.09, SE = 0.05, p = 4.52 × 10−2) and ever vs. never smoking (beta = 0.06, SE = 0.02, p = 1.14 × 10−2), but these associations did not survive Bonferroni correction. None of the other traits in this small target sample were associated with AUD PRS.

Additional downstream analyses

SNP-based heritability of AUD was estimated to be 0.11 (SE = 0.07), which is not a significant estimate (probably due to the limited sample size). Genetic correlation between AUD in East Asian and PAU in European samples is 0.62 (SE = 0.23, p = 8.42 × 10−3), showing moderate trans-ancestry genetic correlation. None of the trans-population meta-analysis between this study and PAU in EUR or multi-trait analysis with excessive alcohol consumption from Taiwan Biobank identified additional association signals.

Discussion

We collected data from 13,551 subjects with East Asian ancestry to conduct the largest meta-analysis to date for an alcohol-related trait in this population (quadruple the previous largest reported sample). We detected association signals at the ADH1B and ALDH2 loci with substantially stronger statistical significance than has been seen previously, but did not identify any novel risk loci. This is mostly consistent with observations from other studies, where GWAS of alcohol-related traits with sample sizes in this range are generally underpowered to detect multiple replicable variants [4, 7,43,44,45,46]. In EUR and African-Ancestry populations, the first and strongest associations detected have been at ADH1B. The ALDH2 association is, to this date, unique to East Asians (rs671, a well-known functional ALDH2 variant, is apparently unique to certain Asian populations [19]). It is a common issue for complex traits like AUD that many variants contribute to the heritability, each with a small effect size [47, 48]. Missing ancestral diversity in human genetic studies is a critical issue and recruitment of non-EUR subjects is crucial to addressing this disparity [16, 17]. The identification of ALDH2, which as noted is unique to Asians, exemplifies that there are differences in the genetic architecture of AUD between Asians and, for example, EUR, making well-powered investigations in this population an important scientific issue. Beyond identifying ADH1B and ALDH2 with greater statistical significance than previous studies, the present investigation extends prior findings in several ways, including by examining the utility of the AUD PRS derived from this meta-analysis in an independent cohort of 4,464 East Asians and testing the association between the AUD PRS and alcohol, smoking, and other traits.

The two genes implicated—ADH1B and ALDH2—are involved in ethanol metabolism [48]. ADH1B encodes an alcohol dehydrogenase that oxidizes alcohol to acetaldehyde, which is then oxidized to acetate by aldehyde dehydrogenases, including that encoded by ALDH2. This is the major metabolic pathway for ethanol metabolism but other genes are involved as well. For example, in the first step, ADH1C, ADH4 and ADH7, which map to the same chromosome 4 gene cluster as ADH1B, encode proteins that perform similar biological functions under certain conditions, ALDH1A1 and ALDH1B1 similarly encode proteins with roles that are sometimes overlapping with that of ALDH2 [49]. Given the importance of other genes in the metabolic pathway, lead variants in genes other than ADH1B*rs1229984 (EUR and Asian) and ALDH2*rs671 (Asian) have been reported [6, 8, 12, 14]. Some of these associations are supported by conditional analyses [8, 14], and some appear to be variants that “hitchhike” with rs1229984 or rs671 due to their strong LD. Here, conditional analyses identified only one lead variant at each locus: rs1229984 (p = 3.35 × 10−17) in the AHD1B region and rs3782886 (p = 1.68 × 10−29) in the ALDH2 region. The high LD between rs3782886 and rs671 (r2 = 0.98) makes it difficult to distinguish the real causal variant, though biochemical analysis favors rs671 (reviewed in [48]), which is nearly a null variant. A single copy of the rs671*T allele renders the aldehyde dehydrogenase protein product nearly inactive and it is also more rapidly degraded, which causes flushing in East Asians and other associated symptoms that are protective against heavy drinking and AUD [50].

Rs3782886 in the BRAP gene (breast cancer suppressor protein (BRCA1)-associated protein) has been associated with many traits in East Asians, include alcohol-related traits [41, 51], myocardial infarction [52], and a biochemical trait—alanine aminotransferase level [53]. Some or all these associations could be due to the high LD with rs671 (as in this study), or reflect effects on activity of the metabolic pathway or cerebral cortical neurogenesis (argued in [41]). We would suggest that the different lead variants (rs671 or rs3782886) in this high LD region could reflect uncertainty introduced by different SNP arrays, imputation processes, association analyses, or random variation in comparatively small samples. More data are needed to ascertain the true causal variant (or variants) despite the previous support and mechanistic appeal of rs671.

We used additional analyses to explore the genetic architecture of AUD in East Asians. The SNP-based heritability estimate was very low with a large standard error (SE), indicating a lack of statistical power. Moderate genetic correlation (rg = 0.61, SE = 0.23, p = 8.42 × 10−3) was detected between the main meta-analysis of this study and PAU in EUR populations, indicating shared genetic architecture across ancestries. However, the trans-population meta-analysis in which PAU in EUR was added did not detect any novel signals, probably due to the limited power in this study. Multi-trait analysis combining this study sample and excessive alcohol consumption from the Taiwan Biobank also identified no novel variants. Thus, additional study samples of East Asian ancestry are needed to provide adequate power for GWAS of AUD in East Asians.

Since it is a genetically complex trait, we expect that there are many variants that contribute to the genetic risk of AUD, consistent with findings in EUR [14]. Polygenic risk score analysis is a powerful tool for the application of GWAS results to investigate associations with traits of interest, which has been used widely in studies to test the association with AUD or related phenotypes in target cohorts [7, 8, 12, 14]. Here, we analyzed AUD PRS from our meta-analysis in an independent East Asian cohort from GERA, a US cohort collected to facilitate research on the genetic and environmental factors that affect health and disease [34]. We tested the association between AUD PRS and 26 phenotypes in 4464 subjects of East Asian ancestry. AUD PRS was significantly associated with alcohol consumption as measured using days of drinking per week (see Table 2), and nominally significantly associated with pack years of smoking and ever vs. never smoking, consistent with the shared genetic architecture of AUD and alcohol and (possibly) smoking traits in East Asians. These same, or closely similar, associations, have been well established in EUR [8, 14]. These was no association detected between AUD PRS and other diseases or conditions in this study.

This study has limitations, the most important of which is the sample size, which despite being the largest reported so far for East Asian provides limited statistical power. Second, the phenotypes among the different study samples are not identical, with AUD diagnosed as ICD-9/10 codes in MVP and DSM-IV AD in other cohorts. This analytic approach is supported by the high genetic correlation between AUD and AD in EUR, which is estimated to approach 1.0 [14]. Third, some cohorts used alcohol-exposed controls, and others used unscreened controls (i.e., the MVP). Controls with demonstrated exposure to alcohol are ideal, but such exposure is commonplace in all the populations studied. Finally, although all of the cohorts are of East Asian ancestry, there are population differences among cohorts that increase heterogeneity and reduce power [54, 55]. These include cultural or environmental differences that affect trait prevalence (e.g., drinking practices), and geographical differences that introduce genetic differences (Fig. S1).

In conclusion, we conducted a GWAS of AUD in 13,551 East Asian subjects, in which we confirmed the two previously known risk loci and applied the AUD PRS in an independent cohort. Despite a large increment in sample size over the previous largest Asian-population GWAS, the power remains an important limitation. Accordingly, we will continue to recruit more East Asian subjects for alcohol studies and urge other investigators to do the same.