Introduction

Colorectal cancer (CRC) is one of the most common malignancies worldwide1. Inherited genetic factors play an important role in the development of CRC2. Since 2007, genome-wide association studies (GWAS) have identified over 200 common genetic variants independently associated with CRC risk3,4,5,6,7. These GWAS, however, typically only reported the most significantly associated variant (the lead variant) at each risk locus. Statistical fine-mapping analyses of known risk loci can identify additional association signals independent of the lead variant.

Approximately 90% of GWAS-identified risk variants for CRC are located in noncoding or intergenic regions, and target genes for most of these risk variants remain unknown. Well-powered fine-mapping analyses, particularly those using data from multi-ancestry populations, can facilitate the identification of credible causal variants (CCVs) in each region. Previous genetic studies have provided strong evidence that regulatory variants in linkage disequilibrium (LD) with GWAS-identified risk variants drive the associations of genetic variants with cancer risk by modulating the expression of susceptibility genes8,9,10,11. Therefore, integrating functional genomic data to interrogate CCVs in each independent risk-associated signal could help to identify putative causal variants and target genes for CRC risk. Herein, we conducted large trans-ancestry fine-mapping analyses of all currently known CRC risk regions, using GWAS data from 100,204 CRC cases and 154,587 controls of East Asian and European ancestry, to identify independent association signals and their target genes for CRC risk.

Results

Identification of independent association signals with CRC risk

We conducted fine-mapping analyses using GWAS summary statistics from 100,204 CRC cases and 154,587 controls (73% European and 27% East Asian ancestry) (Fig. 1, Supplementary Data 1). In our recent trans-ancestry meta-analysis of GWAS, we identified 205 genetic variants independently associated with CRC risk7. We aggregated regions flagged by these variants into 143 risk regions, each containing at least a 1 Mb interval centered on the most significant association (Supplementary Data 2). Among them, 40 regions harbor at least two reported independent risk associations. All risk regions were autosomal, except the one at Xp22.2. For subsequent analyses, we focused on the 142 regions located on the autosomes.

Fig. 1: Schematic diagram of the study design.

We conducted fine-mapping analyses using GWAS summary statistics from 100,204 cases and 154,587 controls. All 205 genetic variants were aggregated to 143 risk regions containing at least a 1 megabase (Mb) interval centered on the most significant association. This study focused on 142 risk regions located on the autosomes. In forward stepwise conditional analysis, we included common variants (minor allele frequency (MAF) > 0.01) with associations at P < 0.05 in both populations for the trans-ancestry analysis and with associations at P < 1 × 10−4 in each population for ancestry-specific analysis. The threshold of conditional P < 1 × 10−6 was used to determine independent risk-associated signals. For credible causal variants (CCVs) for each independent signal, we conducted in-silico analyses with functional genomic data generated in CRC-related tissues/cells and colocalization of expression/methylation quantitative trait loci (e/mQTL) with GWAS signals to identify putative target genes for CCVs using the Summary-data-based Mendelian Randomization (SMR) approach.

We used forward stepwise conditional analyses to identify independent association signals in each region in each population, conditioning on the most significant association from the trans-ancestral summary statistics (Supplementary Fig. 1, Methods). We then meta-analyzed the conditioned data using the fixed-effects inverse variance weighted model. We considered the threshold of conditional P < 1 × 10−6 to determine independent significant associations to balance both Type 1 and 2 errors, as recommended by a previous fine-mapping study in breast cancer12. At this threshold, we identified 171 independent association signals in 122 regions (Fig. 2, Supplementary Data 3). To identify possible ancestry-specific association signals, we conducted similar analyses using only summary statistics from each population, conditioning on the ancestry-specific most significant association. Using the same threshold, we identified 198 and 45 independent association signals in European and East Asian descendants, respectively (Supplementary Data 4 and 5). Of them, 60 signals in European and 7 in East Asian were not detected in the trans-ancestry analysis above, suggesting them as potential ancestry-specific risk signals (Fig. 2).

Fig. 2: Independent association signals for colorectal cancer risk.

Numbers of fine-mapping regions and numbers of independent association signals identified through forward stepwise conditional analyses. The second bar for “Trans-ancestry”, “European” and “East Asian” also shows the number of regions with 1, 2, or 3+ signals per region. The green color indicates the number of independent association signals previously reported or not yet reported. The blue color indicates the number of independent association signals in each risk region.

In total, we identified 238 independent association signals from either trans-ancestry or ancestry-specific analyses at these 142 regions (Fig. 2). A total of 94 regions (66.2%) contained only a single association signal, while the remaining 48 regions (33.8%) contained multiple independent association signals. Among the 238 independent association signals, 191 signals had lead variants that were correlated with previously GWAS-reported risk variants7 (LD r2 > 0.1 in either the East Asian or European-ancestry population). The remaining 47 independent signals (19.7%) have not been previously reported, including 18 from trans-ancestry, 28 from European-specific, and one from East Asian-specific analyses (Fig. 2, Table 1). Among these 47 signals, 31 demonstrated significant associations with conditional P < 1 × 10−7, including 28 signals that reached genome-wide significance.

Table 1 Independent association signals uncovered at known CRC risk loci in conditional analyses (conditional P < 1 × 10−6)

Identification of credible causal variants (CCVs) for independent association signals

To identify CCVs for each independent association signal, we conducted conditional analysis with adjustment of the lead variants for other signals in the same risk region. We conducted this analysis for trans-ancestral independent signals separately for each population to account for differences in the LD structure and then meta-analyzed conditioned results. Using a similar approach conducted in breast cancer12, we defined variants as CCVs if they satisfied conditional P values within two orders of magnitude of the most significant association, conditioning on all other independent association signals. We identified a total of 5741 CCVs for the 238 signals, with the number of CCVs per signal ranging from 1 to 249 (median: 11 CCVs per signal) (Supplementary Data 6). For 28 risk signals, only a single CCV was identified, suggesting that these CCVs are likely to be the causal variants for these signals (Table 2).

Table 2 Independent association signals with a single CCV

For the 138 independent association signals identified in both trans-ancestry and European-ancestry specific analyses (Supplementary Data 7), trans-ancestry analyses identified smaller sets of CCVs (mean = 23.2, median = 8.5) than European-ancestry specific analyses (mean = 31.08, median = 15) (paired Wilcoxon test, P = 4.9 × 10−7). Interestingly, a single CCV was identified for 10 signals in the trans-ancestry analysis, whereas multiple CCVs were identified for the same signals in the European-ancestry specific analysis, highlighting the value of using multi-ancestry data to reduce the number of CCVs in fine-mapping analysis. For instance, signal 1 in region_42 included 16 CCVs in the European set (lead variant: rs41302867), but only one variant in the trans-ancestry set (rs9379084). The variant rs9379084 is a predicted-deleterious missense variant (p.Asp1171Asn) of the RREB1 gene, which plays a regulatory role in Ras/Raf-mediated cell differentiation13, a pathway well known to be implicated in CRC development.

Identification of target genes for CCVs

Of the 5741 CCVs identified in this study, 3716 (64.7%) are located in regions with at least one of six genomic features (open chromatin, transcribed regions of active genes, promoter, enhancer, repressed gene regulatory elements, and transcription factor (TF) binding sites) (Supplementary Data 6 and 8). To identify putative target genes of these CCVs, we used functional genomic data generated in CRC-related tissues/cells to conduct in-silico analyses with a modified INQUISIT pipeline12 (Methods, Supplementary Data 9). We identified 72 putative target genes via CCVs located in distal enhancer elements (Supplementary Data 10), 48 genes via CCVs located in proximal promoter elements (Supplementary Data 11), and 19 genes that could be targeted by CCVs in coding regions (i.e., deleterious missense, stop_gained, and start_lost) (Supplementary Data 12). In total, we identified 128 genes associated with CCVs for 76 independent association signals, with a range from one to five putative target genes per signal. Of them, 52 independent association signals contain only a single putative target gene.

We also conducted cis-expression quantitative trait loci (cis-eQTL) analyses to identify target genes using four transcriptome datasets derived from either normal colon tissues or tumor-adjacent normal colon tissues from 1299 individuals from the Genotype-Tissue Expression (GTEx) project (n = 368 individuals predominantly of European ancestry), the BarcUVa-Seq project (n = 423 individuals of European ancestry), the Colonomics project (n = 144 individuals of European ancestry), and the Asia Colorectal Cancer Consortium (ACCC) (n = 364 individuals of East Asian ancestry) (Methods). At Bonferroni-corrected P < 0.05, we identified 153 genes associated with the lead variants, including 127 genes in 65 independent association signals and 30 in 15 signals identified from trans-ancestry and European-ancestry specific analyses, respectively. We also identified the PPP1R21 gene in a potential Asian-specific risk signal (lead variant rs77272589) (Supplementary Data 13). Out of the 153 genes, 37 had been previously identified by eQTL analysis5,10,11. For independent association signals identified in European and trans-ancestry analyses, we further performed cis-methylation quantitative trait loci (cis-mQTL) analyses using two methylation datasets generated from 321 individuals from the GTEx project (n = 189 individuals predominantly of European ancestry) and the Colonomics project (n = 132 individuals of European ancestry). We found that DNA methylation levels at CpG sites for 84 genes were associated with 71 independent association signals, including 14 genes identified in previous mQTL analysis11 (Supplementary Data 14).

We next conducted colocalization analyses for the likely target genes identified in the significant eQTL/mQTL analyses above using the Summary-data-based Mendelian Randomization (SMR) approach (Methods). Through the integration of eQTL/mQTL results and GWAS association signals, we identified 205 genes at Bonferroni-corrected PSMR < 0.05 (Supplementary Data 15–19), including 150 genes from the eQTL analysis and 84 genes from the mQTL analysis. Of these, 45 (21.9%) genes were also identified as targets of CCVs by in-silico analyses based on functional genomic data as described above, and 29 genes were identified in both mQTL and eQTL analyses. This is in line with previous observations of the overlap between mQTLs and eQTLs14. We also considered genes with evidence of only mQTL colocalization, given the enrichment of mQTLs in gene regulatory elements and their implications for other molecular phenotypes, such as chromatin accessibility14,15. Notably, of the 55 genes only identified in the mQTL analysis, seven genes were supported by the above in silico analyses with functional genomic data, and 22 genes showed association with CRC risk in previous TWAS and eQTL colocalization analysis7,11,16,17.

In total, we identified 288 putative target genes for 140 independent association signals based on functional genomics data and/or colocalization analysis. For 35 of these signals, multiple target gene candidates were detected per signal, suggesting that some may be false positives (Supplementary Data 20). To minimize false positive findings, we further prioritized target gene candidates by analyzing associations of genes with CRC risk based on previous transcriptome-wide association studies (TWAS) and colocalizations between eQTL and CRC GWAS signals7,11,16,17 (Methods). Finally, we obtained a credible set of 136 protein-coding genes for 124 independent association signals. Among them, 56 genes were not previously identified as potential targets for CRC risk associations, including nine genes in eight previously unreported association signals in this study (Table 3). The remaining 80 genes were previously reported as potential CRC susceptibility genes, and our study provided additional supporting evidence (Table 4)7,11,16,17.

Table 3 The 56 CRC susceptibility gene candidates not previously reported
Table 4 The 80 previously reported CRC susceptibility genes supported in this study

Using scRNA-seq data to evaluate gene expression pattern by cell types

To investigate the cell types in which putative susceptibility genes may act to contribute to CRC development, we analyzed single-cell RNA-seq (scRNA-seq) datasets from normal colon tissues obtained from 31 participants included in the Colorectal Molecular Atlas Project18 (Methods). Of the 136 identified genes, 17 genes exhibited significantly differential expression in specific cell types compared to the other cell types at |log2 fold change (FC)| > 1 and a nominal P < 0.05 (Supplementary Data 21). Nine of these genes (DIP2B, CIB1, HPGD, CDKN2B, TMEM258, MYL12A, MYL12B, CDKN1A, and TMBIM1) showed a distinct expression pattern specifically in absorptive (ABS) cells, underscoring the relevance of this cell type to CRC development.

Using whole exome sequencing data to evaluate pathogenic variants in target genes with CRC risk

We used whole exome sequencing data from 3362 CRC cases and 133,742 controls of European ancestry in the UK Biobank (UKBB) to evaluate the association of CRC risk with the putative candidate genes identified in our study, using burden tests that aggregated either predicted loss-of-function (pLOF) variants alone or pLOF and deleterious missense (Dmis) variants jointly in each gene (Methods). Of these 136 genes, MLH1 was significantly associated with CRC risk (P = 1.35 × 10−7) when considering only pLOF variants (Bonferroni-corrected threshold: 0.05/136 tests). An additional nine genes (TNFSF18, LRP1, SMAD9, PDGFB, CIB1, STK39, IGFBP3, FUT2, and FUT3) showed nominal significance (P < 0.05) when considering only pLOF variants or the combination of pLOF and Dmis variants, whereas no significant association was detected for the remaining genes.

Biological significance of the target genes for CCVs

We utilized Enrichr19,20,21 to analyze multiple pathway databases and identify enriched biological pathways among the 136 credible target genes (Methods). At a false-discovery rate (FDR) < 0.05, 126 pathways showed significant enrichment (Supplementary Data 22). Our findings were in line with our prior study18 and highlighted the enriched signaling pathways such as TGF-β, BMP, Wnt, Hippo, and TNF-α/NF-κB, which are known to play a crucial role in the development and progression of colorectal cancer19,20. Of the 56 genes not previously reported, nine genes (TGIF1, CDKN2B, MYC, BMP7, WNT7B, PRICKLE2, LGR6, CEBPB, and IRS2) were mapped to these pathways (Table 5). Additionally, we identified several significant pathways, including those related to cancer, pluripotency of stem cells, epithelial–mesenchymal transition, extracellular matrix organization, adipogenesis, senescence, and autophagy in cancer. Interestingly, we also identified the glycolysis pathway, which provides energy support for cancer cells, as a significant pathway not previously reported. Four previously unreported genes, GOT1, IGFBP3, IRS2, and LCT, were mapped to glycolysis, supporting their association with CRC risk.

Table 5 Significant enrichment in biological pathways

In addition, we performed functional annotation analysis on each credible target gene and assigned them to previously described cellular processes18 (Supplementary Fig. 2). Of the 56 genes not previously reported, 26 were found to be involved in these cellular processes. Specifically, five genes were related to stemness/differentiation, one gene was linked to adhesion/migration, and six genes were associated with proliferation. Interestingly, we also identified an additional cellular process, post-translational modifications (PTMs) of proteins, which included three genes (DCAF12, USP12, and SENP8). These findings suggest potentially critical roles of PTMs in the development of CRC.

Discussion

Our study, including approximately 254,000 individuals of East Asian and European ancestry, represents the largest study conducted to fine-map CRC risk-associated genomic regions using GWAS data. We identified 238 independent association signals at conditional P value < 1 × 10−6, including 47 signals not reported previously. Furthermore, integrating functional genomic data and results from cis-eQTL/mQTL and colocalization analyses, we identified 136 putative CRC susceptibility genes, including 56 genes that had not been previously reported. Notably, these identified genes are significantly enriched in several major CRC signaling pathways and other cancer-related pathways. Our findings not only substantially expand the number of association signals for CRC but also provide substantial data to advance our understanding of CRC biology.

The integration of comprehensive functional genomic data from relevant colon tissues and cell lines, as well as genetic associations data, can facilitate the identification of potential target genes for CRC risk. Our study significantly extends previous efforts7,11,16,17 by identifying 56 target gene candidates not previously reported for CRC risk, over half of which (29/56, 51.8%) are involved in the enriched biological pathways. For instance, eight target genes (TGIF1, CDKN2B, LGR6, MYC, PRICKLE2, WNT7B, BMP7, and TBX3) identified in this study may regulate normal intestinal homeostasis as they play roles in signaling pathways (i.e., Wnt and BMP) and pluripotency of stem cells. LGR6, for instance, is part of a G-protein-coupled receptor family and marks stem cells in the epidermis22. It activates a novel β-catenin/TCF7L2/LGR6-positive feedback loop in LGR6high cervical cancer stem cells (CSCs) to enhance the properties of cancer stem cells, including self-renewal, differentiation, and tumorigenicity23. Silencing of LGR6 resulted in the inhibition of stemness by repressing Wnt/β-catenin signaling in ovarian cancer24. TBX3, a transcriptional repressor, regulates stem cell maintenance by controlling stem cell self-renewal and differentiation, and reduced expression levels of TBX3 are associated with reduced pluripotency of stem cells25,26. MYC and WNT7B are implicated in the signaling related to the self-renewal and differentiation of cancer stem cells27. Here, we linked MYC and WNT7B with credible causal variants of CRC risk associations through functional genomic interaction. Our findings also indicated the relevance of glycolysis to CRC risk associations, a metabolic pathway critical in early CRC tumorigenesis by supporting the energetic and biosynthetic demands of CRC cells28,29. 
It should be noted that future studies are needed to validate chromatin interactions between identified CCVs and their target genes in this study by employing chromatin conformation capture technology such as in situ Hi-C, Capture Hi-C (CHi-C), and HiChIP.

Additional evidence supports some of the candidate target genes identified in our study as possible CRC susceptibility genes. In our differential gene expression analysis among normal colon mucosa, adenoma, and adenocarcinoma using gene expression data from 135 normal colon mucosas, 218 colon adenomas, and 2760 colon adenocarcinomas, we observed that 26 genes showed significant differential expression between adenoma and normal colon tissues, while 31 genes showed significant differential expression between carcinoma and adenoma tissues (adjusted P < 0.05) (Supplementary Data 20). Interestingly, three stemness/differentiation-related genes (LRRC34, CEBPB, and TBX3) showed significant changes in their expression levels in adenoma compared to normal colon mucosa. Additionally, 34 (60.7%) of the 56 genes not previously identified have been implicated in cancer-related functions in in vitro or in vivo functional experimental studies in CRC or other cancer types (Supplementary Data 20). These results provide further evidence supporting the potential involvement of these genes in CRC progression. Despite the above supportive evidence, it remains necessary to evaluate the functions of identified putative CRC susceptibility genes through both in vitro and in vivo assays in future investigations.

The trans-ancestry and ancestry-specific fine-mapping analyses conducted in this study not only enabled the discovery of independent association signals that are shared across populations of European and East Asian ancestry, but also revealed ancestry-specific signals. The larger sample size of the European-ancestry study enabled us to identify a larger number of independent association signals than in the East Asian-ancestry analysis. However, some ancestry-specific signals were identified in this study, most likely due to differences in LD structure and allele frequency between these two populations. Indeed, we observed distinct differences in the allele frequency for most ancestry-specific signals, as shown in Supplementary Data 4 and 5. For instance, the lead variants of 24 European ancestry-specific signals (40%, 24/60) were not detected in East Asian-ancestry populations. On the other hand, fine-mapping analyses capitalizing on ancestry differences in LD structure can substantially reduce the credible set size compared to European-ancestry specific analysis. This highlights the value of multi-ancestry fine-mapping over single-ancestry analysis. Our analysis is limited to two ancestry groups. Further studies should increase the diversity of genetic data by including other ancestry groups.

In summary, our large trans-ancestry fine-mapping analysis has identified large numbers of not previously reported independent association signals for CRC risk and refined the majority of the previously reported association signals. By leveraging data from two ancestries, we further defined putative causal variants underlying CRC risk signals. Our study has also uncovered a credible set of target genes. These findings offer a significant advancement in our understanding of the genetic and biological processes underlying CRC and provide a roadmap for further investigation of variants and genes identified in our study.

Methods

GWAS data and meta-analysis

The GWAS data used in this study comprised 100,204 CRC cases and 154,587 controls (Supplementary Data 1), which were grouped into 31 GWAS analytical units based on the study or genotyping platform, consistent with the original reports. Of them, 17 datasets were derived from populations of European descent and 14 were from populations of Asian descent. These 31 GWAS datasets were meta-analyzed under the fixed-effects inverse variance weighted model implemented in METAL30. Further details regarding each analytical unit and the meta-analysis are described in the Supplementary Note.
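For readers unfamiliar with the fixed-effects inverse variance weighted model, the following minimal Python sketch illustrates the calculation for a single variant. It is an illustration only and not the METAL implementation; the function name ivw_fixed_effects and its inputs (per-dataset effect estimates and standard errors) are hypothetical.

```python
import math

def ivw_fixed_effects(betas, ses):
    """Combine per-dataset effect estimates with inverse-variance weights.

    betas: effect estimates (e.g., log odds ratios), one per dataset
    ses:   corresponding standard errors
    Returns (beta, se, z, p) for the pooled estimate.
    """
    weights = [1.0 / se ** 2 for se in ses]
    total_w = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_w
    se = math.sqrt(1.0 / total_w)
    z = beta / se
    # Two-sided P value under a standard normal distribution
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, z, p
```

Datasets with smaller standard errors receive larger weights, so the pooled standard error is never larger than that of the most precise dataset.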

Identifying independent association signals

A total of 205 independent genetic associations have been reported for CRC risk by GWAS7. To define fine-mapping regions for CRC, we aggregated these risk variants using bedtools. Specifically, we identified 1 megabase (Mb) intervals centered on the risk variants, and if there were regions of overlap, we combined them into a single interval over 1 Mb. In total, we determined 143 fine-mapping regions, including 142 on autosomes and one on chromosome X (Supplementary Data 2). Our fine-mapping analysis and downstream analyses focused on the 142 genomic risk regions on autosomes.
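The region definition above can be sketched in a few lines of Python. This is a simplified stand-in for the bedtools-based aggregation, assuming that overlapping or touching 1 Mb windows on the same chromosome are combined; the function name build_regions and the example coordinates are hypothetical.

```python
def build_regions(variants, flank=500_000):
    """Merge 1 Mb windows centered on risk variants into fine-mapping regions.

    variants: iterable of (chrom, pos) tuples
    flank:    half-width of the window around each variant (500 kb gives 1 Mb)
    Returns a list of (chrom, start, end) tuples, sorted by chromosome and start.
    """
    windows = sorted(
        (chrom, max(0, pos - flank), pos + flank) for chrom, pos in variants
    )
    merged = []
    for chrom, start, end in windows:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Window overlaps the previous region: extend it
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


# Hypothetical coordinates: the first two variants fall into one region
regions = build_regions(
    [("chr1", 1_000_000), ("chr1", 1_600_000), ("chr1", 5_000_000), ("chr2", 1_000_000)]
)
```

Regions built this way span exactly 1 Mb when they contain one risk variant and more than 1 Mb when windows are merged.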

To identify distinct association signals within each risk region, we conducted a forward stepwise conditional analysis for summary statistics from the trans-ancestral meta-analysis, using GCTA-COJO31,32. We included common variants (MAF > 0.01) with associations at P < 0.05 in both populations. To account for differences in the LD structure, we conducted conditional analysis in each population for each fine-mapping region, conditioning on the most significant association from the trans-ancestral summary statistics. We then meta-analyzed the conditioned results using the fixed-effects inverse variance weighted model with METAL. To identify potential ancestry-specific independent signals, we also performed conditional analysis in each population, conditioning on the ancestry-specific most significant association. Common variants (MAF > 0.01) with association at P < 1 × 10−4 in each population were included. For LD estimation, we used genotyping data from 6684 unrelated samples of Asian descent33 and 503 European samples from the 1000 Genomes Project as the reference.

Following a previous study conducted for breast cancer12, we applied a threshold of conditional P < 1 × 10−6 to define independent signals. For each region, we first adjusted for the most significant association and then added any additional variant that remained an independent signal at conditional P < 1 × 10−6 to the conditional set. We then repeated the conditional analysis until no more variants met the significance threshold. In regions with multiple independent signals, we determined the index variant for each signal through a process of conditional analysis, adjusting for the index variants of the other signals. This process was repeated until the set of index variants stabilized. The variant with the strongest residual association was defined as the index for the signal.

For independent association signals identified in ancestry-specific analyses, we compared them with those from trans-ancestry analyses by assessing correlations between their lead variants within each risk region. If a signal was consistently found in both ancestry-specific and trans-ancestry analyses (i.e., the same lead variant or correlated lead variants with LD r2 > 0.1 in each corresponding population), we considered it a signal shared between Asian and European-ancestry populations. Otherwise, it was defined as an ancestry-specific signal.
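The comparison rule can be expressed as a short Python sketch. It assumes that pairwise LD r2 values within a region have already been computed in the corresponding population and are supplied as a lookup table; the function name is_shared_signal and the variant identifiers in the example are hypothetical.

```python
def is_shared_signal(specific_lead, trans_leads, r2, threshold=0.1):
    """Decide whether an ancestry-specific signal matches a trans-ancestry signal.

    specific_lead: lead variant ID from the ancestry-specific analysis
    trans_leads:   lead variant IDs from the trans-ancestry analysis (same region)
    r2:            dict mapping frozenset({id_a, id_b}) to LD r2 in the
                   corresponding population (assumed precomputed)
    """
    for trans_lead in trans_leads:
        if specific_lead == trans_lead:
            return True
        # Missing pairs are treated as uncorrelated (an assumption of this sketch)
        if r2.get(frozenset((specific_lead, trans_lead)), 0.0) > threshold:
            return True
    return False


# Hypothetical variant IDs and LD values
ld = {frozenset(("rsA", "rsB")): 0.45, frozenset(("rsC", "rsB")): 0.02}
shared = is_shared_signal("rsA", ["rsB"], ld)
specific = not is_shared_signal("rsC", ["rsB"], ld)
```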

Identifying a set of CCVs of each independent signal

To determine the CCVs of each independent signal, we used the approach described in a previous study for breast cancer12. Specifically, variants that have a conditional P value within two orders of magnitude of the most significant association, conditioning on all other independent association signals, were defined as CCVs.
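As a minimal illustration of this definition, the Python sketch below selects CCVs for one signal from a table of conditional P values. It assumes that "within two orders of magnitude" means a P value no more than 100 times that of the most significant variant; the function name and the variant labels are hypothetical.

```python
def credible_causal_variants(cond_p, fold=100.0):
    """Select CCVs for one independent signal.

    cond_p: dict mapping variant ID to its conditional P value
            (conditioned on all other independent signals in the region)
    fold:   maximum ratio to the smallest P value (100 = two orders of magnitude)
    Returns the sorted list of variant IDs retained as CCVs.
    """
    lead_p = min(cond_p.values())
    return sorted(v for v, p in cond_p.items() if p <= lead_p * fold)


# Hypothetical conditional P values for four variants in one signal
ccvs = credible_causal_variants({"v1": 1e-12, "v2": 5e-11, "v3": 2e-10, "v4": 1e-6})
```

In this example v1 is the lead variant, v2 falls within two orders of magnitude of it, and v3 and v4 are excluded.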

RNA-seq data analysis

We conducted mRNA sequencing on tumor-adjacent normal colon tissues obtained from 364 East Asian patients with colorectal cancer who participated in the ACCC. Furthermore, we included RNA-seq data from normal colon tissues from 423 individuals of European ancestry who participated in the BarcUVa-Seq project. The included subjects, library preparation, and sequencing of colon tissue samples in the ACCC and the BarcUVa-Seq project are described in the Supplementary Note.

The raw RNA-seq data were processed according to the pipeline of the GTEx Consortium. Sequencing reads were aligned to the reference genome GRCh37 (RNA-seq data from East Asians) or GRCh38 (RNA-seq data from the BarcUVa-Seq project) with STAR (v2.5.4)34. Quality control of aligned samples was performed using RNA-SeQC (v2.3.5)35. Samples that met any of the following criteria were removed: (1) <10 million mapped reads; (2) read mapping rate < 0.2; (3) intergenic mapping rate >0.4; (4) base mismatch rate >0.01 for read mate 1 or >0.02 for read mate 2; and (5) rRNA mapping rate >0.3. If the sample had replicated RNA-seq data, the one with the highest mapped reads was retained.
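The five exclusion criteria can be written as a single predicate. The Python sketch below is illustrative only; the SampleQC class and its field names are hypothetical and do not correspond to RNA-SeQC output column names.

```python
from dataclasses import dataclass

@dataclass
class SampleQC:
    # Hypothetical field names for per-sample alignment metrics
    mapped_reads: int
    mapping_rate: float
    intergenic_rate: float
    mismatch_rate_mate1: float
    mismatch_rate_mate2: float
    rrna_rate: float

def fails_qc(s):
    """Return True if the sample meets any of the five exclusion criteria."""
    return (
        s.mapped_reads < 10_000_000        # (1) fewer than 10 million mapped reads
        or s.mapping_rate < 0.2            # (2) low read mapping rate
        or s.intergenic_rate > 0.4         # (3) high intergenic mapping rate
        or s.mismatch_rate_mate1 > 0.01    # (4) base mismatch rate, mate 1
        or s.mismatch_rate_mate2 > 0.02    # (4) base mismatch rate, mate 2
        or s.rrna_rate > 0.3               # (5) high rRNA mapping rate
    )


good = SampleQC(45_000_000, 0.95, 0.05, 0.002, 0.004, 0.01)
bad = SampleQC(45_000_000, 0.95, 0.05, 0.002, 0.004, 0.35)
```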

Gene-level expression quantification was performed using RNA-SeQC based on the GENCODE release 19 annotation (for RNA-seq data from East Asians) and the GENCODE release 26 annotation (for RNA-seq data from the BarcUVa-Seq project)36. The read counts and TPM values of genes were calculated using aligned reads with the following criteria: (1) reads were uniquely mapped; (2) aligned reads were properly paired; (3) the read alignment distance was <6. The genes with expression thresholds of ≥0.1 TPM in ≥20% of samples and ≥6 reads (unnormalized) in ≥20% of samples were selected. Quantile normalization of the gene expression was performed. We further performed rank-based inverse normal transformation for the expressions of genes across samples.
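The gene filter and the rank-based inverse normal transformation can be illustrated with the following Python sketch for a single gene. It is a simplified stand-in for the pipeline, omitting the quantile normalization step; the function names are hypothetical, and the use of rank/(n + 1) as the quantile, with average ranks for ties, is an assumption rather than a detail stated in this study.

```python
from statistics import NormalDist

def passes_expression_filter(tpm, counts, min_tpm=0.1, min_reads=6, min_frac=0.2):
    """Keep a gene if >=0.1 TPM and >=6 reads are each seen in >=20% of samples."""
    n = len(tpm)
    frac_tpm = sum(t >= min_tpm for t in tpm) / n
    frac_reads = sum(c >= min_reads for c in counts) / n
    return frac_tpm >= min_frac and frac_reads >= min_frac

def inverse_normal_transform(values):
    """Map one gene's expression values across samples to normal quantiles."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied values
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank across the tie
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    nd = NormalDist()
    # Assumed offset: quantile = rank / (n + 1)
    return [nd.inv_cdf(r / (n + 1)) for r in ranks]


transformed = inverse_normal_transform([3.0, 1.0, 2.0])
```

The transformation preserves the ordering of samples while making the distribution of each gene symmetric around zero, which limits the influence of outlying expression values in the regression.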

Cis-expression/methylation quantitative loci (cis-eQTL/mQTL) analysis

To identify target genes, we performed cis-eQTL analysis based on a linear regression framework10,11. Gene expression data from four expression datasets comprising a total of 1,299 individuals were used: 1) GTEx project of transverse colon tissues from 368 individuals predominantly of European ancestry, 2) Colonomics project of normal colon tissues or tumor-adjacent normal colon tissues from 144 individuals of European ancestry, 3) BarcUVa-Seq project of normal colon tissues from 423 individuals of European ancestry, and 4) ACCC of tumor-adjacent normal colon tissue from 364 CRC patients of East Asian ancestry. We obtained available cis-eQTL results for CCVs and their nearby genes (within 1 Mb to CCV) from the GTEx database (version 8) and the Colonomics project. Details for gene expression data and eQTL analysis in the Colonomics project are explained elsewhere37. For the analyses using the remaining two datasets, we conducted a linear regression analysis to assess the associations between CCV and the normalized expression levels of nearby genes (within 1 Mb to CCV), adjusting for age, gender, and five top principal components.

We conducted cis-mQTL analysis for CCVs identified in European and trans-ancestry analyses. To do this, we included methylation data obtained from a total of 321 individuals. These datasets consisted of 189 transverse colon tissues predominantly of European ancestry from GTEx, as well as normal colon tissues or tumor-adjacent normal colon tissues of 132 individuals of European ancestry from the Colonomics project. We extracted cis-mQTL results for CCVs and their nearby CpG sites (within 1 Mb to CCV) from the GTEx database (version 8)14. In the Colonomics project, a linear regression analysis was used to evaluate the associations between CCV and the normalized methylation levels of CpG sites (within 1 Mb to CCV), with adjustments of age, gender, and colon sites (right/left). Further details about the cis-mQTL in the Colonomics project can be found in previous studies37,38.

Meta-analysis of cis-eQTL/mQTL results

We performed a meta-analysis to integrate the summary cis-eQTL/mQTL results based on beta and p values from different datasets10,11. In brief, we calculated the z score for each dataset as qnorm(p/2)*sign(beta) and then combined these into a standard z score as sum(z*sqrt(N))/sqrt(sum(N)), weighting each dataset by its sample size. Here, beta and p value were derived from eQTL/mQTL results and N referred to the sample size for each dataset. The meta p value was derived from the standard z score. For independent signals detected in both European and Asian populations, the eQTL results from both populations were combined.
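The sample-size-weighted z-score combination can be sketched in Python as follows. This is an illustration under stated assumptions, not the analysis code: the function names are hypothetical, P values are assumed to lie strictly between 0 and 1, and the z score is signed so that a positive beta gives a positive z. That sign convention is an assumption of the sketch and does not affect the meta-analysis P value.

```python
import math
from statistics import NormalDist

_ND = NormalDist()

def signed_z(beta, p):
    """Convert a two-sided P value to |z| and attach the direction of beta."""
    magnitude = -_ND.inv_cdf(p / 2.0)  # inv_cdf(p/2) is negative for p < 1
    return math.copysign(magnitude, beta)

def weighted_z_meta(results):
    """Combine QTL results across datasets.

    results: list of (beta, p, n) tuples, one per dataset
    Returns (z_meta, p_meta), with z_meta = sum(z*sqrt(N)) / sqrt(sum(N)).
    """
    num = sum(signed_z(b, p) * math.sqrt(n) for b, p, n in results)
    den = math.sqrt(sum(n for _, _, n in results))
    z = num / den
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal P value
    return z, p
```

With a single dataset the function returns the input P value, and two datasets of equal size with equal P values but opposite effect directions cancel to z = 0.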

We adjusted the combined p values of the eQTL/mQTL results using the Bonferroni procedure, applied to the index variants of independent association signals. A Bonferroni-adjusted P < 0.05 was used to identify potential target genes for each signal.
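As a minimal sketch of this step, the adjustment below multiplies each p value by the number of tests and caps the result at 1. It assumes that the family of tests is the set of genes (or CpG sites) tested for one index variant; the function names and gene labels are hypothetical.

```python
def bonferroni(p_values):
    """Bonferroni-adjust a dict of {feature: p value}; adjusted values are capped at 1."""
    n_tests = len(p_values)
    return {feature: min(1.0, p * n_tests) for feature, p in p_values.items()}


def significant_features(p_values, alpha=0.05):
    """Features whose Bonferroni-adjusted p value is below alpha."""
    adjusted = bonferroni(p_values)
    return sorted(feature for feature, p in adjusted.items() if p < alpha)
```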

Colocalization analyses between GWAS association signals and eQTL/mQTL signals

To identify putative target genes, we used the SMR method to conduct a colocalization analysis39. We integrated the GWAS summary statistics of CCVs with their gene associations from the eQTL/mQTL analyses described above, using the results of the cis-eQTL/mQTL meta-analyses. Specifically, the test statistic is:

$${T}_{{SMR}}={b}_{{xy}}^{2}/{Var}\left({b}_{{xy}}\right)\, \approx \, \frac{{Z}_{{zy}}^{2}{Z}_{{zx}}^{2}}{{Z}_{{zy}}^{2}+{Z}_{{zx}}^{2}\,}$$
(1)

Here, Zzx and Zzy are the Z statistics for the cis-eQTL/mQTL results and the GWAS summary statistics, respectively. TSMR is a χ2 statistic with one degree of freedom that tests the significance of bxy. Significant colocalized signals were determined using a Bonferroni-corrected PSMR < 0.05 within each independent signal.
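For illustration, the approximate statistic in Eq. (1) and its p value can be computed from the two Z statistics alone. The sketch below is not the SMR software; the argument names z_gwas and z_qtl are hypothetical, and the p value assumes a χ2 reference distribution with one degree of freedom, for which the upper-tail probability equals erfc(sqrt(T/2)).

```python
from math import erfc, sqrt


def smr_statistic(z_gwas, z_qtl):
    """Approximate SMR test statistic from the GWAS and QTL z scores, as in Eq. (1)."""
    numerator = (z_gwas ** 2) * (z_qtl ** 2)
    denominator = z_gwas ** 2 + z_qtl ** 2
    if denominator == 0:
        return 0.0  # both z scores are zero: no evidence in either study
    return numerator / denominator


def smr_p_value(z_gwas, z_qtl):
    """Upper-tail p value of the statistic under a chi-square with 1 degree of freedom."""
    return erfc(sqrt(smr_statistic(z_gwas, z_qtl) / 2))
```

Because the statistic is symmetric in the two z scores and never exceeds the smaller of their squares, a strong GWAS signal cannot reach significance without a comparably strong QTL signal, and vice versa.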

Functional annotation of CCVs

We investigated whether each potential causal variant mapped to gene regulatory regions (i.e., promoters or enhancers) (Supplementary Data 8). We obtained 351 chromatin immunoprecipitation sequencing (ChIP-seq) peak files for histone modification marks and transcription factors, and 25 DNase I hypersensitive sites sequencing peak files for chromatin accessibility, generated in normal colorectal epithelium and CRC cell lines from the Cistrome database40,41. Only peak files that passed all six quality-control metrics recommended by Cistrome were analyzed. Additionally, we obtained available ChIP-seq data of histone modification marks from colon tissues, CRC tumor tissues, and CRC cell lines from Gene Expression Omnibus (GEO), which included 16 from GSE13392842, 215 from GSE13688943 and 233 from GSE15661344. For ChIP-seq data provided as bigWig (bw) coverage tracks, we converted the files to bedGraph format and then called peaks with the bdgpeakcall subcommand of MACS245. For each variant, we examined whether it mapped to a peak region of histone modification marks, DNase I hypersensitive sites, or transcription factor binding sites using an in-house script.
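The in-house script is not described further. A minimal sketch of this kind of lookup is shown below, assuming peaks are given as BED-style 0-based, half-open intervals and variants as 1-based positions; the function names, labels, and coordinates are hypothetical.

```python
from collections import defaultdict


def build_peak_index(peaks):
    """Group (chrom, start, end, label) peaks by chromosome."""
    index = defaultdict(list)
    for chrom, start, end, label in peaks:
        index[chrom].append((start, end, label))
    return index


def peaks_at(index, chrom, pos):
    """Sorted labels of all peaks covering a 1-based variant position."""
    point = pos - 1  # convert to the 0-based coordinate used by BED intervals
    return sorted(
        {label for start, end, label in index.get(chrom, []) if start <= point < end}
    )
```

The linear scan per chromosome keeps the sketch short; for genome-wide peak sets an interval tree or a sorted sweep would be the more practical choice.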

In silico prediction of regulatory element-to-gene links

Since the majority of the CCVs are located outside protein-coding regions, genes can potentially be regulated by CCVs located in distal enhancer elements and proximal promoter elements. Hence, we assembled an extensive set of functional genomic data from normal colon tissues, CRC tumor tissues, and CRC cell lines (Supplementary Data 9). Subsequently, we conducted an in silico analysis for each CCV-gene pair.

We used a variety of experimental and computational functional genomic data to identify target genes of CCVs in regulatory elements. Specifically, for distal regulatory elements, we used chromatin-chromatin interaction data from experiments or computational predictions. We downloaded 13 experimental chromatin-chromatin interaction datasets under accessions GSE13392842 and GSE13662943 from GEO, as well as two promoter capture Hi-C datasets from a previous study46. We combined these data with ChIP-seq data for the histone modification H3K27ac (an active enhancer mark) to identify enhancer-promoter loops. We defined these loops as interactions in which one fragment overlapped an H3K27ac peak (enhancer-like) and the other fragment overlapped the promoter of a gene (the region from downstream 1 kb to upstream 100 bp around the transcription start site).
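The loop definition above can be sketched as a pair of interval-overlap checks. The code below is an illustrative assumption rather than the pipeline used here: it takes promoter windows as precomputed intervals, tests both orientations of each loop, and uses hypothetical function names and coordinates.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def enhancer_promoter_loops(loops, h3k27ac_peaks, promoters):
    """Return (enhancer_anchor, gene) pairs supported by a chromatin loop.

    loops: list of (anchor1, anchor2) interval pairs.
    h3k27ac_peaks: list of intervals marking enhancer-like regions.
    promoters: dict mapping gene name to its promoter interval.
    """
    hits = []
    for anchor1, anchor2 in loops:
        # Either anchor may be the enhancer side, so check both orientations.
        for enh_side, prom_side in ((anchor1, anchor2), (anchor2, anchor1)):
            if not any(overlaps(enh_side, peak) for peak in h3k27ac_peaks):
                continue
            for gene, promoter in promoters.items():
                if overlaps(prom_side, promoter):
                    hits.append((enh_side, gene))
    return hits
```

A CCV would then be linked to a gene when it falls inside the enhancer-side anchor of one of the returned pairs.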

In addition, we downloaded experimentally confirmed enhancer-gene pairs from the ENdb database. We also obtained computationally predicted enhancer-promoter interactions from IM-PET47, FANTOM548,49, EnhancerAtlas50, and super-enhancer annotations51,52. To further refine our analysis, we included topologically associating domain (TAD) boundaries in three colorectal cancer cell lines (HT29, LoVo, and DLD1)46,53. Finally, we examined the overlap between CCVs and enhancer elements. For proximal promoter elements, we analyzed CCVs located within gene promoter regions that intersected with ChIP-seq peaks of H3K4me3 (an active promoter mark).

To identify potential loss-of-function variants and their corresponding target genes, we annotated CCVs using the Variant Effect Predictor (VEP) tool54. To predict the consequences of missense coding variants, we used PolyPhen-2 and SIFT. To evaluate splicing effects, we used MaxEntScan.

We scored CCV target genes using different criteria (Supplementary Data 9). For a potential target gene of a CCV in a distal enhancer element, the gene was awarded two points if there was evidence from an experimental chromatin-chromatin interaction, or one point if there was evidence from a computationally predicted interaction. The score was up-weighted to three if both experimental and computational interactions were detected for the gene-CCV pair. If the CCV interacted with genomic features (open chromatin, active enhancers, or TF binding sites), the corresponding gene was up-weighted by a further point. An additional point was awarded if there were at least two interactions for the CCV. If the gene was a colorectal cancer or pan-cancer driver55, it was up-weighted by an additional point. The score was down-weighted if the CCV and gene were separated by a TAD boundary or if the gene lacked expression in colon tissues. Distal scores ranged from 0 to 6. For a potential target gene of a CCV in a proximal promoter element, the gene was awarded one point if the CCV overlapped transcription factor binding sites, and an additional point if it was a colorectal cancer or pan-cancer driver. A lack of expression down-weighted the score to 10%. Proximal scores ranged from 0 to 2. Genes predicted to be targets of coding CCVs were awarded points based on annotation of the variant as missense, nonsense, or a predicted splicing alteration. A missense variant predicted to be probably damaging or deleterious added one point to the target gene. A further point was awarded if the gene was a colorectal cancer or pan-cancer driver. A lack of expression down-weighted the score to 10%. Coding scores ranged from 0 to 2. We defined confident target genes as those with a distal score >4, a proximal score >1, or a coding score >1.
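As a rough sketch of the distal scoring rules only, the function below adds the points described above and applies a multiplicative penalty. The size of the distal penalty is not stated in the text; the default of 0.1 is an assumption borrowed from the 10% down-weighting given for proximal and coding scores, and all names are hypothetical.

```python
def distal_score(experimental, computational, regulatory_feature,
                 multiple_interactions, driver,
                 tad_separated=False, expressed=True, penalty=0.1):
    """Score one CCV-gene pair for a CCV in a distal enhancer element."""
    score = 0.0
    if experimental:
        score += 2  # experimental chromatin-chromatin interaction
    if computational:
        score += 1  # computationally predicted interaction
    if regulatory_feature:
        score += 1  # open chromatin, active enhancer, or TF binding site
    if multiple_interactions:
        score += 1  # at least two interactions for the CCV
    if driver:
        score += 1  # colorectal cancer or pan-cancer driver gene
    if tad_separated or not expressed:
        score *= penalty  # ASSUMPTION: the distal penalty factor is not given in the text
    return score


def is_confident_distal(score):
    """Distal threshold for a confident target gene (score > 4)."""
    return score > 4
```

Under these rules a gene needs experimental and computational interaction evidence, a regulatory feature, and at least one of the two remaining criteria to exceed the distal threshold.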

Credible set of susceptibility genes

To determine a set of credible genes for CRC susceptibility, we combined information on gene-CRC risk associations from TWAS and from colocalization of eQTL signals with GWAS risk signals, for genes that were present in both our study and previous investigations. We used the following three sets of previously identified genes: (A) 155 effector genes identified through GWAS, TWAS, TIsWAS, and MWAS7; (B) 136, 26, and 48 genes identified through TWASs7,16,17; (C) 73 genes identified through colocalization analysis between eQTL and GWAS signals11 or genes associated with CRC risk at nominal P < 0.05 in the previous TWAS17. We prioritized these gene sets in the order A > B > C and focused on protein-coding genes outside the MHC region. For independent association signals with multiple candidate target genes, we kept the genes from the highest-priority set, or all genes if none was supported by these three gene sets. For independent association signals with a single candidate gene, we kept that gene regardless of evidence from the gene sets.
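The selection rule for each independent signal can be sketched as below. This is an illustrative reading of the rule, not the code used in the study; the function name and gene labels are hypothetical.

```python
PRIORITY = {"A": 0, "B": 1, "C": 2}  # lower rank means higher priority (A > B > C)


def select_genes(candidates, gene_sets):
    """Pick credible genes for one signal.

    candidates: list of candidate target genes for the signal.
    gene_sets: dict mapping "A", "B", "C" to sets of previously reported genes.
    """
    if len(candidates) == 1:
        return list(candidates)  # a single candidate is kept regardless of evidence

    def best_rank(gene):
        ranks = [PRIORITY[name] for name, genes in gene_sets.items() if gene in genes]
        return min(ranks) if ranks else None

    ranked = {gene: best_rank(gene) for gene in candidates}
    supported = [rank for rank in ranked.values() if rank is not None]
    if not supported:
        return list(candidates)  # no prior evidence for any candidate: keep all
    top = min(supported)
    return [gene for gene in candidates if ranked[gene] == top]
```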

Single-cell RNA-sequencing data analysis

We included single-cell RNA-sequencing datasets from colon tissues of 31 individuals who participated in the Colorectal Molecular Atlas Project (COLON MAP)18. We analyzed the gene expression dataset for each individual's cells and combined these datasets into a single count matrix. We normalized the number of unique molecular identifiers (UMIs) per cell to transcripts per 10,000 transcripts (TP10K). Next, we applied a logarithmic transformation to the normalized values to obtain the log2(TP10K + 1) expression matrix for downstream analyses. We then determined the 2000 most highly variable genes across the entire dataset and performed a principal component analysis (PCA). The top 30 and 40 principal components (PCs) were identified. Subsequently, we performed batch correction and used the top 40 batch-corrected components to construct a k-nearest neighbors graph of cell profiles with k = 9. We visualized the individual single-cell profiles using Uniform Manifold Approximation and Projection (UMAP) and clustered the neighborhood graph using the Leiden graph-clustering method. Nine cell types were defined, including well-known major cell types such as absorptive cells (ABS), crypt top colonocytes (CT), enteroendocrine cells (EE), goblet cells (GOB), stem cells (STM), and others. We identified differentially expressed genes (DEGs) by comparing each cell type with all other cell types and calculated a P value for each gene using Wilcoxon's rank-sum test. Genes with |log2 fold change (FC)| > 1 and P < 0.05 were considered significantly differentially expressed between cell types.
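The per-cell normalization step can be illustrated with a few lines of code. The sketch below covers only the conversion of UMI counts to log2(TP10K + 1) for a single cell; the function name and gene labels are hypothetical, and it is not the software used for the analysis.

```python
from math import log2


def log_tp10k(umi_counts):
    """Convert one cell's {gene: UMI count} to log2(TP10K + 1) values."""
    total = sum(umi_counts.values())
    if total == 0:
        return {gene: 0.0 for gene in umi_counts}  # empty cell: nothing to scale
    return {
        gene: log2(count / total * 10_000 + 1)
        for gene, count in umi_counts.items()
    }
```

For a cell with UMI counts of 1, 3, and 0 for three genes, the TP10K values are 2500, 7500, and 0, so the gene with zero counts keeps a transformed value of 0.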

Burden test for credible susceptibility genes

We annotated all variants in the UKBB WES 200K cohort with functional annotations from ANNOVAR56 based on the reference genome GRCh38. We included only rare loss-of-function (LoF) and deleterious missense (Dmis) variants with MAF < 0.01 in our gene-based test. LoF variants were those predicted by ANNOVAR as frameshift insertions/deletions, splice-site alterations, stop gains, or stop losses, and Dmis variants were those predicted as deleterious by MetaSVM57. We tested both LoF sets and damaging sets (LoF + Dmis) within each gene. For a given set, we collapsed the rare variants within a gene into a single combined ‘mask’ and tested the association between the ‘mask’ genotype and CRC status using logistic regression, adjusting for sex, age, the sex-by-age interaction, and the top four principal components.
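The variant filtering and collapsing steps can be sketched as follows; the logistic regression itself is omitted. The sketch assumes that the mask genotype is a carrier indicator (1 if an individual carries at least one qualifying allele, 0 otherwise), which is one common coding and is not specified in the text. Function names, field names, and identifiers are hypothetical.

```python
def qualifying_variants(variants, mask="LoF", maf_cutoff=0.01):
    """IDs of rare variants in one gene that belong to the requested mask.

    variants: list of dicts with keys "id", "maf", and "class" ("LoF" or "Dmis").
    mask: "LoF" for loss-of-function only, "LoF+Dmis" for the damaging set.
    """
    allowed = {"LoF"} if mask == "LoF" else {"LoF", "Dmis"}
    return [v["id"] for v in variants if v["maf"] < maf_cutoff and v["class"] in allowed]


def collapse_mask(genotypes, variant_ids):
    """Collapse per-variant genotypes into one carrier indicator per sample.

    genotypes: dict mapping sample to {variant id: alternate allele count}.
    ASSUMPTION: the mask is coded 1 for carriers of any qualifying allele, else 0.
    """
    return {
        sample: int(any(calls.get(vid, 0) > 0 for vid in variant_ids))
        for sample, calls in genotypes.items()
    }
```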

Pathway analysis of credible susceptibility genes

To explore the potential biological roles of the identified CRC susceptibility genes, we analyzed their functional enrichment using enrichR19,20,21 with several pathway databases, including WikiPathways, KEGG, MSigDB, and Reactome. Pathways with an adjusted P < 0.05 were considered significant and are presented.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.