5. Interpreting variation: GWAS, cancer, genotype, evolution and allelic

doi:10.1038/nature14313

Thread
Published: 18 February 2015


Nature (2015)


Reference epigenomes provide leads about the cell-type specificity and molecular basis of natural and disease-associated genetic variants

Complex trait variants are enriched in diverse epigenomic marks

Integrative analysis of 111 reference human epigenomes.

Roadmap Epigenomics Consortium et al. Nature 10.1038/nature14248

For enhancer-associated H3K4me1 peaks, we found 58 studies (Fig. 9a, Extended Data 11a) with significant enrichments in at least one tissue at 2% FDR (Hypergeometric P<10^-3.9). Upon manual curation, the enriched cell types were consistent with our current understanding of disease-relevant tissues for the vast majority of cases. For example, diverse immune traits were enriched in immune cell enhancers, including rheumatoid arthritis, celiac disease, type 1 diabetes, systemic lupus erythematosus, chronic lymphocytic leukemia, allergy, multiple sclerosis, and Graves’ disease ^75-81. A large number of metabolic trait variants are enriched in liver enhancer marks, including LDL, HDL, total cholesterol, lipid metabolism phenotypes, and metabolite levels ^82,83. Fasting glucose was most enriched for pancreatic islet enhancer marks, and insulin-like growth factors in placenta, consistent with their endocrine regulatory roles ^84,85. Several cardiac traits were enriched in heart tissue enhancers, including the PR heart repolarization interval, blood pressure, and aortic root size. Interestingly, inflammatory bowel disease and ulcerative colitis variants show enrichment in both immune and gastrointestinal enhancer marks, suggesting dysregulation of both organs may underlie disease predisposition. Both attention deficit hyperactivity disorder and adiponectin levels were enriched in brain regions, consistent with causal roles in brain dysregulation ^86,87. In contrast, late-onset Alzheimer's disease variants were enriched in immune cell enhancers, rather than brain, consistent with recent evidence of a possible immune and inflammatory basis ^88-90.
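To illustrate the kind of hypergeometric enrichment test named above, the following minimal sketch computes an upper-tail P value for the overlap between one study's trait-associated SNPs and the SNPs that fall within H3K4me1 peaks of one tissue. The function name and all counts are hypothetical, and the consortium's actual background set and FDR procedure are described in their Methods and are not reproduced here.

```python
from math import comb

def hypergeom_upper_tail(total_snps, snps_in_peaks, study_snps, study_snps_in_peaks):
    """P(X >= study_snps_in_peaks) when study_snps are drawn without
    replacement from total_snps, of which snps_in_peaks lie inside peaks."""
    denom = comb(total_snps, study_snps)
    max_k = min(study_snps, snps_in_peaks)
    tail = sum(
        comb(snps_in_peaks, k) * comb(total_snps - snps_in_peaks, study_snps - k)
        for k in range(study_snps_in_peaks, max_k + 1)
    )
    return tail / denom

# Hypothetical counts: 10,000 catalog SNPs, 500 of them in the tissue's peaks,
# and a study with 40 SNPs of which 10 fall in those peaks.
p_value = hypergeom_upper_tail(10000, 500, 40, 10)
print(f"enrichment P = {p_value:.3g}")
```

In practice one such test would be run per study and tissue pair, with the resulting P values corrected for multiple testing.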

Utilizing epigenomic annotations to gain insights into Alzheimer’s-disease-associated loci

Conserved epigenomic signals in mice and humans reveal immune basis of Alzheimer’s disease.

Gjoneska, E. et al. Nature 10.1038/nature14252

We next utilized the epigenomic annotations of increased-activity enhancer orthologs to gain insights into AD-associated loci (Supplementary Table S7). Among the 20 genome-wide significant AD-associated loci⁴, 11 contain no protein-altering SNPs in linkage disequilibrium (LD), indicating that they may act through non-coding mechanisms. Of these, 5 localize within increased-level enhancer orthologs, including two well-established GWAS loci (PICALM, BIN1), and three loci (INPP5D, CELF1/SPI1, PTK2B) only recently recognized as significant by combining all AD cohorts.

For INPP5D (Fig. 3a), a known regulator of inflammation²⁸, the most significant variants localize within an increased-level enhancer ortholog, which also shows CD14+ enhancer activity. In the CELF1 locus (Fig. 3b) a large region of association spans several genes, but the strongest genetic signal (p=2x10^-6) localizes upstream of SPI1 (PU.1), and specifically within an increased-level enhancer ortholog that is also active in immune cells. We confirmed that the AD-associated C-T substitution, rs1377416, in the SPI1 enhancer leads to increased in vitro enhancer activity in murine BV-2 microglia cells using a luciferase reporter assay (Fig. 3d). In addition, the AD-associated SNP rs55876153 near SPI1, which overlaps an increased-level mouse enhancer ortholog, is in strong linkage disequilibrium (LD=0.89, see Methods) with a known SPI1 eQTL, rs10838698²⁵, even though it did not significantly alter enhancer activity in the luciferase assay.

Outside known GWAS loci, an additional 22 weakly-associated regions (3.9 fold, p<4.9x10^-7) contain variants within increased-level enhancer orthologs (Supplementary Table S7), of which 17 lack protein-altering variants in LD (R²<0.4), providing strong candidates for directed experiments. One such example is ABCA1 (p=6.9x10^-5, Fig. 3c), a paralog of AD-associated ABCA7 that encodes a glial-expressed transporter that influences APOE metabolism in the central nervous system²⁹. The region lacks protein-altering variants and all five SNPs in the cluster of association lie specifically within an increased-level enhancer ortholog, which is also active in CD14+ immune cells and, to a lesser extent, in human hippocampus and fetal brain.

Fine-mapped genetic architecture of disease

Genetic and epigenetic fine mapping of causal autoimmune disease variants.

Farh, K. K.-H. et al. Nature 10.1038/nature13835

Prior studies that have integrated GWAS with epigenomic features focused on lead SNPs or multiple associated SNPs within a locus, of which only a small minority reflects causal variants^{10,16–19,21}. Although these studies demonstrated enrichments within enhancer-like regulatory elements, they could not with any degree of certainty pinpoint the specific elements or processes affected by the causal variants. To overcome this limitation, we leveraged dense genotyping data to refine a statistical model for predicting causal SNPs from genetic data alone. Rare recombination events within haplotypes can provide information on the identity of the causal SNP, provided sufficient genotyping density and sample size. We therefore examined a cohort of 14,277 cases with multiple sclerosis and 23,605 healthy controls genotyped using the Immunochip, which comprehensively covers 1000 Genomes Project SNPs²² within 186 loci associated with autoimmunity²⁰. We developed an algorithm, Probabilistic Identification of Causal SNPs (PICS), that estimates the probability that an individual SNP is a causal variant given the haplotype structure and observed pattern of association at the locus (Methods, Extended Data Figs 1–4).

We next generalized PICS to analyse 21 autoimmune diseases, using Immunochip data when they were available or imputation to the 1000 Genomes Project²² when they were not (Methods; Supplementary Table 1). We mapped 636 autoimmune GWAS signals to 4,950 candidate causal SNPs (mean probability of representing the causal variant responsible for the GWAS signal: ~10%). PICS indicates that index SNPs reported in the GWAS catalogue have on average only a 5% chance of representing a causal SNP. Rather, GWAS catalogue index SNPs are typically some distance from the PICS lead SNP (median 14 kb), and many are not in tight LD (Fig. 1d, Extended Data Fig. 5). PICS identified a single most likely causal SNP (>75% probability) at 12% of loci linked to autoimmunity. However, most GWAS signals could not be fully resolved due to LD and thus contain several candidate causal SNPs (Fig. 1e).
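The summary statistics above (a lead SNP per signal, and the share of signals resolved to a single SNP at >75% probability) can be tabulated once per-SNP probabilities are available. The sketch below is not the PICS algorithm itself; it assumes hypothetical per-SNP probabilities have already been computed for each GWAS signal, and the locus and SNP names are invented for illustration.

```python
def summarize_signal(snp_probs, resolve_threshold=0.75):
    """snp_probs: dict mapping SNP id -> probability of being the causal
    variant (hypothetical input, e.g. output of a fine-mapping method)."""
    ranked = sorted(snp_probs.items(), key=lambda kv: kv[1], reverse=True)
    lead_snp, lead_prob = ranked[0]
    return {
        "lead_snp": lead_snp,
        "lead_prob": lead_prob,
        "resolved": lead_prob > resolve_threshold,  # single most likely causal SNP
        "n_candidates": len(ranked),
    }

def fraction_resolved(signals, resolve_threshold=0.75):
    """Share of signals whose lead SNP exceeds the probability threshold."""
    summaries = [summarize_signal(p, resolve_threshold) for p in signals.values()]
    return sum(s["resolved"] for s in summaries) / len(summaries)

# Invented example: one well-resolved signal and one blurred by LD.
signals = {
    "locus_A": {"snp1": 0.82, "snp2": 0.10, "snp3": 0.05},
    "locus_B": {"snp4": 0.30, "snp5": 0.28, "snp6": 0.22, "snp7": 0.15},
}
print(fraction_resolved(signals))
```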

To confirm the functional significance of fine-mapped SNPs, we compared PICS SNPs against a strict background of random SNPs drawn from the same loci. Candidate causal SNPs derived by PICS were strongly enriched for protein-coding (missense, nonsense, frameshift) changes, which account for 14% of the predicted causal variants compared to just 4% of the random SNPs. Modest enrichments over the locus background were also observed for synonymous substitutions (5%), 3′ UTRs (3%), and splice junctions (0.2%) (Fig. 1f). Although these results support the efficacy of PICS for identifying causal variants, ∼90% of GWAS hits for autoimmune diseases remain unexplained by protein-coding variants. Candidate causal SNPs and the PICS algorithm are available through an accompanying online portal (http://www.broadinstitute.org/pubs/finemapping).

Cell-type-specific signatures of complex disease

Genetic and epigenetic fine mapping of causal autoimmune disease variants.

Farh, K. K.-H. et al. Nature 10.1038/nature13835

Along with the 21 autoimmune diseases, we predicted causal SNPs for 18 other traits and diseases (Methods). Comparing SNP locations with chromatin maps for 56 cell types revealed the cell type-specificities of cis-regulatory elements that coincide with PICS SNPs, thus predicting cell types contributing to each phenotype (Fig. 3). The patterns are more informative than the expression patterns of genes targeted by coding GWAS hits (Extended Data Fig. 8). Notable examples include SNPs associated with Alzheimer’s disease and migraine, which map to enhancers and promoters active in brain tissues, and SNPs associated with fasting blood glucose, which map to elements active in pancreatic islets. Nearly all of the autoimmune diseases preferentially mapped to enhancers and promoters active in CD4+ T-cell subpopulations. However, a few diseases, such as systemic lupus erythematosus, Kawasaki disease, and primary biliary cirrhosis, preferentially mapped to B-cell elements. Notably, ulcerative colitis also mapped to gastrointestinal tract elements, consistent with its bowel pathology. Although the primary signature of type 1 diabetes SNPs is in T-cell enhancers, there is also enrichment in pancreatic islet enhancers (P < 10⁻⁷). Thus, although immune cell effects may be shared among autoimmune diseases, genetic variants affecting target organs such as bowel and pancreatic islets may shape disease-specific pathology.

Variably DNA methylated loci in CD34⁺ haematopoietic stem and progenitor cells

The meta-epigenomic structure of purified human stem cell populations is defined at cis-regulatory sequences.

Wijetunga, N. A. et al. Nature Communications 10.1038/ncomms6195

We used two sources of DNA methylation data, one from the Roadmap Epigenomics programme, publicly available reduced representation bisulphite sequencing (RRBS)²³ data on mobilized CD34+ HSPCs from 7 adults, and the second generated by our group, using CD34+ HSPCs isolated from cord blood from 29 phenotypically normal neonates assayed using the HELP-tagging assay²⁴. Despite the differences in how each of these assays measures DNA methylation, both showed increased variability at loci with intermediate methylation values (Fig. 1), consistent with previous observations¹⁶.

We continued our analyses based on the HELP-tagging data, which are derived from a greater number of samples and from neonates, who have less potential for manifesting age-associated variability than adults²⁵. As HELP-tagging is based on the use of methylation-sensitive restriction enzymes²⁴, we were able to use the results from the methylation-insensitive MspI control enzyme to estimate the degree of technical variability, and a permutation analysis of the HpaII-derived data also showed enrichment of the observed variability over expected background levels (Supplementary Fig. 2a). A number of loci with differing degrees of variability were chosen for bisulphite PCR, using seven of the samples that had been tested using HELP-tagging as well as eight independent samples. These amplicons were combined for each individual and used to generate Illumina libraries, allowing targeted massively parallel sequencing of the bisulphite-converted DNA. The results confirm that DNA methylation variability is enriched at loci with variability measures above the threshold attributable to technical variability or chance (Fig. 2b).

Variable DNA methylation is enriched at functional elements

The meta-epigenomic structure of purified human stem cell populations is defined at cis-regulatory sequences.

Wijetunga, N. A. et al. Nature Communications 10.1038/ncomms6195

With the genome annotated for functional elements in a cell type specific manner, we then tested the associations between genomic annotations and the loci with increased variability in DNA methylation. In Fig. 4a, we show the strongest associations for highly variable loci within clustered SOM space to be with H3K27ac, H3K27me3 and H3K4me1. Figure 4b also shows enrichment of variability at the TSS of RefSeq genes for feature 6 and immediately upstream at feature 4, both significant at P<0.001. Figure 4c shows enrichment in variability at the proximal part of CpG island shores for feature 6 and more extensively into the CpG island shore for feature 4, both also significant at P<0.001. A complementary SOM analysis using the published ChromHMM annotations of the human genome³² reveals consistent results (Supplementary Fig. 8). DNA methylation variability is therefore enhanced at candidate cis-regulatory sequences (promoters and enhancers) and the epigenetic variability previously observed for CpG island shores ^8,31 is reflective of this general characteristic of enhancers. Common SNPs are not enriched in density in any of the features (Supplementary Fig. 9) and therefore are unlikely to be the major reason for selective enrichment of epigenetic variability in these specific genomic contexts. If the variability of DNA methylation occurs at loci with potential transcriptional regulatory properties, it raises the question of whether variability occurs selectively near genes with specific transcriptional activities. We find that all levels of expression have comparable levels of epigenetic variability at promoter sequences. Genes expressed at the lowest levels in the genome are those with selective enrichment of epigenetic variability at nearby candidate enhancers (Fig. 5), a significant inverse relationship (P=10^-8) using the Jonckheere trend test.
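For readers unfamiliar with the Jonckheere trend test mentioned above, here is a minimal sketch of the Jonckheere-Terpstra statistic with its large-sample normal approximation. It assumes groups are supplied in the hypothesized order (for example, expression bins from lowest to highest) and uses the variance formula without a tie correction; the example values are invented and the authors' own implementation is not specified in this excerpt.

```python
from math import erfc, sqrt

def jonckheere_trend(groups):
    """groups: list of samples, ordered by the hypothesized trend.
    Returns (J, z, two_sided_p) using the normal approximation
    (variance formula assumes no ties)."""
    j_stat = 0.0
    for i in range(len(groups)):
        for j in range(i + 1, len(groups)):
            for x in groups[i]:
                for y in groups[j]:
                    if x < y:
                        j_stat += 1.0
                    elif x == y:
                        j_stat += 0.5  # ties count as half
    sizes = [len(g) for g in groups]
    n = sum(sizes)
    mean = (n * n - sum(s * s for s in sizes)) / 4.0
    var = (n * n * (2 * n + 3) - sum(s * s * (2 * s + 3) for s in sizes)) / 72.0
    z = (j_stat - mean) / sqrt(var)
    p = erfc(abs(z) / sqrt(2.0))  # two-sided normal tail
    return j_stat, z, p

# Invented variability scores at enhancers, binned by expression (low -> high).
bins = [[0.9, 0.8, 0.7], [0.6, 0.5, 0.4], [0.3, 0.2, 0.1]]
j_stat, z, p = jonckheere_trend(bins)
print(j_stat, z, p)
```

A negative z indicates a decreasing trend across the ordered groups, which is the direction of an inverse relationship between expression level and variability.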

Using variability information to quantify cell subtypes

The meta-epigenomic structure of purified human stem cell populations is defined at cis-regulatory sequences.

Wijetunga, N. A. et al. Nature Communications 10.1038/ncomms6195

Variability of gene transcription levels in cell samples from multiple individuals has allowed patterns to be identified that predict the numbers of cell subtypes present. We adapted one of the approaches used for these transcriptional variability studies, non-negative matrix factorization (NMF)^34,35, to our DNA methylation variability data to estimate the number of cell subtypes in our purified CD34+ HSPCs. In Fig. 6, we show the NMF output to predict ~13–20 cell subtypes, consistent with the ~15 distinct types of cells that have previously been described to express the CD34 cell surface marker³³.

Colorectal-cancer risk-associated SNPs in distal regulatory regions

Functional annotation of colon cancer risk SNPs.

Yao, L., Tak, Y. G., Berman, B. P. & Farnham, P. J. Nature Communications 10.1038/ncomms6114

Most of the SNPs in LD with the CRC GWAS tag SNPs cannot be easily linked to a specific gene because they do not fall within a coding region or a promoter-proximal region. However, it is possible that a relevant SNP associated with increased risk lies within a distal regulatory element of a gene whose function is important in cell growth or tumorigenicity. To address this possibility, we used the histone modification H3K27Ac to identify active regulatory regions throughout the genome of colon cancer cells or normal sigmoid colon cells. We used HCT116 H3K27Ac ChIP-seq data¹⁶ produced in our lab for the tumor cells and we obtained H3K27Ac ChIP-seq data for normal colon cells from the NIH Roadmap Epigenome Mapping Consortium. The ChIP-seq data for both the normal and tumor cells included two replicates. To demonstrate the high quality of the datasets, we called peaks on each replicate of H3K27Ac from HCT116 and each replicate of H3K27Ac from sigmoid colon using Sole-search ^{27, 28} and compared the peak sets from the two replicates using the ENCODE 40% overlap rule (after truncating both lists to the same number, 80% of the top 40% of one replicate must be found in the other replicate and vice versa). After determining that the HCT116 and sigmoid colon datasets were of high quality (Supplementary Figure 4), we merged the two replicates from HCT116 and separately merged the two replicates from sigmoid colon and called peaks on the two merged datasets; see Supplementary Data 2 for a list of all ChIP-seq peaks. Using the merged peak lists from each of the samples as biofeatures in FunciSNP, we determined that 746 of the 4894 SNPs that were in LD with a tag SNP at r²>0.1 were located in H3K27Ac regions identified in either the HCT116 or sigmoid colon peak sets; of these, 270 SNPs had an r²>0.5 with a tag SNP (Figure 1 and Supplementary Figure 5).
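The replicate-consistency rule described above can be sketched as follows. This is an illustrative reading of the rule as stated in the text (truncate to equal length, then require 80% of the top 40% of each replicate to overlap a peak in the other), not the Sole-search or ENCODE code; the peak tuples, function names, and the one-base overlap criterion are assumptions.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def fraction_of_top_found(rep_a, rep_b, top_fraction=0.4):
    """rep_a, rep_b: peak lists sorted best-first, already of equal length.
    Returns the fraction of rep_a's top peaks overlapping any rep_b peak."""
    n_top = max(1, int(len(rep_a) * top_fraction))
    top = rep_a[:n_top]
    found = sum(1 for peak in top if any(overlaps(peak, other) for other in rep_b))
    return found / n_top

def passes_overlap_rule(rep_a, rep_b, top_fraction=0.4, required=0.8):
    """Apply the rule in both directions after truncating to the same length."""
    n = min(len(rep_a), len(rep_b))
    a, b = rep_a[:n], rep_b[:n]
    return (fraction_of_top_found(a, b, top_fraction) >= required
            and fraction_of_top_found(b, a, top_fraction) >= required)

# Invented peaks, ranked best-first within each replicate.
rep1 = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 150),
        ("chr2", 900, 1000), ("chr3", 10, 20)]
rep2 = [("chr1", 150, 250), ("chr1", 550, 650), ("chr2", 100, 120),
        ("chr3", 15, 30), ("chr5", 1, 2)]
print(passes_overlap_rule(rep1, rep2))
```

The nested scan is quadratic in the number of peaks; for genome-scale peak lists a sorted sweep or interval tree would be the natural replacement.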

A comparison of the H3K27Ac peaks from normal and tumor cells indicated that the patterns are very similar; in fact, ~24,000 H3K27Ac peaks are in common in the normal and tumor cells. However, there are clearly some peaks unique to normal and some peaks unique to the tumor cells. Therefore, we separately analyzed the normal and tumor H3K27Ac ChIP-seq peaks as different sets of biofeatures using FunciSNP (Figure 1B). Of the 746 SNPs, 236 were located in a H3K27Ac site common to both normal and tumor cells, whereas 140 were unique to tumor and 370 were unique to normal cells. Visual inspection of the SNPs and peaks using the UCSC genome browser showed that many of the identified enhancers harbored multiple correlated SNPs. Reduction of the number of SNPs to the number of different H3K27Ac sites resulted in 47 common, 41 tumor-specific, and 111 normal-specific regions. Visual inspection also showed that some of the H3K27Ac genomic regions corresponded to promoter regions (Supplementary Figure 4). Because promoter regions having correlated SNPs were already identified using TSS regions (see above), we eliminated the promoter-proximal H3K27Ac sites, resulting in 27 common, 32 tumor-specific, and 96 normal-specific distal H3K27Ac regions. As the next winnowing step, we selected only those enhancers having at least one SNP with an r²>0.5, leaving 18 common, 9 tumor-specific, and 41 normal-specific distal H3K27Ac regions. We noted that some of the identified regions corresponded to low-ranked H3K27Ac peaks. For our subsequent analyses, we wanted to limit our studies to robust enhancers that harbor correlated SNPs. Therefore, we visually inspected each of the genomic regions identified as having distal H3K27Ac peaks harboring a correlated SNP. To prioritize the distal regions for further analysis, we eliminated those for which the correlated SNP was on the edge of the region covered by the H3K27Ac signal or corresponded to a very low-ranked peak.
After inspection, we were left with a set of 27 distal H3K27Ac regions in which a correlated SNP (r²>0.5) was well within the boundaries of a robust peak (Figure 1B). To confirm our results, we repeated the analysis using H3K27Ac data from a different colon cancer cell line, SW480, identifying only one additional enhancer harboring risk SNPs for CRC. The genomic coordinates of each of these 28 enhancers, which are clustered in 9 genomic regions, are listed in Table 3 (see also Supplementary Table 1). Combining all data, enhancers in 5 of the 9 regions were identified in all 3 cell types and 8 of the 9 regions were identified in at least two of the cell types.

The effect of enhancer deletion on the transcriptome

Functional annotation of colon cancer risk SNPs.

Yao, L., Tak, Y. G., Berman, B. P. & Farnham, P. J. Nature Communications 10.1038/ncomms6114

The expression analyses described above provide a list of genes that potentially are regulated by the CRC risk-associated enhancers. However, it is possible that the enhancers regulate only a subset of those genes and/or the target genes are at a greater distance than was analyzed. One approach to identify targets of the CRC risk-associated enhancers would be to delete an enhancer from the genome and determine changes in gene expression. As an initial test of this method, we selected enhancer 7, located at 8q24. The region encompassing this enhancer has previously been implicated in regulating expression of MYC³¹, which is located 335 kb from enhancer 7. We introduced guide RNAs that flanked enhancer 7, along with Cas9, into HCT116 cells, and identified cells that showed deletion of the enhancer. We then performed expression analysis using gene expression arrays, identifying 105 genes whose expression was down-regulated in the cells having a deleted enhancer (Supplementary Data 5); the closest one was MYC, which was expressed 1.5 times higher in control vs. deleted cells (Figure 4).

Epigenomic annotation of genetic variants using the Roadmap Epigenome Browser.

Zhou, X. et al. Nature Biotechnology 10.1038/nbt.3158

Advances in next-generation sequencing platforms have reshaped the landscape of functional genomic and epigenomic research as well as human genetics studies. Annotation of noncoding regions in the genome with genomic and epigenomic data has facilitated the generation of new, testable hypotheses regarding the functional consequences of genetic variants associated with human complex traits^1,2. Large consortia, such as the US National Institutes of Health (NIH) Roadmap Epigenomics Consortium³ and ENCODE⁴, have generated tens of thousands of sequencing-based genome-wide data sets, creating a useful resource for the scientific community⁵. The WashU Epigenome Browser^6-8 continues to provide a platform for investigators to effectively engage with this resource in the context of analyzing their own data. Here, we describe the Roadmap Epigenome Browser (http://epigenomegateway.wustl.edu/browser/roadmap), which is based on the WashU Epigenome Browser and integrates data from both the NIH Roadmap Epigenomics Consortium and ENCODE in a visualization and bioinformatics tool that enables researchers to explore the tissue-specific regulatory roles of genetic variants in the context of diseases. The Browser takes advantage of over 10,000 epigenomic data sets it currently hosts, including 346 ‘complete epigenomes’, defined as tissues and cell types for which we have collected a complete set of DNA methylation, histone modification, open chromatin and other genomic datasets⁹. Data from both the NIH Roadmap Epigenomics and ENCODE resources are seamlessly integrated in the browser using a new Data Hub Cluster framework. Investigators can specify any number of SNP-associated regions and any type of epigenomic data, for which the browser automatically creates “virtual data hubs” through a shared hierarchical metadata annotation, retrieves the data, and performs real-time clustering analysis. 
Investigators interact with the Browser to determine the tissue specificity of the epigenetic state encompassing genetic variants in physiologically or pathogenically relevant cell types from normal or diseased samples.

We illustrate the epigenomic annotation of two noncoding SNPs, identified from genome-wide association studies of people with multiple sclerosis¹⁰, by clustering the histone H3K4me1 profile of SNP-harboring regions and RNA-seq signal of their closest genes across multiple primary tissues and cells (Fig. 1). Both SNPs lie within putative enhancer regions. Whereas rs307896 marks an enhancer common across cell types, rs756699 is located in an enhancer specific to immune cells and potentially targets TCF7, a T-cell-specific gene 3.8 kb downstream (Fig. 1). Thus, reference epigenomes provide important clues into the functional relevance of these genetic variants in the context of the pathophysiology of multiple sclerosis, including inflammation¹¹.

Investigators can also use the browser to identify co-variation of epigenomic, transcriptomic, and transcription factor binding profiles across cell types to predict relationships between regulatory sites and target genes. Additionally, investigators can explore multiple complete reference epigenomes in different browser panels in parallel using synchronized genomic coordinates or independent genomic coordinates. A variety of Epigenome Browser functions, including gene set view, genome juxtaposition, chromatin interaction display and statistical testing, can be applied to better engage with this epigenomic resource.

We also provide the means for investigators to build their own Data Hub Clusters of different scales and clone the browser on Amazon Cloud to visualize and analyze private data in the context of public data. These tools, along with the rapidly growing epigenomic datasets of human cells of different states, will facilitate the translation of genetic signals into molecular mechanisms, leading to prognostic, diagnostic and therapeutic advances.

Allelic imbalances in gene expression

Chromatin architecture reorganization during stem cell differentiation.

Dixon, J. R. et al. Nature 10.1038/nature14222

Previous studies of allele-resolved gene expression have identified widespread allelic imbalances in gene expression between alleles^{24-27, 33}. However, it remains unclear to what degree allele-biased gene expression varies among different lineages of a single individual. To address this, we re-analyzed haplotype-resolved mRNA-seq data and identified allelic biases in gene expression across the five H1 lineages. A total of 1,787 genes showed allelic bias in gene expression in one or more lineages studied here, representing ~24% of all testable genes (FDR 10%, Figure 4a). Most allelic differences in expression are not “on/off” events, but instead reflect biases in the level of expression from each allele (Figure 4b). Further, allele-biased genes include both lineage specific and constitutively expressed genes (Extended Data Figure 6c,d), and patterns of allelic bias can also be constitutive or cell-type variable (Figure 4c,d). Only in rare cases do genes switch expression from one allele to the other between cell types. As expected, genes subject to genomic imprinting are enriched among genes with allelic biases in expression (Figure 4e), though these represent ~1% of allele-biased genes (Figure 4f). While imprinted genes often occur in clusters, the majority of allele-biased gene expression is not clustered in the genome (Extended Data Figure 6e). Taken together, these data suggest that most allelic gene expression is due to mechanisms other than genomic imprinting. One possible regulatory mechanism that could give rise to allele-biased expression would be allelic bias in activity of cis-regulatory elements near these genes. Indeed, regions of the genome that show allele bias in histone acetylation, histone methylation, CTCF, and DNase I hypersensitivity are closer to allele-biased genes than randomly selected genomic regions (Figure 4g). 
Furthermore, allelic gene expression is strongly correlated with DNA methylation or chromatin modification state at promoters (Figure 4h,i). Of the 247 genes that contain heterozygous variants in their promoter regions and display biased transcription in at least one lineage, a majority exhibit allele-biased chromatin modifications or DNA methylation at the promoter (Figure 4h). Interestingly, 29% of the testable genes that have allele-biased expression show no evidence of allelic bias in chromatin state or DNA methylation at the promoter (Figure 4h), raising the possibility that elements outside of promoters may be responsible for the allelic gene expression.
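As a rough illustration of how allele-biased genes might be called at a 10% FDR, the sketch below tests each gene's haplotype-resolved read counts against a 50:50 expectation with an exact two-sided binomial test and then applies the Benjamini-Hochberg procedure. The choice of a binomial test is an assumption (the excerpt does not state which test the authors used), and the gene names and read counts are invented.

```python
from math import comb

def binomial_two_sided(ref_reads, alt_reads):
    """Exact two-sided binomial P value against a 50:50 allelic ratio."""
    n = ref_reads + alt_reads
    observed = comb(n, ref_reads)
    # Sum all outcomes that are no more likely than the observed one.
    tail = sum(comb(n, k) for k in range(n + 1) if comb(n, k) <= observed)
    return tail / 2 ** n

def benjamini_hochberg(pvalues, fdr=0.10):
    """Return booleans marking which tests pass at the given FDR."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= fdr * rank / m:
            cutoff_rank = rank
    passed = [False] * m
    for idx in order[:cutoff_rank]:
        passed[idx] = True
    return passed

# Invented (allele 1, allele 2) read counts for three genes in one lineage.
counts = {"geneA": (48, 52), "geneB": (80, 20), "geneC": (30, 10)}
pvals = [binomial_two_sided(a, b) for a, b in counts.values()]
for gene, biased in zip(counts, benjamini_hochberg(pvals, fdr=0.10)):
    print(gene, "allele-biased" if biased else "balanced")
```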

Motif disruption is correlated with H3K27ac variation

Predicting the human epigenome from DNA motifs.

Whitaker, J. W., Chen, Z. & Wang, W. Nature Methods 10.1038/nmeth.3065

A recent study of 19 individuals correlated sequence variation at known TF motif sites with variation in H3K27ac levels at overlapping peaks⁴³. Kasowski et al. found that H3K27ac variation in 32,886 peaks correlated with disruption of 662 known motifs by SNPs among the 19 individuals, and significant association was found in 32% of regions (significance determined using Spearman’s rank and label permutation⁴³). To demonstrate the power of the Epigram pipeline, we repeated the analyses done by Kasowski et al. by first running Epigram on H3K27ac, which resulted in a full model featuring 133 motifs that are predictive of H3K27ac (Online Methods). Epigram's motifs were significantly correlated in 62% of regions using a motif set that is ~20% the size of that used by Kasowski et al. (662 known motifs). Thus, Epigram discovers motifs that are significantly correlated with H3K27ac variation in 30 percentage points more regions (62% versus 32%) and that represent novel binding patterns for regulators of H3K27ac. Furthermore, Kasowski et al.⁴³ showed 20 TFs that are significantly correlated within ~4,500 variable regions, whereas the motifs from Epigram’s 20-motif model are significantly correlated within 7,006 variable regions (Fig. 5). One of Epigram’s 20 motifs matches the known IKZF1 motif, which has been shown to target chromatin remodeling and deacetylation complexes during lymphocyte differentiation⁴⁴. In addition, we also found that 3 of these 20 motifs match motif groups identified to be associated with H3K27ac in H1, NPC, MSC and TBL. Taken together, Epigram is able to explain significantly more variants while using fewer motifs than the Kasowski et al. analysis.
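The significance procedure named above (Spearman's rank correlation with label permutation) can be sketched for a single region as follows. This is a generic illustration under stated assumptions, not the Kasowski et al. or Epigram code: the inputs are a hypothetical per-individual motif score and H3K27ac signal, and the number of permutations and the seed are arbitrary.

```python
import random

def ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)  # undefined if either input is constant

def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def permutation_pvalue(x, y, n_perm=2000, seed=0):
    """Two-sided P value from shuffling the labels of y."""
    rng = random.Random(seed)
    observed = abs(spearman(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(spearman(x, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Invented data for 19 individuals: motif match score vs H3K27ac signal.
motif_score = [float(i) for i in range(19)]
h3k27ac = [2.0 * s + 1.0 for s in motif_score]
print(spearman(motif_score, h3k27ac), permutation_pvalue(motif_score, h3k27ac))
```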

The impact of chromatin on local mutation density in cancer cells

Cell-of-origin chromatin organization shapes the mutational landscape of cancer.

Polak, P. et al. Nature 10.1038/nature14221

The comparison of individual epigenomic features with local mutation density revealed that chromatin marks corresponding to the tumour’s cell type of origin are more strongly associated with local mutation density than marks corresponding to unrelated cell types. For example, DHS marks from melanocytes explained a substantially larger fraction of the variance in melanoma mutation density than DHS marks from other cell types, even from the same tissue (skin) (Fig. 1b). As another example, even though H3K4me1 marks in melanocytes and hepatocytes are highly correlated (r = 0.8), the distribution of mutations in liver cancer followed the levels of H3K4me1 in hepatocytes but not in melanocytes, whereas melanoma mutations correlated with the levels of H3K4me1 in melanocytes but not in hepatocytes (Fig. 1c).

This initial observation suggested that the impact of chromatin on local mutation density is highly cell-type specific. The comprehensive representation of different cell types in the Epigenome Roadmap could thus enable an improved prediction accuracy of mutations compared to previous studies. To rigorously quantify the contribution of different chromatin marks and gene expression to regional mutation density, and the extent of cell type specificity, we used Random Forest regression (Methods). Remarkably, epigenetic marks, together with replication timing measured in ENCODE consortium cell lines¹⁸, collectively accounted for 74–86% of the variance in mutation density in seven cancer types (Fig. 2a). In glioblastoma, for which fewer mutations were available for the analysis, 55% of the variance in mutation density could be explained. This is substantially higher than in earlier studies⁴ and indicates that, at least for these cancer types, we have identified a set of epigenetic variables and cell types that almost fully predict the mutational variability along the genome. This enhanced prediction accuracy was not simply due to the larger size of the training data relative to previous studies, as the predictive ability dropped by only ~2–6% when only 10% of the data was used (Extended Data Fig. 3).

Taken together, the above results strongly suggest that the cell of origin of an individual tumour sample could be predicted from its mutation pattern alone. Mutation profiles of individual samples cluster according to cancer type and, consequently, cell of origin (Fig. 4a). We developed a straightforward predictor based on enrichment of epigenomic variables from a single cell type among the top 20 variables selected by the Random Forest analysis. This approach classified 88% of melanoma, colorectal, liver, multiple myeloma, oesophageal and glioblastoma cancer genomes to melanocytes, colonic mucosa, liver, haematopoietic, stomach mucosa and brain tissues, respectively (Fig. 4b). Thus, mutational patterns contain sufficient information for identifying the cell type of origin of a tumour. We propose that sequencing the DNA of a tumour of unknown primary origin can allow the precise pinpointing of the cell type of origin of that tumour.
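The idea behind the predictor described above can be sketched very simply: given the variables ranked by importance, take the top 20 and report the cell type that contributes most of them. This is a simplification that uses a plain count rather than a formal enrichment statistic, and it assumes a hypothetical importance ranking has already been produced by a regression model; the variable list below is invented.

```python
from collections import Counter

def predict_cell_of_origin(ranked_variables, top_n=20):
    """ranked_variables: list of (cell_type, mark) tuples, most important first.
    Returns (cell_type, count) for the cell type most represented in the top_n."""
    top = ranked_variables[:top_n]
    counts = Counter(cell_type for cell_type, _mark in top)
    return counts.most_common(1)[0]

# Invented ranking for one tumour sample: mostly melanocyte marks at the top.
ranking = (
    [("melanocyte", m) for m in ("DHS", "H3K4me1", "H3K27ac", "H3K9me3", "H3K36me3")] * 3
    + [("hepatocyte", "H3K4me1"), ("fibroblast", "DHS"), ("keratinocyte", "DHS")]
    + [("hepatocyte", "H3K9me3"), ("fibroblast", "H3K27ac")]
    + [("brain", "DHS")] * 10  # falls outside the top 20
)
print(predict_cell_of_origin(ranking))
```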

Large-scale epigenome imputation improves disease variant enrichment

Large-scale imputation of epigenomic datasets for systematic annotation of diverse human tissues.

Ernst, J. & Kellis, M. Nature Biotechnology 10.1038/nbt.3157

As epigenomic maps have recently emerged as an unbiased approach for discovering disease-relevant tissues and cell types^3,32, we also evaluated the impact of epigenome imputation on the interpretation of trait-associated variants from GWAS. We quantified the enrichment (positive or negative) of trait-associated variants from the NHGRI GWAS catalog^33 in both observed and imputed datasets for each mark. We evaluated enrichments at two levels: in aggregate across all studies, based on the area under the ROC curve up to a 5% false positive rate (AUC5%) for the signal-level recovery of trait-associated SNPs; and for individual studies, based on mark signal rank differences between each study's SNPs and all other SNPs in the GWAS catalog (see Methods). For the individual studies, we counted both the number of studies with a significant signal rank difference in at least one sample and the total number of significant study-sample pairs, at varying p-value thresholds. We then compared both counts to those obtained for randomized versions of the GWAS catalog, which also enabled us to estimate a false discovery rate for each p-value threshold (Table S2, see Methods).
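To make the AUC5% measure concrete, the sketch below computes the area under a ROC curve restricted to false positive rates of at most 5%. It is an illustrative implementation only: the function name, the choice to leave the area unnormalized (so a perfect ranking scores 0.05) and the toy signal values are assumptions, and the exact definition used in the paper is given in its Methods.

```python
def partial_auc(pos_scores, neg_scores, max_fpr=0.05):
    """Unnormalized area under the ROC curve for FPR in [0, max_fpr].

    pos_scores: mark signal at trait-associated SNPs (positives)
    neg_scores: mark signal at background positions (negatives)
    """
    n_pos, n_neg = len(pos_scores), len(neg_scores)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative score")
    labeled = [(s, 1) for s in pos_scores] + [(s, 0) for s in neg_scores]
    labeled.sort(key=lambda t: -t[0])  # highest signal first

    area, tp, fp = 0.0, 0, 0
    prev_fpr, prev_tpr = 0.0, 0.0
    i = 0
    while i < len(labeled):
        # Consume all entries tied at this score as a single ROC step
        j = i
        while j < len(labeled) and labeled[j][0] == labeled[i][0]:
            tp += labeled[j][1]
            fp += 1 - labeled[j][1]
            j += 1
        fpr, tpr = fp / n_neg, tp / n_pos
        if fpr >= max_fpr:
            # Linearly interpolate the curve to the cut-off and stop
            frac = (max_fpr - prev_fpr) / (fpr - prev_fpr)
            tpr_cut = prev_tpr + (tpr - prev_tpr) * frac
            return area + (max_fpr - prev_fpr) * (prev_tpr + tpr_cut) / 2
        area += (fpr - prev_fpr) * (prev_tpr + tpr) / 2  # trapezoid rule
        prev_fpr, prev_tpr = fpr, tpr
        i = j
    return area


# Toy data: every positive outranks every negative (perfect recovery)
perfect = partial_auc([0.9, 0.8], [i / 100 for i in range(20)])
# Toy data: all scores tied (no better than chance)
chance = partial_auc([0.5, 0.5], [0.5] * 20)
```

Restricting the area to low false positive rates emphasizes how well the strongest signal values recover trait-associated SNPs, which is the regime of practical interest when only a small fraction of the genome can be followed up.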

For all Tier-1 active marks, imputed data resulted in substantially greater recovery of SNPs in the GWAS catalog (Fig. S23) than the observed data, and more significant enrichments for both the number of studies and the number of study-sample pairs, across all tested significance thresholds (Fig. 4a, Fig. S24-S25). In addition, the imputed data yielded a stronger enrichment for each enriched sample in the large majority of cases for nearly all marks (Fig. 4b, Fig. S26). We confirmed that the actual GWAS catalog yielded more significant associations than randomized versions, for both the observed and imputed data (Fig. 4a, Fig. S24-S25). Imputed data performance was substantially higher than that of the average mark signal across all available samples (Fig. S24b), indicating that the increased performance was not simply due to averaging multiple samples. We also confirmed that the most significantly enriched samples for a given study were generally biologically relevant for active marks. For H3K27ac, for example, liver was the enriched sample for various cholesterol phenotypes, immune-related cells for various immune-related disorders, and colonic mucosa for ulcerative colitis, among many other biologically meaningful enrichments (Fig. 4c-f, Table S2).
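The comparison against randomized versions of the GWAS catalog can be turned into a false discovery rate estimate along the following lines. This is a sketch of one common empirical estimator (mean number of significant results in the randomized catalogs divided by the number in the actual catalog); the function name and the p-values are hypothetical, and the paper's Methods may define the estimate differently.

```python
def empirical_fdr(observed_pvalues, randomized_pvalue_sets, threshold):
    """Estimate the FDR at a p-value threshold from randomized catalogs.

    observed_pvalues: p-values from the actual GWAS catalog
    randomized_pvalue_sets: one list of p-values per randomized catalog
    Returns None when nothing in the actual catalog passes the threshold.
    """
    n_observed = sum(p <= threshold for p in observed_pvalues)
    if n_observed == 0:
        return None
    # Average count of "discoveries" expected by chance alone
    null_counts = [sum(p <= threshold for p in s) for s in randomized_pvalue_sets]
    mean_null = sum(null_counts) / len(null_counts)
    return min(1.0, mean_null / n_observed)


# Hypothetical p-values for study-sample pairs
observed = [1e-6, 1e-5, 0.001, 0.2, 0.5]
randomized = [
    [1e-5, 0.3, 0.4, 0.6, 0.9],
    [0.02, 0.3, 0.5, 0.7, 0.8],
]
fdr = empirical_fdr(observed, randomized, threshold=1e-3)
```

Here three pairs pass the threshold in the actual catalog against an average of 0.5 in the randomized ones, giving an estimated FDR of about 17%.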

These results help validate the biological relevance of imputed datasets using an orthogonal annotation source, and illustrate their potential as a resource for interpreting GWAS results.

Figure 1: Epigenomic enrichments of genetic variants associated with diverse traits.

Tissue-specific H3K4me1 peak enrichment for genetic variants associated with diverse traits. Circles denote the reference epigenome (column) of highest enrichment for SNPs reported by a given study (row), defined by trait and publication (PubMed identifier, PMID). The tissue (Abbrev) and p-value (-log10) of highest enrichment are shown. Only rows and columns containing a value meeting an FDR of 2% are shown (the full matrix for all studies showing at least 2% FDR is in Extended Data 11-12).

