Complex trait variants are enriched in diverse epigenomic marks

Integrative analysis of 111 reference human epigenomes.

Roadmap Epigenomics Consortium et al. Nature 10.1038/nature14248

For enhancer-associated H3K4me1 peaks, we found 58 studies (Fig. 9a, Extended Data 11a) with significant enrichments in at least one tissue at 2% FDR (hypergeometric P < 10^-3.9). Upon manual curation, the enriched cell types were consistent with our current understanding of disease-relevant tissues for the vast majority of cases. For example, diverse immune traits were enriched in immune cell enhancers, including rheumatoid arthritis, celiac disease, type 1 diabetes, systemic lupus erythematosus, chronic lymphocytic leukemia, allergy, multiple sclerosis, and Graves’ disease 75-81. A large number of metabolic trait variants are enriched in liver enhancer marks, including LDL, HDL, total cholesterol, lipid metabolism phenotypes, and metabolite levels 82,83. Fasting glucose was most enriched for pancreatic islet enhancer marks, and insulin-like growth factors in placenta, consistent with their endocrine regulatory roles 84,85. Several cardiac traits were enriched in heart tissue enhancers, including the PR heart repolarization interval, blood pressure, and aortic root size. Interestingly, inflammatory bowel disease and ulcerative colitis variants show enrichment in both immune and gastrointestinal enhancer marks, suggesting dysregulation of both organs may underlie disease predisposition. Both attention deficit hyperactivity disorder and adiponectin levels were enriched in brain regions, consistent with causal roles in brain dysregulation 86,87. In contrast, late-onset Alzheimer's disease variants were enriched in immune cell enhancers, rather than brain, consistent with recent evidence of a possible immune and inflammatory basis 88-90.
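
The hypergeometric enrichment mentioned above can be illustrated with a minimal sketch. It assumes a simple setup in which every catalog SNP either does or does not fall in an enhancer peak; the function and argument names are hypothetical, and the excerpt does not describe the exact background set or any SNP pruning used in the study.

```python
from math import comb

def hypergeom_enrichment_p(total_snps, snps_in_peaks, trait_snps, trait_snps_in_peaks):
    """Upper-tail hypergeometric P-value.

    Probability of drawing at least `trait_snps_in_peaks` peak-overlapping SNPs
    when `trait_snps` SNPs are sampled without replacement from `total_snps`,
    of which `snps_in_peaks` overlap a peak.
    """
    upper = min(snps_in_peaks, trait_snps)
    denom = comb(total_snps, trait_snps)
    tail = sum(
        comb(snps_in_peaks, k) * comb(total_snps - snps_in_peaks, trait_snps - k)
        for k in range(trait_snps_in_peaks, upper + 1)
    )
    return tail / denom

# Toy numbers (not from the paper): 10,000 catalog SNPs, 500 in peaks,
# a trait with 40 SNPs of which 9 fall in peaks.
print(hypergeom_enrichment_p(10_000, 500, 40, 9))
```

Exact integer arithmetic with `math.comb` avoids the underflow that floating-point binomial terms would hit for large catalogs.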

Utilizing epigenomic annotations to gain insights into Alzheimer’s-disease-associated loci

Conserved epigenomic signals in mice and humans reveal immune basis of Alzheimer’s disease.

Gjoneska, E. et al. Nature 10.1038/nature14252

We next utilized the epigenomic annotations of increased-activity enhancer orthologs to gain insights into AD-associated loci (Supplementary Table S7). Among the 20 genome-wide significant AD-associated loci4, 11 contain no protein-altering SNPs in linkage disequilibrium (LD), indicating they may play non-coding roles. Of these, 5 localize within increased-level enhancer orthologs, including two well-established GWAS loci (PICALM, BIN1), and three loci (INPP5D, CELF1/SPI1, PTK2B) only recently recognized as significant by combining all AD cohorts.

For INPP5D (Fig. 3a), a known regulator of inflammation28, the most significant variants localize within an increased-level enhancer ortholog, which also shows CD14+ enhancer activity. In the CELF1 locus (Fig. 3b) a large region of association spans several genes, but the strongest genetic signal (p = 2×10^-6) localizes upstream of SPI1 (PU.1), and specifically within an increased-level enhancer ortholog that is also active in immune cells. We confirmed that the AD-associated C-T substitution, rs1377416, in the SPI1 enhancer leads to increased in vitro enhancer activity in murine BV-2 microglia cells using a luciferase reporter assay (Fig. 3d). In addition, the AD-associated SNP rs55876153 near SPI1, which overlaps an increased-level mouse enhancer ortholog, is in strong linkage disequilibrium (LD=0.89, see Methods) with a known SPI1 eQTL, rs1083869825, even though it did not significantly alter enhancer activity in the luciferase assay.

Outside known GWAS loci, an additional 22 weakly-associated regions (3.9 fold, p < 4.9×10^-7) contain variants within increased-level enhancer orthologs (Supplementary Table S7), of which 17 lack protein-altering variants in LD (R^2 < 0.4), providing strong candidates for directed experiments. One such example includes ABCA1 (p = 6.9×10^-5, Fig. 3c), a paralog of AD-associated ABCA7 and encoding a glial-expressed transporter that influences APOE metabolism in the central nervous system29. The region lacks protein-altering variants and all five SNPs in the cluster of association lie specifically within an increased-enhancer ortholog, which is also active in CD14+ immune cells and, to a lesser extent, in human hippocampus and fetal brain.
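
The LD values quoted in this excerpt are squared correlations between alleles at two sites. As a minimal sketch of that quantity, the function below computes the standard r^2 statistic from phased haplotypes. The input encoding (one `(a, b)` pair of 0/1 alleles per chromosome) and the function name are assumptions for illustration; the study's own LD calculation is described in its Methods.

```python
def ld_r2(haplotypes):
    """Squared allelic correlation r^2 between two biallelic sites.

    `haplotypes` is a list of (a, b) pairs, one per phased chromosome,
    with each allele coded 0 or 1.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n          # frequency of allele 1 at site A
    p_b = sum(b for _, b in haplotypes) / n          # frequency of allele 1 at site B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                             # LD coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when either site is monomorphic")
    return d * d / denom

# Toy haplotypes (not real data): the two sites mostly travel together.
toy = [(1, 1)] * 45 + [(0, 0)] * 45 + [(1, 0)] * 5 + [(0, 1)] * 5
print(ld_r2(toy))
```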

Fine-mapped genetic architecture of disease

Genetic and epigenetic fine mapping of causal autoimmune disease variants.

Farh, K. K.-H. et al. Nature 10.1038/nature13835

Prior studies that have integrated GWAS with epigenomic features focused on lead SNPs or multiple associated SNPs within a locus, of which only a small minority reflects causal variants10,16–19,21. Although these studies demonstrated enrichments within enhancer-like regulatory elements, they could not with any degree of certainty pinpoint the specific elements or processes affected by the causal variants. To overcome this limitation, we leveraged dense genotyping data to refine a statistical model for predicting causal SNPs from genetic data alone. Rare recombination events within haplotypes can provide information on the identity of the causal SNP, provided sufficient genotyping density and sample size. We therefore examined a cohort of 14,277 cases with multiple sclerosis and 23,605 healthy controls genotyped using the Immunochip, which comprehensively covers 1000 Genomes Project SNPs22 within 186 loci associated with autoimmunity20. We developed an algorithm, Probabilistic Identification of Causal SNPs (PICS), that estimates the probability that an individual SNP is a causal variant given the haplotype structure and observed pattern of association at the locus (Methods, Extended Data Figs 1–4).

We next generalized PICS to analyse 21 autoimmune diseases, using Immunochip data when they were available or imputation to the 1000 Genomes Project22 when they were not (Methods; Supplementary Table 1). We mapped 636 autoimmune GWAS signals to 4,950 candidate causal SNPs (mean probability of representing the causal variant responsible for the GWAS signal: ~10%). PICS indicates that index SNPs reported in the GWAS catalogue have on average only a 5% chance of representing a causal SNP. Rather, GWAS catalogue index SNPs are typically some distance from the PICS lead SNP (median 14 kb), and many are not in tight LD (Fig. 1d, Extended Data Fig. 5). PICS identified a single most likely causal SNP (>75% probability) at 12% of loci linked to autoimmunity. However, most GWAS signals could not be fully resolved due to LD and thus contain several candidate causal SNPs (Fig. 1e).

To confirm the functional significance of fine-mapped SNPs, we compared PICS SNPs against a strict background of random SNPs drawn from the same loci. Candidate causal SNPs derived by PICS were strongly enriched for protein-coding (missense, nonsense, frameshift) changes, which account for 14% of the predicted causal variants compared to just 4% of the random SNPs. Modest enrichments over the locus background were also observed for synonymous substitutions (5%), 3′ UTRs (3%), and splice junctions (0.2%) (Fig. 1f). Although these results support the efficacy of PICS for identifying causal variants, 90% of GWAS hits for autoimmune diseases remain unexplained by protein-coding variants. Candidate causal SNPs and the PICS algorithm are available through an accompanying online portal (http://www.broadinstitute.org/pubs/finemapping).

Cell-type-specific signatures of complex disease

Genetic and epigenetic fine mapping of causal autoimmune disease variants.

Farh, K. K.-H. et al. Nature 10.1038/nature13835

Along with the 21 autoimmune diseases, we predicted causal SNPs for 18 other traits and diseases (Methods). Comparing SNP locations with chromatin maps for 56 cell types revealed the cell type-specificities of cis-regulatory elements that coincide with PICS SNPs, thus predicting cell types contributing to each phenotype (Fig. 3). The patterns are more informative than the expression patterns of genes targeted by coding GWAS hits (Extended Data Fig. 8). Notable examples include SNPs associated with Alzheimer’s disease and migraine, which map to enhancers and promoters active in brain tissues, and SNPs associated with fasting blood glucose, which map to elements active in pancreatic islets. Nearly all of the autoimmune diseases preferentially mapped to enhancers and promoters active in CD4+ T-cell subpopulations. However, a few diseases, such as systemic lupus erythematosus, Kawasaki disease, and primary biliary cirrhosis, preferentially mapped to B-cell elements. Notably, ulcerative colitis also mapped to gastrointestinal tract elements, consistent with its bowel pathology. Although the primary signature of type 1 diabetes SNPs is in T-cell enhancers, there is also enrichment in pancreatic islet enhancers (P < 10^-7). Thus, although immune cell effects may be shared among autoimmune diseases, genetic variants affecting target organs such as bowel and pancreatic islets may shape disease-specific pathology.

Variably DNA methylated loci in CD34+ haematopoietic stem and progenitor cells

The meta-epigenomic structure of purified human stem cell populations is defined at cis-regulatory sequences.

Wijetunga, N. A. et al. Nature Communications 10.1038/ncomms6195

We used two sources of DNA methylation data, one from the Roadmap Epigenomics programme, publicly available reduced representation bisulphite sequencing (RRBS)23 data on mobilized CD34+ HSPCs from 7 adults, and the second generated by our group, using CD34+ HSPCs isolated from cord blood from 29 phenotypically normal neonates assayed using the HELP-tagging assay24. Despite the differences in how each of these assays measures DNA methylation, both showed increased variability at loci with intermediate methylation values (Fig. 1), consistent with previous observations16.

We continued our analyses based on the HELP-tagging data, which are derived from a greater number of samples and from neonates, who have less potential for manifesting age-associated variability than adults25. As HELP-tagging is based on the use of methylation-sensitive restriction enzymes24, we were able to use the results from the methylation-insensitive MspI control enzyme to estimate the degree of technical variability, and a permutation analysis of the HpaII-derived data also showed enrichment of the observed variability over expected background levels (Supplementary Fig. 2a). A number of loci with differing degrees of variability were chosen for bisulphite PCR, using seven of the samples that had been tested using HELP-tagging as well as eight independent samples. These amplicons were combined for each individual and used to generate Illumina libraries, allowing targeted massively parallel sequencing of the bisulphite-converted DNA. The results confirm that DNA methylation variability is enriched at loci with variability measures above the threshold attributable to technical variability or chance (Fig. 2b).

Variable DNA methylation is enriched at functional elements

The meta-epigenomic structure of purified human stem cell populations is defined at cis-regulatory sequences.

Wijetunga, N. A. et al. Nature Communications 10.1038/ncomms6195

With the genome annotated for functional elements in a cell type specific manner, we then tested the associations between genomic annotations and the loci with increased variability in DNA methylation. In Fig. 4a, we show the strongest associations for highly variable loci within clustered SOM space to be with H3K27ac, H3K27me3 and H3K4me1. Figure 4b also shows enrichment of variability at the TSS of RefSeq genes for feature 6 and immediately upstream at feature 4, both significant at P < 0.001. Figure 4c shows enrichment in variability at the proximal part of CpG island shores for feature 6 and more extensively into the CpG island shore for feature 4, both also significant at P < 0.001. A complementary SOM analysis using the published ChromHMM annotations of the human genome32 reveals consistent results (Supplementary Fig. 8). DNA methylation variability is therefore enhanced at candidate cis-regulatory sequences (promoters and enhancers) and the epigenetic variability previously observed for CpG island shores 8,31 is reflective of this general characteristic of enhancers. Common SNPs are not enriched in density in any of the features (Supplementary Fig. 9) and therefore are unlikely to be the major reason for selective enrichment of epigenetic variability in these specific genomic contexts. If the variability of DNA methylation occurs at loci with potential transcriptional regulatory properties, it raises the question of whether variability occurs selectively near genes with specific transcriptional activities. We find that all levels of expression have comparable levels of epigenetic variability at promoter sequences. Genes expressed at the lowest levels in the genome are those with selective enrichment of epigenetic variability at nearby candidate enhancers (Fig. 5), a significant inverse relationship (P = 10^-8) using the Jonckheere trend test.
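
The Jonckheere trend test named above asks whether values rise monotonically across ordered groups. The sketch below computes the Jonckheere-Terpstra statistic and, as an assumption made for simplicity, obtains a one-sided P-value by permutation rather than by the normal approximation; the excerpt does not say which form the authors used. To test an inverse relationship such as the one described, the groups would be ordered from highest to lowest expression so that increasing variability is the alternative.

```python
import random

def jt_statistic(groups):
    """Jonckheere-Terpstra statistic: over all ordered group pairs (i < j),
    count value pairs with x < y, scoring ties as 0.5."""
    total = 0.0
    for i in range(len(groups)):
        for j in range(i + 1, len(groups)):
            for x in groups[i]:
                for y in groups[j]:
                    if x < y:
                        total += 1.0
                    elif x == y:
                        total += 0.5
    return total

def jt_permutation_p(groups, n_perm=2000, seed=0):
    """One-sided permutation P-value for an increasing trend across groups."""
    rng = random.Random(seed)
    observed = jt_statistic(groups)
    pooled = [v for g in groups for v in g]
    sizes = [len(g) for g in groups]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        if jt_statistic(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return observed, (hits + 1) / (n_perm + 1)

# Toy variability scores (not from the paper) for expression bins
# ordered from highest to lowest expression.
bins = [[0.10, 0.12, 0.11], [0.14, 0.13, 0.16], [0.19, 0.21, 0.18]]
print(jt_permutation_p(bins))
```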

Using variability information to quantify cell subtypes

The meta-epigenomic structure of purified human stem cell populations is defined at cis-regulatory sequences.

Wijetunga, N. A. et al. Nature Communications 10.1038/ncomms6195

Variability of gene transcription levels in cell samples from multiple individuals has allowed patterns to be identified that predict the numbers of cell subtypes present. We adapted one of the approaches used for these transcriptional variability studies, non-negative matrix factorization (NMF)34,35, to our DNA methylation variability data to estimate the number of cell subtypes in our purified CD34+ HSPCs. In Fig. 6, we show the NMF output to predict ~13–20 cell subtypes, consistent with the ~15 distinct types of cells that have previously been described to express the CD34 cell surface marker33.
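
For readers unfamiliar with NMF, the following is a minimal pure-Python sketch of the factorization step, using the classic multiplicative updates for squared error. It is an illustration under assumptions, not the authors' pipeline: the excerpt does not state which NMF variant or which rank-selection criterion produced the ~13–20 estimate, and the matrix here is a toy example.

```python
import random

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frobenius_error(v, w, h):
    """Squared reconstruction error ||V - WH||^2."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, rank, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (loci x samples) as W @ H."""
    rng = random.Random(seed)
    m, n = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(m)]
    h = [[rng.random() + eps for _ in range(n)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(n)]
             for a in range(rank)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(rank)]
             for i in range(m)]
    return w, h

# Toy matrix built from two underlying patterns (not real methylation data).
toy = [[1, 0, 2], [0, 1, 1], [2, 1, 5], [1, 1, 3]]
w, h = nmf(toy, rank=2)
print(frobenius_error(toy, w, h))
```

In practice the rank would be scanned over a range and a stability or fit criterion used to choose among candidates; that choice is left out here because the excerpt does not specify it.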

Colorectal-cancer risk-associated SNPs in distal regulatory regions

Functional annotation of colon cancer risk SNPs.

Yao, L., Tak, Y. G., Berman, B. P. & Farnham, P. J. Nature Communications 10.1038/ncomms6114

Most of the SNPs in LD with the CRC GWAS tag SNPs cannot be easily linked to a specific gene because they do not fall within a coding region or a promoter-proximal region. However, it is possible that a relevant SNP associated with increased risk lies within a distal regulatory element of a gene whose function is important in cell growth or tumorigenicity. To address this possibility, we used the histone modification H3K27Ac to identify active regulatory regions throughout the genome of colon cancer cells or normal sigmoid colon cells. We used HCT116 H3K27Ac ChIP-seq data 16 produced in our lab for the tumor cells and we obtained H3K27Ac ChIP-seq data for normal colon cells from the NIH Roadmap Epigenome Mapping Consortium. The ChIP-seq data for both the normal and tumor cells included two replicates. To demonstrate the high quality of the datasets, we called peaks on each replicate of H3K27Ac from HCT116 and each replicate of H3K27Ac from sigmoid colon using Sole-search 27, 28 and compared the peak sets from the two replicates using the ENCODE 40% overlap rule (after truncating both lists to the same number, 80% of the top 40% of one replicate must be found in the other replicate and vice versa). After determining that the HCT116 and sigmoid colon datasets were of high quality (Supplementary Figure 4), we merged the two replicates from HCT116 and separately merged the two replicates from sigmoid colon and called peaks on the two merged datasets; see Supplementary Data 2 for a list of all ChIP-seq peaks. Using the merged peak lists from each of the samples as biofeatures in FunciSNP, we determined that 746 of the 4894 SNPs that were in LD with a tag SNP at r^2 > 0.1 were located in H3K27Ac regions identified in either the HCT116 or sigmoid colon peak sets; of these, 270 SNPs had an r^2 > 0.5 with a tag SNP (Figure 1 and Supplementary Figure 5).
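
The replicate check described above can be sketched as follows. This is a minimal illustration under stated assumptions: peaks are `(chrom, start, end, score)` tuples with half-open coordinates, truncation keeps the highest-scoring peaks, and "found in the other replicate" means any positional overlap. The function names are hypothetical and this is not the Sole-search or ENCODE implementation.

```python
def overlaps(peak, others):
    """True if `peak` overlaps any peak in `others` on the same chromosome."""
    chrom, start, end = peak[0], peak[1], peak[2]
    return any(c == chrom and s < end and start < e for c, s, e, *_ in others)

def passes_overlap_rule(rep1, rep2, top_fraction=0.4, required=0.8):
    """Check that `required` of the top `top_fraction` of each replicate's
    peaks is found in the other replicate, after truncating both lists
    to the same length."""
    n = min(len(rep1), len(rep2))
    if n == 0:
        return False
    a = sorted(rep1, key=lambda p: p[3], reverse=True)[:n]
    b = sorted(rep2, key=lambda p: p[3], reverse=True)[:n]
    top_n = max(1, int(n * top_fraction))

    def fraction_found(src, dst):
        top = src[:top_n]
        return sum(overlaps(p, dst) for p in top) / len(top)

    return fraction_found(a, b) >= required and fraction_found(b, a) >= required

# Toy replicates (not real data): the second is the first shifted by 10 bp.
rep_a = [("chr1", 100 * i, 100 * i + 50, 10 - i) for i in range(1, 6)]
rep_b = [("chr1", 100 * i + 10, 100 * i + 60, 10 - i) for i in range(1, 6)]
print(passes_overlap_rule(rep_a, rep_b))
```

The linear scan in `overlaps` keeps the sketch short; genome-scale peak lists would normally use sorted intervals or an interval tree.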

A comparison of the H3K27Ac peaks from normal and tumor cells indicated that the patterns are very similar; in fact, ~24,000 H3K27Ac peaks are in common in the normal and tumor cells. However, there are clearly some peaks unique to normal and some peaks unique to the tumor cells. Therefore, we separately analyzed the normal and tumor H3K27Ac ChIP-seq peaks as different sets of biofeatures using FunciSNP (Figure 1B). Of the 746 SNPs, 236 were located in a H3K27Ac site common to both normal and tumor cells, whereas 140 were unique to tumor and 370 were unique to normal cells. Visual inspection of the SNPs and peaks using the UCSC genome browser showed that many of the identified enhancers harbored multiple correlated SNPs. Reduction of the number of SNPs to the number of different H3K27Ac sites resulted in 47 common, 41 tumor-specific, and 111 normal-specific regions. Visual inspection also showed that some of the H3K27Ac genomic regions corresponded to promoter regions (Supplementary Figure 4). Because promoter regions having correlated SNPs were already identified using TSS regions (see above), we eliminated the promoter-proximal H3K27Ac sites, resulting in 27 common, 32 tumor-specific, and 96 normal-specific distal H3K27Ac regions. As the next winnowing step, we selected only those enhancers having at least one SNP with an r^2 > 0.5, leaving 18 common, 9 tumor-specific, and 41 normal-specific distal H3K27Ac regions. We noted that some of the identified regions corresponded to low ranked H3K27Ac peaks. For our subsequent analyses, we wanted to limit our studies to robust enhancers that harbor correlated SNPs. Therefore, we visually inspected each of the genomic regions identified as having distal H3K27Ac peaks harboring a correlated SNP. To prioritize the distal regions for further analysis, we eliminated those for which the correlated SNP was on the edge of the region covered by the H3K27Ac signal or corresponded to a very low-ranked peak.
After inspection, we were left with a set of 27 distal H3K27Ac regions in which a correlated SNP (r^2 > 0.5) was well within the boundaries of a robust peak (Figure 1B). To confirm our results, we repeated the analysis using H3K27Ac data from a different colon cancer cell line, SW480, identifying only one additional enhancer harboring risk SNPs for CRC. The genomic coordinates of each of these 28 enhancers, which are clustered in 9 genomic regions, are listed in Table 3 (see also Supplementary Table 1). Combining all data, enhancers in 5 of the 9 regions were identified in all 3 cell types and 8 of the 9 regions were identified in at least two of the cell types.
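
The first step of the winnowing above, assigning each SNP to a common, tumor-only, or normal-only H3K27Ac site, can be sketched with a simple position-in-interval check. The SNP names, coordinates, and function names below are hypothetical, and peaks are assumed to be half-open `(chrom, start, end)` intervals; the study itself used FunciSNP for this step. In the study, the three categories partition the 746 SNPs (236 + 140 + 370).

```python
def in_any_peak(chrom, pos, peaks):
    """True if the position falls inside any (chrom, start, end) peak."""
    return any(c == chrom and start <= pos < end for c, start, end in peaks)

def classify_snps(snps, normal_peaks, tumor_peaks):
    """Label each (name, chrom, pos) SNP by which peak sets contain it."""
    labels = {}
    for name, chrom, pos in snps:
        in_normal = in_any_peak(chrom, pos, normal_peaks)
        in_tumor = in_any_peak(chrom, pos, tumor_peaks)
        if in_normal and in_tumor:
            labels[name] = "common"
        elif in_tumor:
            labels[name] = "tumor_only"
        elif in_normal:
            labels[name] = "normal_only"
        else:
            labels[name] = "outside"
    return labels

# Toy inputs (hypothetical coordinates and SNP names).
normal = [("chr8", 1000, 2000), ("chr8", 5000, 6000)]
tumor = [("chr8", 1500, 2500), ("chr8", 9000, 9500)]
snps = [("snp_a", "chr8", 1600), ("snp_b", "chr8", 2200),
        ("snp_c", "chr8", 5500), ("snp_d", "chr8", 7000)]
print(classify_snps(snps, normal, tumor))
```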

The effect of enhancer deletion on the transcriptome

Functional annotation of colon cancer risk SNPs.

Yao, L., Tak, Y. G., Berman, B. P. & Farnham, P. J. Nature Communications 10.1038/ncomms6114

The expression analyses described above provide a list of genes that potentially are regulated by the CRC risk-associated enhancers. However, it is possible that the enhancers regulate only a subset of those genes and/or the target genes are at a greater distance than was analyzed. One approach to identify targets of the CRC risk-associated enhancers would be to delete an enhancer from the genome and determine changes in gene expression. As an initial test of this method, we selected enhancer 7, located at 8q24. The region encompassing this enhancer has previously been implicated in regulating expression of MYC 31, which is located 335 kb from enhancer 7. We introduced guide RNAs that flanked enhancer 7, along with Cas9, into HCT116 cells, and identified cells that showed deletion of the enhancer. We then performed expression analysis using gene expression arrays, identifying 105 genes whose expression was down-regulated in the cells having a deleted enhancer (Supplementary Data 5); the closest one was MYC, which was expressed 1.5 times higher in control vs. deleted cells (Figure 4).

Epigenomic annotation of genetic variants using the Roadmap Epigenome Browser.

Zhou, X. et al. Nature Biotechnology 10.1038/nbt.3158

Advances in next-generation sequencing platforms have reshaped the landscape of functional genomic and epigenomic research as well as human genetics studies. Annotation of noncoding regions in the genome with genomic and epigenomic data has facilitated the generation of new, testable hypotheses regarding the functional consequences of genetic variants associated with human complex traits1,2. Large consortia, such as the US National Institutes of Health (NIH) Roadmap Epigenomics Consortium3 and ENCODE4, have generated tens of thousands of sequencing-based genome-wide data sets, creating a useful resource for the scientific community5. The WashU Epigenome Browser6-8 continues to provide a platform for investigators to effectively engage with this resource in the context of analyzing their own data. Here, we describe the Roadmap Epigenome Browser (http://epigenomegateway.wustl.edu/browser/roadmap), which is based on the WashU Epigenome Browser and integrates data from both the NIH Roadmap Epigenomics Consortium and ENCODE in a visualization and bioinformatics tool that enables researchers to explore the tissue-specific regulatory roles of genetic variants in the context of diseases. The Browser takes advantage of over 10,000 epigenomic data sets it currently hosts, including 346 ‘complete epigenomes’, defined as tissues and cell types for which we have collected a complete set of DNA methylation, histone modification, open chromatin and other genomic datasets9. Data from both the NIH Roadmap Epigenomics and ENCODE resources are seamlessly integrated in the browser using a new Data Hub Cluster framework. Investigators can specify any number of SNP-associated regions and any type of epigenomic data, for which the browser automatically creates “virtual data hubs” through a shared hierarchical metadata annotation, retrieves the data, and performs real-time clustering analysis. 
Investigators interact with the Browser to determine the tissue specificity of the epigenetic state encompassing genetic variants in physiologically or pathogenically relevant cell types from normal or diseased samples.

We illustrate the epigenomic annotation of two noncoding SNPs, identified from genome-wide association studies of people with multiple sclerosis10, by clustering the histone H3K4me1 profile of SNP-harboring regions and RNA-seq signal of their closest genes across multiple primary tissues and cells (Fig. 1). Both SNPs lie within putative enhancer regions. Whereas rs307896 marks an enhancer common across cell types, rs756699 is located in an enhancer specific to immune cells and is potentially targeting TCF7, a T cell specific gene 3.8kb downstream (Fig. 1). Thus, reference epigenomes provide important clues into the functional relevance of these genetic variants in the context of the pathophysiology of multiple sclerosis, including inflammation11.

Investigators can also use the browser to identify co-variation of epigenomic, transcriptomic, and transcription factor binding profiles across cell types to predict relationships between regulatory sites and target genes. Additionally, investigators can explore multiple complete reference epigenomes in different browser panels in parallel using synchronized genomic coordinates or independent genomic coordinates. A variety of Epigenome Browser functions, including gene set view, genome juxtaposition, chromatin interaction display and statistical testing, can be applied to better engage with this epigenomic resource.

We also provide the means for investigators to build their own Data Hub Clusters of different scales and clone the browser on Amazon Cloud to visualize and analyze private data in the context of public data. These tools, along with the rapidly growing epigenomic datasets of human cells of different states, will facilitate the translation of genetic signals into molecular mechanisms, leading to prognostic, diagnostic and therapeutic advances.

Allelic imbalances in gene expression

Chromatin architecture reorganization during stem cell differentiation.

Dixon, J. R. et al. Nature 10.1038/nature14222

Previous studies of allele-resolved gene expression have identified widespread allelic imbalances in gene expression between alleles24-27, 33. However, it remains unclear to what degree allele-biased gene expression varies among different lineages of a single individual. To address this, we re-analyzed haplotype-resolved mRNA-seq data and identified allelic biases in gene expression across the five H1 lineages. A total of 1,787 genes showed allelic bias in gene expression in one or more lineages studied here, representing ~24% of all testable genes (FDR 10%, Figure 4a). Most allelic differences in expression are not “on/off” events, but instead reflect biases in the level of expression from each allele (Figure 4b). Further, allele-biased genes include both lineage specific and constitutively expressed genes (Extended Data Figure 6c,d), and patterns of allelic bias can also be constitutive or cell-type variable (Figure 4c,d). Only in rare cases do genes switch expression from one allele to the other between cell types. As expected, genes subject to genomic imprinting are enriched among genes with allelic biases in expression (Figure 4e), though these represent ~1% of allele-biased genes (Figure 4f). While imprinted genes often occur in clusters, the majority of allele-biased gene expression is not clustered in the genome (Extended Data Figure 6e). Taken together, these data suggest that most allelic gene expression is due to mechanisms other than genomic imprinting. One possible regulatory mechanism that could give rise to allele-biased expression would be allelic bias in activity of cis-regulatory elements near these genes. Indeed, regions of the genome that show allele bias in histone acetylation, histone methylation, CTCF, and DNase I hypersensitivity are closer to allele-biased genes than randomly selected genomic regions (Figure 4g). 
Furthermore, allelic gene expression is strongly correlated with DNA methylation or chromatin modification state at promoters (Figure 4h,i). Of the 247 genes that contain heterozygous variants in their promoter regions and display biased transcription in at least one lineage, a majority exhibit allele-biased chromatin modifications or DNA methylation at the promoter (Figure 4h). Interestingly, 29% of the testable genes that have allele-biased expression show no evidence of allelic bias in chromatin state or DNA methylation at the promoter (Figure 4h), raising the possibility that elements outside of promoters may be responsible for the allelic gene expression.
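
One common way to call allele-biased expression from haplotype-resolved read counts is a two-sided binomial test against a 50:50 null, followed by Benjamini-Hochberg adjustment. The sketch below shows that approach as an assumption for illustration; the excerpt gives only the 10% FDR threshold and does not specify the test the authors used. Gene names and counts are hypothetical.

```python
from math import comb

def binomial_two_sided_p(ref_reads, alt_reads):
    """Exact two-sided binomial P-value under a 50:50 allelic null:
    sum the probabilities of all outcomes no more likely than the observed one."""
    n = ref_reads + alt_reads
    probs = [comb(n, k) / 2 ** n for k in range(n + 1)]
    observed = probs[ref_reads]
    return min(1.0, sum(q for q in probs if q <= observed * (1 + 1e-9)))

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # walk from the largest P-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy allele counts (hypothetical genes and numbers).
counts = {"gene_a": (60, 20), "gene_b": (33, 30), "gene_c": (5, 45)}
pvals = [binomial_two_sided_p(r, a) for r, a in counts.values()]
for gene, q in zip(counts, benjamini_hochberg(pvals)):
    print(gene, "biased" if q <= 0.10 else "not significant")
```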

Motif disruption is correlated with H3K27ac variation

Predicting the human epigenome from DNA motifs.

Whitaker, J. W., Chen, Z. & Wang, W. Nature Methods 10.1038/nmeth.3065

A recent study of 19 individuals correlated sequence variation at known TF motif sites with variation in H3K27ac levels at overlapping peaks 43. Kasowski et al. found that H3K27ac variation in 32,886 peaks correlated with disruption of 662 known motifs by SNPs among the 19 individuals, and significant association was found in 32% of regions (significance determined using Spearman’s rank and label permutation 43). To demonstrate the power of the Epigram pipeline, we repeated the analyses done by Kasowski et al. by first running Epigram on H3K27ac, which resulted in a full model featuring 133 motifs that are predictive of H3K27ac (Online Methods). Epigram’s motifs were significantly correlated in 62% of regions using a motif set that is ~20% the size of that used by Kasowski et al. (662 known motifs). Thus, Epigram discovers motifs that are significantly correlated with H3K27ac variation in 30% more regions and that represent the novel binding patterns for regulators of H3K27ac. Furthermore, Kasowski et al.43 showed 20 TFs that are significantly correlated within ~4,500 variable regions, whereas the motifs from Epigram’s 20-motif model are significantly correlated within 7,006 variable regions (Fig. 5). One of Epigram’s 20 motifs matches the known IKZF1 motif, which has been shown to target chromatin remodeling and deacetylation complexes during lymphocyte differentiation 44. In addition, we also found that 3 of these 20 motifs match motif groups identified to be associated with H3K27ac in H1, NPC, MSC and TBL. Taken together, Epigram is able to explain significantly more variants while using fewer motifs than the Kasowski et al. analysis.
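
The significance procedure mentioned above (Spearman's rank correlation with label permutation) can be sketched for a single region as follows. The inputs are assumed to be one motif-disruption score and one H3K27ac signal value per individual; the names and toy values are hypothetical, and details such as the number of permutations are not given in the excerpt.

```python
import random

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def spearman_permutation_p(x, y, n_perm=2000, seed=0):
    """Two-sided P-value from shuffling the individual labels of y."""
    rng = random.Random(seed)
    observed = abs(spearman(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(spearman(x, shuffled)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Toy values for 8 individuals (not real data).
motif_score = [0.9, 0.8, 0.85, 0.3, 0.2, 0.25, 0.6, 0.5]
h3k27ac = [12.0, 10.5, 11.0, 4.0, 3.5, 5.0, 8.0, 7.5]
print(spearman(motif_score, h3k27ac), spearman_permutation_p(motif_score, h3k27ac))
```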

The impact of chromatin on local mutation density in cancer cells

Cell-of-origin chromatin organization shapes the mutational landscape of cancer.

Polak, P. et al. Nature 10.1038/nature14221

The comparison of individual epigenomic features with local mutation density revealed that chromatin marks corresponding to the tumour’s cell type of origin are more strongly associated with local mutation density than marks corresponding to unrelated cell types. For example, DHS marks from melanocytes explained a substantially larger fraction of the variance in melanoma mutation density than DHS marks from other cell types, even from the same tissue (skin) (Fig. 1b). As another example, even though H3K4me1 marks in melanocytes and hepatocytes are highly correlated (r = 0.8), the distribution of mutations in liver cancer followed the levels of H3K4me1 in hepatocytes but not in melanocytes, whereas melanoma mutations correlated with the levels of H3K4me1 in melanocytes but not in hepatocytes (Fig. 1c).

This initial observation suggested that the impact of chromatin on local mutation density is highly cell-type specific. The comprehensive representation of different cell types in the Epigenome Roadmap could thus enable an improved prediction accuracy of mutations compared to previous studies. To rigorously quantify the contribution of different chromatin marks and gene expression to regional mutation density, and the extent of cell type specificity, we used Random Forest regression (Methods). Remarkably, epigenetic marks, together with replication timing measured in ENCODE consortium cell lines18, collectively accounted for 74–86% of the variance in mutation density in seven cancer types (Fig. 2a). In glioblastoma, for which fewer mutations were available for the analysis, 55% of the variance in mutation density could be explained. This is substantially higher than in earlier studies4 and indicates that, at least for these cancer types, we have identified a set of epigenetic variables and cell types that almost fully predict the mutational variability along the genome. This enhanced prediction accuracy was not simply due to the larger size of the training data relative to previous studies, as the predictive ability dropped by only ~2–6% when only 10% of the data was used (Extended Data Fig. 3).

Taken together, the above results strongly suggest that the cell of origin of an individual tumour sample could be predicted from its mutation pattern alone. Mutation profiles of individual samples cluster according to cancer type, and, consequently cell of origin (Fig. 4a). We developed a straightforward predictor based on enrichment of epigenomic variables from a single cell type among the top 20 variables selected by the Random Forest analysis. This approach classified 88% of melanoma, colorectal, liver, multiple myeloma, oesophageal and glioblastoma cancer genomes to melanocytes, colonic mucosa, liver, haematopoietic, stomach mucosa and brain tissues, respectively (Fig. 4b). Thus, mutational patterns contain sufficient information for identifying the cell type of origin of a tumour. We propose that sequencing the DNA of a tumour of unknown primary origin can allow the precise pinpointing of the cell type of origin of that tumour.
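
The predictor described above picks the cell type whose epigenomic variables are enriched among the top 20 variables of the Random Forest ranking. A minimal sketch of that idea follows; it assumes the ranking is already available as `(variable_name, cell_type)` pairs ordered by importance, and it defines enrichment as the share of a cell type in the top set divided by its share among all variables. Both the input format and this particular enrichment ratio are assumptions, since the excerpt does not give the exact formula.

```python
from collections import Counter

def predict_cell_of_origin(ranked_variables, top_n=20):
    """Return (cell_type, enrichment) for the cell type most over-represented
    among the `top_n` highest-ranked variables."""
    if not ranked_variables:
        raise ValueError("need at least one ranked variable")
    top = ranked_variables[:top_n]
    top_counts = Counter(cell for _, cell in top)
    all_counts = Counter(cell for _, cell in ranked_variables)

    def enrichment(cell):
        share_top = top_counts[cell] / len(top)
        share_all = all_counts[cell] / len(ranked_variables)
        return share_top / share_all

    best = max(top_counts, key=enrichment)
    return best, enrichment(best)

# Toy ranking (hypothetical variable names): melanocyte marks dominate the top 20.
ranking = ([("mel_mark_%d" % i, "melanocyte") for i in range(12)]
           + [("liv_mark_%d" % i, "liver") for i in range(28)])
print(predict_cell_of_origin(ranking))
```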

Large-scale epigenome imputation improves disease variant enrichment

Large-scale imputation of epigenomic datasets for systematic annotation of diverse human tissues.

Ernst, J. & Kellis, M. Nature Biotechnology 10.1038/nbt.3157

As epigenomic maps have recently emerged as an unbiased approach for discovering disease-relevant tissues and cell types3,32, we also evaluated the impact of epigenome imputation on the interpretation of trait-associated variants from GWAS. We quantified the enrichment (positive or negative) of trait-associated variants from the NHGRI GWAS catalog33 in both observed and imputed datasets for each mark. We evaluated enrichments both in aggregate across all studies, based on Area under an ROC curve up to a 5% false positive rate (AUC5%) for the signal level recovery of trait-associated SNPs, and at the level of individual studies, based on mark signal rank differences between each study's SNPs and all other SNPs in the GWAS catalog (see Methods). We evaluated both the number of studies for which there was a significant signal rank difference in at least one sample, and the total number of study-sample pairs that are significant, at varying p-value thresholds. We then compared both the number of significant studies and the number of significant pairs to the numbers obtained for randomized versions of the GWAS catalog, which also enabled us to obtain a false discovery rate estimate for each p-value threshold (Table S2, see Methods).
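The comparison against randomized catalogs can be sketched as follows. This is a minimal illustration, assuming the false discovery rate at a given threshold is estimated as the average number of significant calls in the randomized catalogs divided by the number of significant calls in the actual catalog; the exact procedure is in the Methods, which are not reproduced here. The function name and the toy p-values are hypothetical.

```python
def empirical_fdr(real_pvalues, randomized_pvalue_sets, threshold):
    """Estimate the FDR at a p-value threshold from randomized catalogs.

    real_pvalues: p-values from the actual catalog
                  (one per study or per study-sample pair)
    randomized_pvalue_sets: list of p-value lists, one per randomized catalog
    Assumption: FDR = mean(null significant calls) / real significant calls.
    """
    real_hits = sum(p <= threshold for p in real_pvalues)
    if real_hits == 0 or not randomized_pvalue_sets:
        return None  # undefined: nothing significant, or no null to compare to
    null_hits = [sum(p <= threshold for p in ps) for ps in randomized_pvalue_sets]
    expected_false = sum(null_hits) / len(null_hits)
    # Cap at 1.0 because the ratio can exceed 1 when the null yields more calls.
    return min(1.0, expected_false / real_hits)

# Toy example: 3 real calls at p <= 1e-3, and on average 0.5 calls under the null.
real = [1e-6, 1e-5, 1e-4, 0.02, 0.3]
randomized = [[1e-4, 0.2, 0.5, 0.6, 0.9],
              [0.01, 0.3, 0.4, 0.7, 0.8]]
print(empirical_fdr(real, randomized, threshold=1e-3))
```

Evaluating the same function over a grid of thresholds gives one FDR estimate per threshold, which matches the way the text reports results at varying p-value cutoffs.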

For all Tier-1 active marks, imputed data resulted in substantially greater recovery of SNPs in the GWAS catalog (Fig. S23) than the observed data, and more significant enrichments for both the number of studies and the number of study-sample pairs, across all tested significance thresholds (Fig. 4a, Fig. S24-S25). In addition, the imputed data yielded a stronger enrichment for each enriched sample in the large majority of cases for nearly all marks (Fig. 4b, Fig. S26). We confirmed that the actual GWAS catalog yielded more significant associations than randomized versions, for both the observed and imputed data (Fig. 4a, Fig. S24-S25). Imputed data performance was substantially higher than that of the average mark signal across all available samples (Fig. S24b), emphasizing that the increased performance was not simply due to averaging multiple samples. We also confirmed that the most significantly enriched samples for a given study were generally biologically relevant for active marks: for H3K27ac, for example, we found that liver was enriched in various cholesterol phenotypes, that immune-related cells were enriched in various immune-related disorders, that colonic mucosa was enriched in ulcerative colitis, and many other biologically meaningful enrichments (Fig. 4c-f, Table S2).

These results help validate the biological relevance of imputed datasets based on an orthogonal annotation source, and help illustrate imputed datasets as a potentially useful resource for interpreting GWAS results.

Figure 1: Epigenomic enrichments of genetic variants associated with diverse traits.

Tissue-specific H3K4me1 peak enrichment for genetic variants associated with diverse traits. Circles denote the reference epigenome (column) of highest enrichment for SNPs reported by a given study (row), defined by trait and publication (PubMed identifier, PMID). Tissue (Abbrev) and p-value (-log10) of highest enrichment are shown. Only rows and columns containing a value meeting an FDR of 2% are shown (full matrix for all studies showing at least 2% FDR in Extended Data 11-12).

Figure 2: Increasing enhancer orthologs help interpret AD-associated non-coding loci.

Overlap of disease-associated SNPs (top) with increasing enhancers (2nd row, red) and immune enhancers in human (CD14+ primary cells) is shown for genome-wide significant (INPP5D and SPI1/CELF1; panels a and b) and below-significance (ABCA1; panel d) AD GWAS loci. Roadmap chromatin state annotations are shown for immune cells (CD14+ primary; E029), hippocampus (E071), and fetal brain (E081), with colors as shown in the legend. Light red highlight denotes increasing enhancer regions tested in luciferase assay. c, AD-associated SNP rs1377416 amplifies in vitro luciferase activity of putative enhancer region 38,313 - 37,359 bp upstream of SPI1 (PU.1) gene in BV-2 cells (n=3, one-way ANOVA p<0.0001, *Tukey’s test p<0.05). ns, nonsignificant.

Figure 3: Genetic fine-mapping of human disease.

a, GWAS catalog loci were clustered to reveal shared genetic features of common human diseases and phenotypes. Color scale indicates correlation between phenotypes (high=red, low=blue). b, Association signal to MS for SNPs at the IFI30 locus. c, Scatter plot of SNPs at the IFI30 locus demonstrates the linear relationship between LD distance (r2) to rs1154159 (red) and association signal. d, Candidate causal SNPs were predicted for 21 autoimmune diseases using PICS. Histogram indicates genomic distance (bp) between PICS Immunochip lead SNPs and GWAS catalog index SNPs. e, Histogram indicates number of candidate causal SNPs per GWAS signal needed to account for 75% of the total PICS probability for that locus. f, Plot shows correspondence of PICS SNPs to indicated functional elements, compared to random SNPs from the same loci (error bars indicate standard deviation from 1000 iterations using locus-matched control SNPs).
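The quantity summarized in panel e can be illustrated with a short sketch: given per-SNP probabilities at one locus, count how many top-ranked SNPs are needed to reach 75% of the locus total. The function name and the toy probabilities are hypothetical, and the sketch assumes the per-SNP PICS probabilities have already been computed; it does not implement PICS itself.

```python
def snps_needed(pics_probabilities, fraction=0.75):
    """Smallest number of top-ranked SNPs whose summed probability reaches
    `fraction` of the total probability at the locus."""
    total = sum(pics_probabilities)
    if total <= 0:
        return 0  # no probability mass to account for
    running = 0.0
    # Walk through SNPs from most to least probable, accumulating probability.
    for count, p in enumerate(sorted(pics_probabilities, reverse=True), start=1):
        running += p
        if running >= fraction * total:
            return count
    return len(pics_probabilities)

# Toy locus: the three most probable SNPs carry 0.85 of the total probability.
locus = [0.05, 0.5, 0.1, 0.2, 0.15]
print(snps_needed(locus))  # prints 3
```

Applying the function to every locus and tabulating the returned counts yields the kind of histogram the caption describes.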

Figure 4: Cell-type specificity of human diseases.

Heatmap depicts enrichment (red=high; blue=low) of PICS SNPs for 39 diseases/traits in acetylated cis-regulatory elements of 33 different cell types.

Figure 5: DNA methylation variability is increased at loci of intermediate methylation.

The MAD for DNA methylation values in CD34+ HSPCs measured by HELP-tagging (top, 29 individuals) or RRBS (bottom, 7 individuals) is shown as a function of mean DNA methylation across all of the samples tested. Although HELP-tagging usually plots DNA methylation with a zero value to indicate complete methylation, we inverted the scale on this occasion to make the two plots comparable. The number of loci is reflected by the grey shading. The line shown indicates the mean MAD value and reveals, for both data sets, increased variability of DNA methylation at loci with intermediate values.
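The relationship plotted in this figure can be sketched in a few lines. The sketch assumes that MAD stands for median absolute deviation (the caption does not expand the abbreviation), that methylation is expressed on a 0-100 scale, and that loci are grouped into equal-width bins of mean methylation; the function names, bin count and toy values are hypothetical.

```python
from statistics import mean, median

def mad(values):
    """Median absolute deviation of one locus across individuals (assumed definition)."""
    m = median(values)
    return median(abs(v - m) for v in values)

def mad_by_mean_bin(loci, n_bins=10):
    """Average MAD of loci grouped by their mean methylation.

    loci: list of per-locus methylation values (0-100), one value per individual
    Returns {bin_index: mean MAD of the loci falling in that bin}.
    """
    bins = {}
    for values in loci:
        # Map the locus mean (0-100) to a bin index; a mean of 100 goes in the last bin.
        b = min(int(mean(values) / 100 * n_bins), n_bins - 1)
        bins.setdefault(b, []).append(mad(values))
    return {b: mean(ms) for b, ms in sorted(bins.items())}

# Toy data: the locus with intermediate methylation varies most across individuals.
loci = [[0, 2, 1], [40, 60, 50], [98, 100, 99]]
print(mad_by_mean_bin(loci))  # prints {0: 1, 5: 10, 9: 1}
```

In this toy input the intermediate bin has the largest average MAD, which is the qualitative pattern the caption reports for both data sets.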

Figure 6: Empirical annotation of the CD34+ HSPC genome based on chromatin features reveals candidate cis-regulatory element locations

(a) A contour plot of the regions within the SOM where Segway features 4 (above) and 6 (below) enrich, showing feature 4 to be composed of loci where H3K4me1 and H3K27me3 occur, while the loci composing feature 6 contain the H3K4me3 and H3K27ac modifications. Consistent with these findings, b shows feature 6 (red) to be enriched at the TSS for a metaplot (top) and a heat map (below) of all RefSeq genes, indicating promoter characteristics, while feature 4 (yellow) flanks this region and is consistent with enhancers in a poised state. In c, similar metaplot (top) and heat map (below) representations of the 2-kb flanking CpG islands demonstrate strong enrichment in feature 4, indicating that these ‘CpG island shores’ in fact represent candidate enhancers in this cell type.

Figure 7: DNA methylation variability is enriched at candidate enhancers and promoters at TSSs of RefSeq genes and at CpG islands and shores

(a) Enrichment of variability of DNA methylation is marked at loci with H3K4me1, H3K27ac and H3K27me3 marks in particular. (b) A RefSeq metaplot with feature density indicated by the grey shading above and within the graph, and mean variability for features 4 (yellow, top) and 6 (red, bottom) depicted, with increased variability distributing where each mark is maximally located. The significance of the enrichment is shown at the depicted peak P-value location. Analysis of CpG islands (c) shows variability in flanking regions (shores) associated with the presence of feature 4.

Figure 8: Variability of DNA methylation at candidate enhancer sequences discriminates genes expressed at lower levels

(a) The overall pattern of DNA methylation variability at RefSeq genes broken down by expression quantile, showing differences at silent compared with expressed genes at TSS. (b) No such differences are measurable when testing candidate promoters (Segway feature 6, top), whereas candidate enhancers (Segway feature 4, bottom) show increased variability for DNA methylation for genes that are either silent or expressed at the lowest quartile.

Figure 9: NMF of DNA methylation profiles shows evidence for 13–20 subpopulations within the CD34+ HSPC population

The upper plot shows a smooth spline (orange) and value distribution (blue) for the Frobenius norm as a function of increasing cell subpopulation number, with the lower plot representing the P-value testing (two sample t-test) whether the difference between the successive simulations is significant. We observe two points at which the subsequent change is insignificant, at values 13 and 20, suggesting that the number of subpopulations differing in DNA methylation profiles within the CD34+ HSPC population is within this range.

Figure 10: Identification of potential functional SNPs for CRC.

(a) Shown is the number of SNPs identified by FunciSNP in each of three categories for 25 colon cancer risk loci (see Table 1 for information on each CRC risk SNP). For exons, only non-synonymous SNPs are reported; parentheses indicate the number of SNPs that are predicted to be damaging; see Table 2 for a list of the expressed genes associated with the correlated SNPs. For TSS regions, the region from −2 kb to +2 kb relative to the start site of all transcripts annotated in GENCODE V15, including coding genes and non-coding RNAs, was used; see Table 2 for a list of expressed transcripts associated with the correlated SNPs. (b) For H3K27Ac analyses, ChIP-seq data from normal sigmoid colon and HCT116 tumour cells were used; see Table 3 for further analysis of distal regions harbouring SNPs in normal and tumour colon cells. The SNPs having an r2>0.1 that overlapped with H3K27Ac sites were identified separately for HCT116 and sigmoid colon data sets. Because more than one SNP could identify the same H3K27Ac-marked region, the SNPs were then collapsed into distinct H3K27Ac peaks. The sites that were within ±2 kb of a promoter region were removed to limit the analysis to distal elements. To obtain a more stringent set of enhancers, those regions having only SNPs with r2<0.5 were removed. The remaining set of 68 distal H3K27Ac sites was contained within 19 of the 25 risk loci. Visual inspection to identify only the robust enhancers having linked SNPs not at the margins reduced the set to 27 enhancers located in 9 of the 25 risk loci; an additional enhancer was identified in SW480 cells (see Table 3 for the genomic locations of all 28 enhancers). Colour key: green=SNPs or H3K27Ac sites unique to normal colon, red=unique to colon tumour cells, blue=present in both normal and tumour colon.

Figure 11: Identification of genes affected by deletion of enhancer 7.

(a) Shown are the expression differences (x axis) and the significance of the change (y axis) of the genes in the control HCT116 versus HCT116 cells having complete deletion of enhancer 7. The Illumina Custom Differential Expression Algorithm was used to determine P-values to identify the significantly altered genes; three replicates each for the control and deleted cells were used. Genes on chromosome 8 (the location of enhancer 7) are shown in blue. The spot representing the MYC gene is indicated by the arrow. (b) Shown are all genes on chromosome 8 that change in expression and the 10 genes showing the largest changes in expression upon deletion of enhancer 7. The location of the enhancer is indicated and the chromosome number is shown on the outside of the circle. (c) The genes identified as potential targets using TCGA expression data are indicated; of these, MYC is the only one showing a change in gene expression upon deletion of the enhancer.

Figure 12: Allelic biases in gene expression in H1 lineages.

a, Proportion of genes with detectable allelic expression with statistically significant allelic bias. b, Density plot of the absolute value of the fold change in expression (log2) between alleles. c, Heat map showing k-means (k=20) clustering of the allelic expression ratios (log2) at genes with constitutively testable expression. d, Genome browser shot of variable allelic expression of the PARP9 gene. e, Fraction of imprinted genes among allele-biased genes and other genes (p=4.4e-5, Fisher exact test). f, Fraction of allele-biased genes that are known imprinted genes. g, Cumulative density plot of distances from variants to the nearest allele-specific gene. Allele-specific variants are defined using histone acetylation, H3K9me3, H3K27me3, DNaseI HS, and H3K4me3 (p<2.2e-16, KS-test). h, Number of allele-biased genes showing consistent allele-specific chromatin states in their promoter regions. Active variants are defined by H3K4me3, DNaseI HS, or histone acetylation. Inactive promoter variants are defined by DNA methylation and H3K9me3/27me3. i, Genome browser shot of mRNA-seq and chromatin features surrounding the TDG gene.

Figure 13: De novo motif disruption and H3K27ac levels are correlated.

The disruption of de novo motifs was correlated with variation in H3K27ac levels43. Motifs are sorted by their number of significantly correlated peaks; peaks are sorted by their associated motifs. Matches with known TFs and motif groups (from the analysis of H1 and the four derived cell types) are shown on the left. Motif groups start with 'G:'.

Figure 15

Predicting local mutation density in cancer genomes using Random Forest regression trained on 424 epigenomic profiles. Pearson correlation between observed and predicted mutation densities along chromosomes is shown. (a) Actual versus predicted mutation densities in eight cancers. (b, c) Prediction accuracy represented as mean ± s.e.m (estimated using 10-fold cross-validation). Panels show prediction accuracy for all mutations and for nucleotide changes predominant in the corresponding cancer (b), and prediction accuracy in lung adenocarcinoma genomes stratified by smoking history and predominant nucleotide changes (G>T or C>T) (c).

Figure 16

Analysis of individual cancer genomes and prediction of cell type of origin. (a) Principal coordinate analysis (PCOA) of the distribution of mutations in individual cancer genomes. Filled circles represent cancers for which the correct cell type of origin was identified. (b) The accuracy of cell type of origin prediction for individual cancer genomes: the number of cancer samples that were assigned to the correct (solid colors) or incorrect (textures) cell types of origin based on their mutation profile.

Figure 17

Mutation density in melanoma is associated with individual chromatin features specific to melanocytes. (a) The density of C>T mutations in melanoma alongside a 100 kb window profile of melanocyte chromatin accessibility (“DNase I accessibility index”; shown on a normalized, reversed scale, so that high values correspond to less accessible chromatin and vice versa). (b) The number of mutations per megabase in melanoma versus DHS density, for three types of skin cells. (c) The normalized density of mutations in liver cancer and melanoma genomes as a function of density quintiles of H3K4me1 marks in liver cells and in melanocytes. For both cancer genomes, mutation density depends only on H3K4me1 marks measured in the cell of origin.