What were the key findings from this second stage of the ENCODE project?

First, we are starting to see the genome at a level that has never been seen before. The genome sequence was completed in 2003, but that just laid out the bases. It was quite clear, even back then, that we didn't really know what the functions of many of those bases were. Only something like 1.5% are protein-coding, and many people called the rest junk DNA. The question was: what did all this other stuff do? We thought that some of it had to be regulatory, but it is one thing to think it and another to know it. The big push in a pilot phase and over the past 5 years has been to characterize these remaining sequences. What we found were a lot of transcribed regions and regulatory regions that had never been seen before.

This is important because it is foundational. It gives us a framework on which other researchers can lay their own studies. It defines the parts list of where all the regulatory elements are and where all the transcripts are.

It is also a big deal because most disease loci map outside coding sequences. It is estimated that 85% of disease-causing mutations do not lie inside a protein-coding region, and a lot of them are thought to affect regulatory sequences. With the ENCODE maps, researchers can actually start looking at genetic variants that land on regulatory sequences and interpreting these in terms of the basis of disease, disease risk and drug response.

Were there any surprises for you in this set of findings?

There is a much higher density of regulatory regions than people had expected. It is safe to say that there are more sequences regulating genes than there are sequences coding for the genes themselves. A minimum number based on the work so far suggests that regulatory sequences make up at least 5% of the genome, but that number could easily creep up to 15% or 20% of the genome.

What are the implications for drug discovery?

The ENCODE project itself is mostly making the maps, but there are quite a few other groups out there that are using the maps to understand mutations that sit on regulatory sequences and disease risk.

By figuring out how variants in regulatory regions affect the expression of different genes, we can get a better grip on disease biology. And if a gene's function is partially reduced or over-activated because of a change in a regulatory sequence, this could help researchers identify proteins that might make appealing drug targets.

We are also, for the first time, starting to understand the combinatorial nature of gene regulation. Different genes are regulated by combinations of transcription factors, and it turns out that these factors come together in different ways to regulate different genes. We've now looked at so many transcription factors that we can start to understand which transcription factors come together to regulate which genes.

These advances are really powerful for drug discovery because they will help us to understand regulatory networks and identify where to intervene to develop drugs.

Is it feasible to use these maps to identify transcription factors themselves that are worth targeting?

Hitting transcription factors themselves is tough. I think it will be a matter of hitting both upstream and downstream elements once you've figured out the pathways you want to target. Some of those are likely to be more druggable.

What about the importance of ENCODE in terms of using genomic data to stratify patients into clinical trials and identify subgroups of patients who respond to treatment?

This is probably going to roll out in a very phased fashion. The variants in regulatory sequences that we know how to interpret will be added to the workflow. And the ones we don't know how to interpret will just be left blank, so to speak.

Even for cancer genetics, people are still mostly just interpreting the coding sequences. But we're moving towards looking at regulatory information as well.

ENCODE has made functional maps for specific cell types. How generalizable are the findings from these cell lines to disease-relevant cell lines?

We don't know the answer to that just yet. And that is one reason why we need to expand the number of cell types that ENCODE analyses.

We did look at blood-related cell lines though, so I think for blood-related diseases there will be some direct relevance. Towards the end of this stage of the project we also started depositing data on liver cells, and we are about to deposit some information on heart cells, which could also have implications particularly in terms of understanding drug toxicity.

There are also model organism ENCODE (modENCODE) projects that are generating functional genomic maps in worms and flies to inform human biology. But you do have to be cautious, because only a fraction of the regulatory information is conserved across humans and animal models.