In the last few years there has been a proliferation of genome-wide association studies, in which relationships are mapped between genomic sequence variants and predisposition to a disease or a trait of interest. These studies depend upon the participation of thousands of individuals in the research process. It has been assumed that it is not possible to determine, based on aggregate single-nucleotide polymorphism (SNP) data, whether or not a particular individual is present in a pool.

However, using a statistical approach, David Craig and colleagues at the Translational Genomics Research Institute now show that this assumption is incorrect.

Statistical analysis can identify whether an individual SNP profile is present in pooled genotype data. Credit: Katie Ris-Vicari

“One way you can understand what we're doing is if you think of a roulette table,” says Craig. “The colors it can have are either red or black, and let's say I want to know if the table is slightly biased towards black. If I spin it once, I wouldn't really get a good idea of bias because there isn't much information in a single measurement. But if I spin it half a million times, you can bet I could find some pretty subtle biases.” In other words, Craig and colleagues take advantage of the fact that it is possible to monitor hundreds of thousands of SNPs to determine whether or not the SNP profile of a particular individual is present in pooled profile data.
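The arithmetic behind the roulette analogy can be sketched in a few lines. This is an illustration only, assuming a simplified table with just red and black at equal odds; the function name `smallest_detectable_bias` and the threshold `z` are hypothetical choices, not anything taken from the study.

```python
import math


def smallest_detectable_bias(n_spins, z=3.0):
    """Rough shift from a 50/50 table that would stand z standard errors clear of chance.

    Assumes a fair red/black table, so the standard error of the observed
    fraction of blacks is sqrt(0.5 * 0.5 / n_spins) = 0.5 / sqrt(n_spins).
    """
    standard_error = 0.5 / math.sqrt(n_spins)
    return z * standard_error


if __name__ == "__main__":
    # The detectable bias shrinks with the square root of the number of spins.
    for n in (1, 100, 500_000):
        print(n, smallest_detectable_bias(n))
```

Because the standard error falls with the square root of the number of measurements, a bias far too small to see in a handful of spins becomes visible after half a million, which is the same reason that combining hundreds of thousands of SNPs can reveal a very small shift.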

“What we do is essentially a t-test,” says Craig. At each SNP, the researchers compare the allele frequency of the person in question with the mean allele frequency in a reference population and with the mean allele frequency in the pooled test population, asking which of the two the person more closely resembles. When this is done across hundreds of thousands of SNPs, it is possible to assess statistically whether or not the pooled data are shifted significantly in the direction of the person in question. It may even be possible, the researchers report, to use a relative of the person for this purpose. Notably, for the method to work, high-density SNP data for the person must already be available.
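A minimal sketch of this kind of comparison, on simulated data, might look like the following. It assumes one plausible form of the per-SNP statistic (the person's distance from the reference minus their distance from the pool) and is not necessarily the exact procedure the team used. The function names, the pool size and the simulated allele frequencies are all hypothetical.

```python
import numpy as np


def membership_t_statistic(person, reference, pool):
    """One-sample t statistic across SNPs for D = |Y - Ref| - |Y - Pool|.

    person:    per-SNP allele frequency of the individual (0, 0.5 or 1)
    reference: per-SNP mean allele frequency in a reference population
    pool:      per-SNP mean allele frequency in the pooled test population
    A large positive value suggests the pool is shifted towards the person.
    """
    person = np.asarray(person, dtype=float)
    reference = np.asarray(reference, dtype=float)
    pool = np.asarray(pool, dtype=float)
    d = np.abs(person - reference) - np.abs(person - pool)
    return d.mean() / (d.std(ddof=1) / np.sqrt(d.size))


def simulate_membership_test(n_snps=50_000, n_people=100, seed=0):
    """Simulate a pool and a reference panel; return (t_member, t_outsider)."""
    rng = np.random.default_rng(seed)
    # Assumed population allele frequencies, one per SNP.
    freq = rng.uniform(0.1, 0.9, size=n_snps)

    def draw(n):
        # Diploid genotypes expressed as allele frequencies 0, 0.5 or 1.
        return rng.binomial(2, freq, size=(n, n_snps)) / 2.0

    pool_people = draw(n_people)
    pool = pool_people.mean(axis=0)
    reference = draw(n_people).mean(axis=0)
    member = pool_people[0]   # someone who is in the pool
    outsider = draw(1)[0]     # someone who is not
    return (membership_t_statistic(member, reference, pool),
            membership_t_statistic(outsider, reference, pool))


if __name__ == "__main__":
    t_member, t_outsider = simulate_membership_test()
    print("member:", t_member, "outsider:", t_outsider)
```

In this toy setting the per-SNP shift caused by one person in a pool of 100 is tiny, but averaging over tens of thousands of SNPs makes it stand out, whereas the statistic for someone outside the pool stays close to zero.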

Using both simulations and experimental analysis with high-density SNP microarrays, Craig and his team show that it is possible to identify an individual in a mixture of hundreds to thousands of genomic samples, even when the DNA of the person in question is present only in trace amounts (as low as 0.1% of the total). Craig estimates that this may not be the limit of sensitivity. “My guess is it could go down to about one in ten thousand,” he says.

In addition to its consequences for forensic analyses, this demonstration has implications for how pooled genotype data will be shared in the future. To protect individual privacy, the US National Institutes of Health and other organizations have already removed aggregate genomic data from public access and instituted approval processes for accessing these data, similar to those already in place for individual-level data.

Craig suggests, however, that there is another side to this story. “I hope this will open up the conversation about data sharing,” he says. “In my opinion, you really need to share individual-level data, since you lose a lot of power when you just share the aggregate information, and our work now shows that, even in aggregate data, the identity of participants is not completely masked. And I think it's better to work out how to do this responsibly now, when the amount of data is manageable, than in five or ten years.”