Identifying Samples from Genomic Data

In the movie GATTACA, a futuristic society uses an individual’s DNA sequence at birth to determine his or her probable fate. The protagonist has a 99% probability of congenital heart defect, according to his readout, which means that he’d never be able to achieve his dream of going into space (at least through proper channels). Instead, he undertakes a great deception, infiltrating a corporate space program by posing as a “genetically ideal” candidate. Along the way, we learn that some of the selectively bred humans around him aren’t as perfect as they’re supposed to be. It’s a story that underscores a real modern-day fear: that genetic information, decoded by ever-advancing DNA sequencing technologies, might one day be subverted and used against us.

Cognizant of these fears, scientists and policymakers in the U.S. have taken a number of steps to prevent the misuse of genetic information and safeguard the privacy of patients. Research projects involving human subjects are required to “de-identify” samples, to insulate relevant health information (such as genetic data) from identifying information (patient names). To minimize patient concerns and encourage participation, consent documents such as those used by the 1000 Genomes Project reassure participants that “it will be hard for anyone to find out anything about you personally from any of this research.”

Note the use of the word hard, rather than impossible. And as it turns out, even that was wrong.

Identification of Personal Genomes by Surname Inference

Credit: Science (2013)

In last week’s issue of Science, Melissa Gymrek and colleagues from the lab of Yaniv Erlich (Whitehead) report a method for triangulating the identity of a sample donor using genomic data and public databases.

As a proof-of-principle, they uncovered the identities of about 50 sample donors from the CEPH Utah collection, perhaps the best-studied collection of “anonymous” samples to date. Their approach exploits several facts of this “information age” we live in.

Paternal Lineage and Surnames

One of those facts is that, in the United States and many countries, children usually take the surnames of their fathers. The Y-chromosome, too, is passed down the paternal line. Thus, Y-chromosomes and surnames are usually inherited together. There are exceptions, of course — adoption, mistaken paternity, divorce, mutations — but for the most part this rule holds true. Thus, male [blood] relatives sharing a paternal line tend to have the same last name and same Y-chromosome.

STRs and DNA Profiling

Y-chromosomes have a number of highly polymorphic genetic markers called short tandem repeats (STRs). STRs have been in use for a long time. They can be genotyped using gel electrophoresis, which made them some of the earliest genetic markers that we could genotype. Though not as common as other forms of variation (e.g. SNPs), STRs have many measurable alleles: with about a dozen well-chosen STRs you can match one DNA sample to another with very high precision.

In fact, the unique DNA profile used in most law enforcement databases (notably the FBI’s CODIS) comprises the genotypes of about 15 STRs genome-wide. When you’re watching CSI and someone says “The DNA profiles had 7 alleles in common,” they’re talking about STRs (here, half of the alleles tested were shared, which is what you would expect from first-degree relatives).
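To make the "alleles in common" idea concrete, here is a minimal Python sketch that counts shared alleles between two STR profiles, locus by locus. The locus names are real forensic STR markers, but the genotypes are made up for illustration, and the `shared_alleles` helper is my own invention rather than anything a forensic database actually runs.

```python
from collections import Counter


def shared_alleles(profile_a, profile_b):
    """Count alleles shared between two STR profiles, locus by locus.

    Each profile maps a locus name to a pair of alleles (repeat counts).
    Only loci present in both profiles are compared.
    """
    total = 0
    for locus in profile_a.keys() & profile_b.keys():
        a = Counter(profile_a[locus])
        b = Counter(profile_b[locus])
        # Multiset intersection: an allele counts once per matching copy.
        total += sum((a & b).values())
    return total


# Hypothetical genotypes, invented for this example.
parent = {"D3S1358": (15, 17), "TH01": (6, 9.3), "FGA": (21, 24)}
child = {"D3S1358": (15, 16), "TH01": (9.3, 9.3), "FGA": (22, 24)}

print(shared_alleles(parent, child))   # one shared allele at each of 3 loci
print(shared_alleles(parent, parent))  # identical profiles share everything
```

A parent and child share at least one allele at every locus, so in this toy case they share 3 of 6 alleles, while a sample compared against itself shares all 6.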

From Genomes to STRs

There’s a lot of human genome resequencing data in the public domain. Most of that is by design: the 1000 Genomes Project generated and made available sequence data for more than 1,000 individuals in an effort to further characterize human genetic variation. Of course, whole genome sequences include an individual’s complete STR profile. However, since STRs are (by definition) tandem repeats, and massively parallel sequencing reads were initially too short to span an entire repeat region along with its flanking sequence, reconstructing STRs from whole-genome sequencing (WGS) data wasn’t really feasible. That changed over the past few years as read lengths grew and new analysis tools became available.

One such tool, called lobSTR, enables one to extract STR genotypes from genomic data. So one can go from freely available sequence data to the donating individual’s STR profile.
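The read-length problem is easy to see in a toy example. The sketch below is not how lobSTR works (it is a naive regular-expression scan that I made up for illustration, with invented read sequences), but it shows why a read only tells you the repeat count when it covers the whole repeat plus some flanking sequence on both sides.

```python
import re


def repeat_count_if_spanned(read, motif):
    """Return the copy number of the longest run of `motif` in `read`,
    or None if that run touches either end of the read (meaning the read
    may not have captured the entire repeat)."""
    pattern = f"(?:{re.escape(motif)})+"
    longest = max(re.finditer(pattern, read),
                  key=lambda m: len(m.group()), default=None)
    if longest is None:
        return None
    if longest.start() == 0 or longest.end() == len(read):
        return None  # repeat runs off the read: true length unknown
    return len(longest.group()) // len(motif)


# Hypothetical reads over a 12-copy GATA repeat.
long_read = "ACGT" + "GATA" * 12 + "TTCA"   # spans both flanks
short_read = "GATA" * 5                     # falls entirely inside the repeat

print(repeat_count_if_spanned(long_read, "GATA"))   # 12
print(repeat_count_if_spanned(short_read, "GATA"))  # None
```

The short read is all repeat, so there is no way to know whether the allele has 5 copies or 50. Real tools deal with alignment, sequencing errors, and diploid genotypes, none of which this sketch attempts.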

Genealogy Databases Link Y-STRs to Surnames

Though interest in STRs for research use has waned in recent years (there are denser and more prevalent markers that are better for genetic mapping), they remain inexpensive to genotype, and they’re still useful for forensics, paternity testing, and genealogy. Public interest in the latter has grown considerably over the past several years; services such as Ancestry.com enable one to trace a lineage back for many generations using archives of public records.

Many genealogy enthusiasts have uploaded their information (and STR profiles) to public databases. Two of the largest such databases contain ~135,000 Y-STR profiles linked to ~39,000 surnames. The distribution of surnames correlates well with their frequency in the U.S., suggesting that the population is well-represented. More importantly, these databases are freely accessible to anyone with an internet connection and some Y-STR alleles to put in.

Gymrek et al. developed a “surname recovery algorithm” which, given an individual’s STR genotypes, mines these databases to retrieve the most likely surname. They challenged the algorithm to infer the surnames of 911 individuals based on their STR profiles (their true surnames were known) and used the results to tune it for sensitivity. All told, their analysis projects a 12% success rate for recovering the surnames of U.S. Caucasian males.
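For a rough feel of what a surname lookup involves, here is a simplified Python sketch. To be clear, this is not the authors' algorithm: it just scores each database record by the fraction of shared Y-STR markers with identical repeat counts and keeps the best score per surname. The marker names are real Y-STR loci, but the records, surnames, the `rank_surnames` function, and the `min_shared` cutoff are all assumptions of mine.

```python
def rank_surnames(query, records, min_shared=3):
    """Rank surnames by how well their best record matches the query.

    query:   dict mapping Y-STR marker name -> repeat count
    records: iterable of (surname, profile) pairs, profile like query
    Records sharing fewer than `min_shared` markers are ignored.
    """
    best = {}
    for surname, profile in records:
        shared = query.keys() & profile.keys()
        if len(shared) < min_shared:
            continue  # too little overlap to say anything
        score = sum(query[m] == profile[m] for m in shared) / len(shared)
        best[surname] = max(best.get(surname, 0.0), score)
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical query and genealogy records, invented for this example.
query = {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}
records = [
    ("Smith", {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}),
    ("Smith", {"DYS19": 14, "DYS390": 24, "DYS391": 10, "DYS393": 13}),
    ("Jones", {"DYS19": 15, "DYS390": 23, "DYS391": 11, "DYS393": 13}),
    ("Lee", {"DYS19": 14}),
]

print(rank_surnames(query, records))
```

In this toy data, Smith comes out on top with a perfect match, Jones matches on half the markers, and the Lee record is skipped for lack of overlap. A serious implementation would also need to account for mutation rates and decide when the top hit is too weak to report, which is presumably where the tuning for sensitivity comes in.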

Identifying the Individual: Age and State of Residence

The final link that enabled Gymrek and colleagues to breach the privacy of anonymous sample donors was the simple fact that age (and therefore year of birth) and place of residence are not considered protected health information (PHI). The name, age, and place of residence of millions of individuals are a matter of public record. In the case of the CEPH Utah samples, we all know they’re individuals of Northern/Western European descent living in Utah, and the age of the donors at time of donation was published by the sample repository (Coriell).

Using 10 CEU samples that had decent STR genotypes, the authors recovered top-matching surname records with Mormon ancestry for 8 of them. That, combined with the demographic information, let them fully identify 5 of 10 CEU samples and (because Coriell provides pedigree information) their families. That’s around 50 individuals, all told, whose privacy was breached using essentially a couple of informatics tools and a web browser.
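The last step is almost embarrassingly simple, which is sort of the point. Here is a hedged Python sketch of narrowing a set of public records using an inferred surname, an age at donation, and a state. Everything in it is hypothetical: the `PublicRecord` fields, the made-up records, the donation year, and the one-year slack on birth year are my assumptions, not details from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PublicRecord:
    name: str
    surname: str
    birth_year: int
    state: str


def candidates(records, surname, age_at_donation, donation_year, state, slack=1):
    """Return records consistent with the inferred surname, state, and
    approximate birth year (donation year minus age, give or take `slack`)."""
    target_year = donation_year - age_at_donation
    return [
        r for r in records
        if r.surname == surname
        and r.state == state
        and abs(r.birth_year - target_year) <= slack
    ]


# Entirely made-up records for illustration.
records = [
    PublicRecord("A. Example", "Example", 1930, "UT"),
    PublicRecord("B. Example", "Example", 1962, "UT"),
    PublicRecord("C. Example", "Example", 1931, "NV"),
    PublicRecord("D. Other", "Other", 1930, "UT"),
]

hits = candidates(records, "Example", age_at_donation=55,
                  donation_year=1985, state="UT")
print([r.name for r in hits])
```

Each filter on its own leaves plenty of people, but stacking three of them quickly shrinks the list. In this toy case only one record survives.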

Privacy, Trust, and Policy

I was displeased when I first heard about this study. Why would someone do such a thing as threaten the security of any research samples, especially ones in which the genomics community has invested substantial money and time? Reading the paper eased some of my concerns: in all cases, the informed consents of sample donors stated privacy breach as a potential risk, and the data usage terms did not explicitly prevent re-identification. Further, the authors politely notified representatives of the funding organizations involved before the study came out, giving them an opportunity to (1) take immediate action to strengthen privacy of CEPH samples, and (2) write an opinion piece on the complexities of genomic identifiability, which appears in the same issue of Science.

Moreover, the motives of the authors seem pure. Senior author Yaniv Erlich, according to what I’ve read, once worked as a “white hat” — a hacker paid by corporations to identify their IT weaknesses. That’s essentially what this study represents: an early warning shot about our growing inability to safeguard the privacy of sample donors. The risk of surname inference by these or similar methods, as mentioned by the authors, is only going to grow. Sequencing and informatics technologies will continue to improve, enabling more complete and accurate STR genotyping with sequence data.

Attempting to restrict genetic genealogy information is simply not a viable solution: the data are scattered all over the web already, and archived, undoubtedly, on many servers. Anything, and mark my words here, anything that appears on the web is no longer secure information. I don’t care how briefly it appeared, or what you think your Facebook “security settings” are. Groups and individuals are actively crawling and archiving web content. As much as I hate to say it, if Yaniv Erlich’s group can accomplish this, notify officials, and get it published, all kinds of people with less noble motives can do so as well. Some probably have already.

At the same time, data sharing is crucial to the success of genomic research. How can we continue it, while still working to protect the privacy of sample donors? Controlled access is one strategy that can help mitigate data misuse. Ultimately, however, we will need bulletproof legislation that prevents abuse of genetic information. In the U.S., we already have legislation that prevents health insurance companies from discriminating against individuals based on genetic information, but it doesn’t apply to life insurance or long-term care policies. Unless such protections are put in place, insurance companies will find ways to obtain and use genetic information to increase profit margins. It’s what they do.

Genetic research is absolutely dependent on sample donors and volunteers. The NIH/NHGRI authors of the policy piece are right when they point out that the researcher-participant relationship is all about trust. We get exactly one chance to get this right. Education will be a big part of it. Simply put, the volunteers for past and future research need to be aware that re-identification can happen, even with excellent safeguards in place. Conducting ethical research, respecting privacy, and keeping our promises will also help maintain the trust of research participants. Because without them, we have nothing to sequence, and nothing to study.

One day, we might very well be able to predict much about an individual’s health from his or her genetic information. Unless the right protections are in place, that fantastic achievement could be subverted, as it is in GATTACA. And let me tell you, that was a grim world indeed.

References

Gymrek M, McGuire AL, Golan D, Halperin E, Erlich Y (2013). Identifying personal genomes by surname inference. Science, 339(6117), 321-324. PMID: 23329047
