The genetic data posted online seemed perfectly anonymous: strings of billions of DNA letters from more than 1,000 people. However, all it took was some clever sleuthing on the Web for a genetics researcher to identify five people he randomly selected from the study group. Not only that, he found their entire families, even though the relatives had no part in the study. In all, he identified nearly 50 people.
The researcher did not reveal the names of the people he found, but the exercise, published on Friday in the journal Science, illustrates the difficulty of protecting the privacy of volunteers involved in medical research when the genetic information they provide needs to be public so scientists can use it.
Other reports have identified people whose genetic data were online, but none had done so using such limited information: the long strings of DNA letters, an age and, because the study focused only on US subjects, a state.
“I’ve been worried about this for a long time,” said Barbara Koenig, a researcher at the University of California, San Francisco, who studies issues involving genetic data. “We always should be operating on the assumption that this is possible.”
The data are from an international study, the 1000 Genomes Project, that is collecting genetic information from people around the world and posting it online so researchers can use it freely. The posted data also include the ages of participants and the regions where they live. That information, a genealogy Web site and Google searches were sufficient to find complete family trees.
While the methods for extracting relevant genetic data from the raw genetic sequence files were specialized enough to be beyond the reach of most laypeople, no one expected it to be so easy to zoom in on individuals.
“We are in what I call an awareness moment,” said Eric Green, director of the National Human Genome Research Institute at the National Institutes of Health (NIH).
There is no easy answer on how to protect the privacy of study subjects. Subjects might be made more aware that they could be identified by their DNA sequences. More data could be locked behind security walls, or severe penalties could be instituted for those who invade the privacy of subjects.
“We don’t have any claim to have the answer,” Green said.
And opinions about just what should be done vary greatly among experts.
After seeing how easy it was to find the individuals and their extended families, the NIH removed people’s ages from the public database, making it more difficult to identify them.
However, Jeffrey Botkin, associate vice president for research integrity at the University of Utah, which collected the genetic information of some research participants whose identities were breached, cautioned against overreacting.
Genetic data from hundreds of thousands of people have been freely available online, he said, yet there has not been a single report of someone being illicitly identified.
He added that “it is hard to imagine what would motivate anyone to undertake this sort of privacy attack in the real world.”
However, he said he had serious concerns about publishing a formula to breach subjects’ privacy.
By publishing, he said, the investigators “exacerbate the very risks they are concerned about.”
The project was the inspiration of Yaniv Erlich, a human genetics researcher at the Whitehead Institute, which is affiliated with the Massachusetts Institute of Technology.