He stresses that he is a strong advocate of data sharing and would hate to see genomic data locked up. However, when his lab developed a new technique, he realized he had the tools to probe a DNA database. And he could not resist trying.
The tool allowed him to quickly find a type of DNA pattern that looks like stutters among billions of chemical letters in human DNA. Those little stutters — short tandem repeats — are inherited.
Genealogy Web sites use repeats on the Y chromosome, the one unique to men, to identify men by their surnames, an indicator of ancestry. Any man can submit the short tandem repeats on his Y chromosome and find the surnames of men with the same DNA pattern. The sites enable men to find their ancestors and relatives.
So, Erlich asked, could he take a man’s entire DNA sequence, pick out the short tandem repeats on his Y chromosome, search a genealogy site, discover the man’s surname and then fully identify the man?
He tested it with the genome of Craig Venter, a DNA sequencing pioneer who posted his own DNA sequence on the Web. He knew Venter’s age and the state where he lives. Bingo: Two men popped up in the database. One was Craig Venter.
“Out of 300 million people in the United States, we got it down to two people,” Erlich said.
He and his colleagues calculated they would be able to identify, from DNA sequences alone, the last names of approximately 12 percent of middle-class and wealthier white men, the population that tends to submit DNA data to recreational sites like the genealogical ones. Then, by combining the men’s last names with their ages and the states where they lived, the researchers should be able to narrow their search to just a few likely individuals.
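The narrowing logic can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the surnames, people, repeat counts and function names are invented, and the exact-match rule stands in for whatever matching the genealogy sites and the researchers actually used. It shows only the general idea of matching a Y-chromosome repeat profile to a surname, then filtering by birth year and state.

```python
from dataclasses import dataclass

# Hypothetical genealogy database: surname -> Y-chromosome repeat counts.
# The repeat counts are made up for this example.
GENEALOGY_DB = {
    "Alder": {"DYS19": 14, "DYS390": 24, "DYS391": 11},
    "Birch": {"DYS19": 15, "DYS390": 23, "DYS391": 10},
}


@dataclass
class Person:
    name: str
    surname: str
    birth_year: int
    state: str


# Hypothetical public records; every entry is fictional.
PUBLIC_RECORDS = [
    Person("Pat Alder", "Alder", 1950, "UT"),
    Person("Lee Alder", "Alder", 1982, "UT"),
    Person("Kim Alder", "Alder", 1950, "CA"),
    Person("Jo Birch", "Birch", 1950, "UT"),
]


def matching_surnames(profile, db, max_mismatches=0):
    """Return surnames whose stored repeat counts match the profile."""
    hits = []
    for surname, reference in db.items():
        shared = profile.keys() & reference.keys()
        mismatches = sum(profile[m] != reference[m] for m in shared)
        if shared and mismatches <= max_mismatches:
            hits.append(surname)
    return hits


def narrow(surnames, birth_year, state, records):
    """Keep only records that agree on surname, birth year and state."""
    return [
        p for p in records
        if p.surname in surnames
        and p.birth_year == birth_year
        and p.state == state
    ]


# Repeat counts read from an "anonymous" genome (invented values).
profile = {"DYS19": 14, "DYS390": 24, "DYS391": 11}
surnames = matching_surnames(profile, GENEALOGY_DB)
candidates = narrow(surnames, 1950, "UT", PUBLIC_RECORDS)
print(surnames, [p.name for p in candidates])
```

In this invented data set, three people share the matching surname, but only one also fits the birth year and state, which is the kind of narrowing the researchers describe.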
DNA sequences from subjects in an international collaboration, the 1000 Genomes Project, are publicly available on the Web. People’s ages were included and all the Americans lived in Utah, so the researchers knew their state.
Erlich began with one man from the database. He extracted the Y chromosome’s short tandem repeats and then went to genealogy databases and searched for men with those same repeats. He got the surnames of the paternal and maternal grandfathers. Then he did a Google search for those people and found an obituary. That gave him the family tree.
“Oh my God, we really did this,” Erlich said. “I had to digest it. We had so much information.”
He and his colleagues went on to get detailed family trees for other subjects and then visited Green and his colleagues at the NIH to tell them what they had done.
They were referred to Amy McGuire, a lawyer and ethicist at Baylor College of Medicine in Houston, Texas. She, like others, called for more public discussion of the situation.
“To have the illusion you can fully protect privacy or make data anonymous is no longer a sustainable position,” McGuire said.
When the subjects in the 1000 Genomes Project agreed to participate and provide DNA, they signed a form saying that the researchers could not guarantee their privacy. However, at the time, it seemed like so much boilerplate.
The risk, Green said, seemed “remote.”
“I don’t know that anyone anticipated that someone would go and actually figure out who some of those people were,” McGuire said.