Wed, Oct 14, 2009 - Page 9

A looming data glut threatens computer science

While consumers are just starting to think about buying external hard drives for their PCs, scientists are already thinking in petabytes

By Ashlee Vance / NY TIMES NEWS SERVICE, MOUNTAIN VIEW, CALIFORNIA

It is a rare criticism of elite American university students that they do not think big enough. But that is exactly the complaint from some of the largest technology companies and the federal government.

At the heart of this criticism is data. Researchers and workers in fields as diverse as biotechnology, astronomy and computer science will soon find themselves overwhelmed with information. Better telescopes and genome sequencers are as much to blame for this data glut as are faster computers and bigger hard drives.

While consumers are just starting to comprehend the idea of buying external hard drives for the home capable of storing a terabyte of data, computer scientists need to grapple with data sets thousands of times as large and growing ever larger. (A single terabyte equals 1,000 gigabytes and could store about 1,000 copies of the Encyclopedia Britannica.)

The next generation of computer scientists has to think in terms of what could be described as Internet scale. Facebook, for example, uses more than 1 petabyte of storage space to manage its users’ 40 billion photos. (A petabyte is about 1,000 times as large as a terabyte, and could store about 500 billion pages of text.)
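The scale figures in the two paragraphs above are easy to sanity-check. The following is a minimal sketch, assuming decimal units (1 terabyte = 1,000 gigabytes, 1 petabyte = 1,000 terabytes), roughly 2,000 bytes for a page of plain text and about 1 gigabyte for the full text of the Encyclopedia Britannica; the per-page and per-encyclopedia sizes are assumptions for illustration, not figures from the article.

```python
# Back-of-the-envelope check of the storage figures cited above.
# Assumptions not taken from the article: one page of plain text ~ 2,000 bytes,
# full text of the Encyclopedia Britannica ~ 1 GB.

GB = 10**9        # decimal gigabyte, in bytes
TB = 1_000 * GB   # 1 terabyte = 1,000 gigabytes
PB = 1_000 * TB   # 1 petabyte = 1,000 terabytes

BYTES_PER_PAGE = 2_000       # assumed size of one page of text
BRITANNICA_BYTES = 1 * GB    # assumed size of the encyclopedia's full text

print(f"Britannica copies per terabyte: {TB // BRITANNICA_BYTES:,}")   # ~1,000
print(f"Pages of text per petabyte:     {PB // BYTES_PER_PAGE:,}")     # ~500,000,000,000

# Bytes per photo implied by the article's own figures (1 PB for 40 billion photos):
print(f"Implied average bytes per photo: {PB // 40_000_000_000:,}")    # ~25,000
```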

It was not long ago that the notion of one company having anything close to 40 billion photos would have seemed tough to fathom. Google, meanwhile, churns through 20 times that amount of information every single day just running data analysis jobs. In short order, DNA sequencing systems, too, will generate many petabytes of information a year.

“It sounds like science fiction, but soon enough, you’ll hand a machine a strand of hair, and a DNA sequence will come out the other side,” said Jimmy Lin, an associate professor at the University of Maryland, during a technology conference last week.

The big question is whether the person on the other side of that machine will have the wherewithal to do something interesting with an almost limitless supply of genetic information.

At the moment, companies like IBM and Google have their doubts.

For the most part, university students have used rather modest computing systems to support their studies. They are learning to collect and manipulate information on personal computers or what are known as clusters, where computer servers are cabled together to form a larger computer. But even these machines fail to churn through enough data to really challenge and train a young mind meant to ponder the mega-scale problems of tomorrow.

“If they imprint on these small systems, that becomes their frame of reference and what they’re always thinking about,” said Jim Spohrer, a director at IBM’s Almaden Research Center.

Two years ago, IBM and Google set out to change the mindset at universities by giving students broad access to some of the largest computers on the planet. The companies then outfitted the computers with software that Internet companies use to tackle their toughest data analysis jobs.

And, rather than building a big computer at each university, the companies created a system that let students and researchers tap into giant computers over the Internet.
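The article does not say which software the companies installed, but data analysis at this scale is typically expressed in a MapReduce style (Hadoop is the best-known open-source framework of this kind): a map step processes each input record independently, so it can run in parallel across thousands of servers, and a reduce step combines the intermediate results. The sketch below is a deliberately tiny, single-machine illustration of that idea, counting words; the sample input and function names are invented for the example.

```python
# Minimal single-machine sketch of MapReduce-style processing, the kind of
# data-analysis model alluded to above. In a real cluster the map and reduce
# steps run in parallel across many servers; here they run in one process.
from collections import defaultdict

def map_step(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_step(key, values):
    """Combine all counts for one word into a total."""
    return key, sum(values)

# Illustrative input only; a real job would read terabytes from distributed storage.
lines = [
    "better telescopes and genome sequencers",
    "faster computers and bigger hard drives",
]

intermediate = [pair for line in lines for pair in map_step(line)]
counts = dict(reduce_step(k, v) for k, v in shuffle(intermediate).items())
print(counts["and"])  # -> 2
```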

This year, the National Science Foundation, a federal government agency, gave the project a vote of confidence by splitting US$5 million among 14 universities that want to teach their students how to grapple with big-data questions.
