Professor Peter Murray-Rust was looking for new ways to make better drugs. Dr Heather Piwowar wanted to track how scientific papers were cited and shared by researchers around the world. Dr Casey Bergman wanted a way for doctors and scientists to quickly navigate the latest research in genetics, to help treat patients and further their research.
All of them needed access to tens of thousands of research papers at once, so they could use computers to look for unseen patterns and associations across the millions of words in the articles. This technique, called text mining, is a vital 21st-century research method. It uses powerful computers to find links between drugs and side effects, or genes and diseases, that are hidden within the vast scientific literature. These are discoveries that a person scouring through papers one by one may never notice.
It is a technique with a lot of potential. A report published by the McKinsey Global Institute last year said that “big data” technologies such as text and data mining had the potential to create 250 billion euros (US$314 billion) of annual value for Europe’s economy, if researchers were allowed to make full use of them.
Unfortunately, in most cases, text mining is forbidden. Bergman, Murray-Rust, Piwowar and countless other academics are prevented from using the most modern research techniques because big publishing companies such as Macmillan, Wiley and Elsevier, which control the distribution of most of the world’s academic literature, by default do not allow text mining of the content behind their expensive paywalls.
Any such project requires special dispensation from — and time-consuming individual negotiations with — the scores of publishers that may be involved.
“That’s the key fact which is halting progress in this field,” said Robert Kiley, head of digital services at the Wellcome Trust. “For a lot of people, though there is promise there, the activation effort is just too great.”
The restrictions have led campaigners to view the issue as another front in the battle to make the fruits of publicly funded research available through “open access,” free at the point of use. That would allow researchers to mine the content freely without needing to request any extra permissions.
The scale of new information in modern science is staggering: More than 1.5 million academic articles are published every year and the volume of data doubles every three years. No individual can keep up with such a volume, so scientists need computers to help them make sense of the information.
Bergman, an evolutionary biologist at the University of Manchester, used text mining to create a tool to help scientists make sense of the ever-growing research literature on genetics. Though genetic sequences of living organisms are publicly available, discussions of what the sequences do and how they interact with each other sit within the text of scientific papers that are mostly behind paywalls.
Working with Max Haeussler, of the University of California, Santa Cruz, Bergman came up with Text2genome, which identifies strings of text in thousands of papers that look like the letters of a DNA sequence — say, a gene — and links together all papers that mention or discuss that sequence. Text2genome could allow a clinician who may not be an expert on a particular gene to access relevant literature quickly and easily.
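For readers curious how such linking might work in principle, here is a minimal Python sketch. It is not Text2genome's actual method, which the article does not describe in detail. It assumes that any run of 12 or more letters drawn only from A, C, G and T "looks like" DNA, and that papers are available as plain text keyed by an identifier. The function names, the length threshold and the sample texts are all hypothetical.

```python
import re
from collections import defaultdict

# Assumption: a run of at least MIN_LENGTH letters drawn only from A, C, G, T
# is treated as "looking like" DNA. The threshold is illustrative only.
MIN_LENGTH = 12
DNA_PATTERN = re.compile(r"\b[ACGT]{%d,}\b" % MIN_LENGTH)


def find_dna_like_strings(text):
    """Return the set of DNA-like strings found in one paper's text."""
    return set(DNA_PATTERN.findall(text))


def index_papers(papers):
    """Map each DNA-like string to the set of paper IDs that contain it."""
    index = defaultdict(set)
    for paper_id, text in papers.items():
        for sequence in find_dna_like_strings(text):
            index[sequence].add(paper_id)
    return index


def linked_papers(index):
    """Keep only the sequences that appear in more than one paper."""
    return {seq: sorted(ids) for seq, ids in index.items() if len(ids) > 1}


# Made-up example texts, not taken from real papers.
papers = {
    "paper_a": "The primer ACGTACGTACGTTTGA was used to amplify the region.",
    "paper_b": "We observed binding at ACGTACGTACGTTTGA in all samples.",
    "paper_c": "A CATALOG of unrelated results with motif GGGCCCAAATTTGG.",
}

links = linked_papers(index_papers(papers))
print(links)
```

In this toy run, the two papers that quote the same sequence are linked, while the third paper's sequence is indexed but linked to nothing. A real system would also have to cope with sequences split across lines, lowercase text and near-matches, none of which this sketch attempts.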