Coffee was wildly popular in Sweden in the 18th century, and also illegal. King Gustav III believed that it was a slow poison and devised a clever experiment to prove it. He commuted the sentences of murderous twin brothers who were waiting to be beheaded, on one condition: One brother had to drink three pots of coffee every day while the other drank three pots of tea. The early death of the coffee drinker would prove that coffee was poison.
It turned out that the coffee-drinking twin outlived the tea drinker, but it was not until the 1820s that Swedes were finally legally permitted to do what they had been doing all along — drink coffee, lots of coffee.
The cornerstone of the scientific revolution is the insistence that claims be tested with data, ideally in a randomized controlled trial. Gustav’s experiment was noteworthy for his use of identical male twins, which eliminated the confounding effects of sex, age and genes. The most glaring weakness was that nothing statistically persuasive can come from such a small sample.
Today, the problem is not the scarcity of data, but the opposite. We have too much data, and it is undermining the credibility of science.
Luck is inherent in random trials. In a medical study, some patients might be healthier. In an agricultural study, some soil might be more fertile. In an educational study, some students might be more motivated. Researchers consequently calculate the probability (the p-value) that outcomes at least as striking as those observed would happen by chance alone. A low p-value indicates that the results cannot easily be attributed to the luck of the draw.
How low? In the 1920s, the great British statistician Ronald Fisher said that he considered p-values below 5 percent to be persuasive and, so, 5 percent became the hurdle for the “statistically significant” certification needed for publication, funding and fame.
It is not a difficult hurdle. Suppose that a hapless researcher calculates the correlations among hundreds of variables, blissfully unaware that the data are all, in fact, random numbers. On average, one out of 20 correlations will be statistically significant, even though every correlation is nothing more than coincidence.
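The one-in-20 figure can be checked with a small simulation. The sketch below is illustrative only: the 40 variables, 100 observations per variable, the seed, and the function names are all assumptions chosen to keep it fast, and the p-value uses the Fisher z-transform normal approximation rather than an exact test.

```python
import math
import random
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Two-sided p-value for r via the Fisher z-transform (an approximation)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def count_significant(num_vars=40, n_obs=100, alpha=0.05, seed=0):
    """Correlate every pair of pure-noise variables and count p < alpha."""
    rng = random.Random(seed)
    # Every "variable" is just independent random numbers (hypothetical data).
    data = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(num_vars)]
    hits = 0
    pairs = 0
    for i in range(num_vars):
        for j in range(i + 1, num_vars):
            pairs += 1
            r = pearson_r(data[i], data[j])
            if approx_p_value(r, n_obs) < alpha:
                hits += 1
    return hits, pairs


hits, pairs = count_significant()
print(f"{hits} of {pairs} correlations are 'significant' ({hits / pairs:.1%})")
```

With 40 variables there are 780 distinct pairs, so the share flagged as significant should land near 5 percent even though nothing real is being measured.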
Real researchers do not correlate random numbers but, all too often, they correlate what are essentially randomly chosen variables. This haphazard search for statistical significance even has a name: data mining. As with random numbers, the correlation between randomly chosen, unrelated variables has a 5 percent chance of being fortuitously statistically significant. Data mining can be augmented by manipulating, pruning and otherwise torturing the data to get low p-values.
To find statistical significance, one need merely look sufficiently hard. Thus, the 5 percent hurdle has had the perverse effect of encouraging researchers to do more tests and report more meaningless results.
As a result, silly relationships are published in good journals simply because the results are statistically significant.
Students do better on a recall test if they study for the test after taking it (Journal of Personality and Social Psychology). Japanese-Americans are prone to heart attacks on the fourth day of the month (British Medical Journal). Bitcoin prices can be predicted from stock returns in the paperboard, containers and boxes industry (National Bureau of Economic Research). Elderly Chinese women can postpone their deaths until after the celebration of the Harvest Moon Festival (Journal of the American Medical Association). Women who eat breakfast cereal daily are more likely to have male babies (Proceedings of the Royal Society). People can use power poses to increase their dominance hormone testosterone and reduce their stress hormone cortisol (Psychological Science). Hurricanes are deadlier if they have female names (Proceedings of the National Academy of Sciences). Investors can obtain a 23 percent annual return in the market by basing their buy/sell decisions on the number of Google searches for the word “debt” (Scientific Reports).
These now-discredited studies are the tip of a statistical iceberg that has come to be known as the replication crisis.
A team led by John Ioannidis looked at attempts to replicate 34 highly respected medical studies and found that only 20 were confirmed. The Reproducibility Project attempted to replicate 97 studies published in leading psychology journals and confirmed only 35. The Experimental Economics Replication Project attempted to replicate 18 experimental studies reported in leading economics journals and confirmed only 11.
I wrote a satirical paper that was intended to demonstrate the folly of data mining. I looked at former US president Donald Trump’s voluminous Twitter posts and found statistically significant correlations between: Trump tweeting the word “president” and the S&P 500 index two days later; Trump tweeting the word “ever” and the temperature in Moscow four days later; Trump tweeting the word “more” and the price of tea in China four days later; and Trump tweeting the word “democrat” and some random numbers I had generated.
I concluded — tongue as firmly in cheek as I could hold it — that I had found “compelling evidence of the value of using data-mining algorithms to discover statistically persuasive, heretofore unknown correlations that can be used to make trustworthy predictions.”
I naively assumed that readers would get the point of this nerd joke: Large data sets can easily be mined and tortured to identify patterns that are utterly useless. I submitted the paper to an academic journal and the reviewer’s comments demonstrate beautifully how deeply embedded is the notion that statistical significance supersedes common sense: “The paper is generally well written and structured. This is an interesting study and the authors have collected unique datasets using cutting-edge methodology.”
It is tempting to believe that more data means more knowledge.
However, the explosion in the number of things that are measured and recorded has magnified beyond belief the number of coincidental patterns and bogus statistical relationships waiting to deceive us.
If the number of true relationships yet to be discovered is limited, while the number of coincidental patterns is growing exponentially with the accumulation of more and more data, then the probability that a randomly discovered pattern is real is inevitably approaching zero.
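The arithmetic behind that claim can be sketched in a few lines. The numbers here are assumptions made purely for illustration and do not come from any study: 10 true relationships, a 5 percent significance hurdle, and an 80 percent chance that a real relationship passes it. The function name `prob_pattern_is_real` is likewise hypothetical.

```python
def prob_pattern_is_real(num_true, num_tested, alpha=0.05, power=0.8):
    """Share of statistically significant patterns that are real.

    num_true   -- real relationships among the candidates (assumed fixed)
    num_tested -- total candidate patterns examined
    alpha      -- significance hurdle, i.e. the false-positive rate
    power      -- assumed chance that a real relationship tests significant
    """
    num_false = num_tested - num_true
    true_hits = power * num_true      # real patterns that pass the hurdle
    false_hits = alpha * num_false    # coincidences that pass the hurdle
    return true_hits / (true_hits + false_hits)


# Hold the number of true relationships fixed while the data pile grows.
for num_tested in (100, 1_000, 10_000, 100_000, 1_000_000):
    share = prob_pattern_is_real(num_true=10, num_tested=num_tested)
    print(f"{num_tested:>9,} patterns tested: {share:.4%} of discoveries are real")
```

Because the count of coincidental hits grows with the number of patterns tested while the count of real hits does not, the printed share shrinks toward zero as the candidate pool expands.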
The problem today is not that we have too little data, but that we have too much, which seduces researchers into ransacking it for patterns that are easy to find, likely to be coincidental, and unlikely to be useful.
Gary Smith, an economics professor at Pomona College, is the author of The AI Delusion and the forthcoming Distrust: Big Data, Data-Torturing, and the Assault on Science.
This column does not necessarily reflect the opinion of the editorial board or Bloomberg LP and its owners.