For the past decade, artificial intelligence (AI) has been used to recognize faces, rate creditworthiness and predict the weather. Over the same period, hacks have grown more sophisticated and stealthier. The combination of AI and cybersecurity was inevitable as both fields sought better tools and new uses for their technology. However, there is a massive problem that threatens to undermine these efforts and could allow adversaries to bypass digital defenses undetected.
The danger is data poisoning: manipulating the information used to train machines offers a virtually untraceable method to circumvent AI-powered defenses. Many companies might not be ready to deal with escalating challenges.
The global market for AI cybersecurity is already expected to triple by 2028 to US$35 billion. Security providers and their clients might have to patch together multiple strategies to keep threats at bay.
Illustration: Louise Ting
The very nature of machine learning, a subset of AI, is the target of data poisoning.
Given reams of data, computers can be trained to categorize information correctly. A system might not have seen a picture of Lassie, but given enough examples of different animals that are correctly labeled by species (and even breed) it should be able to surmise she is a dog. With even more samples, it would be able to correctly guess the breed of the famous TV canine: rough collie.
The computer does not really know. It is merely making a statistically informed inference based on past training data.
That same approach is used in cybersecurity. To catch malicious software, companies feed their systems with data and let the machine learn by itself. Computers armed with numerous examples of both good and bad code can learn to look out for malicious software (or even snippets of software) and catch it.
An advanced technique called a neural network, which mimics the structure and processes of the human brain, runs through training data and makes adjustments based on both known and new information.
Such a network need not have seen a specific piece of malevolent code to surmise that it is bad. It has learned for itself and can adequately predict good versus evil.
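To make the idea concrete, here is a minimal sketch of learning from labeled examples. It is not a neural network and not any vendor's detector: it uses a far simpler nearest-centroid classifier, and the two features (share of suspicious API calls, byte entropy) and all the numbers are invented for illustration.

```python
from math import dist

def train_centroids(samples):
    """Average the feature vectors of each label into one centroid per label."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def classify(centroids, features):
    """Return the label whose centroid lies closest to the new sample."""
    return min(centroids, key=lambda label: dist(centroids[label], features))

# Hypothetical features, both scaled 0-1:
# (share of suspicious API calls, byte entropy)
TRAINING = [
    ((0.05, 0.30), "benign"), ((0.10, 0.35), "benign"), ((0.08, 0.25), "benign"),
    ((0.80, 0.90), "malicious"), ((0.75, 0.85), "malicious"), ((0.90, 0.95), "malicious"),
]

model = train_centroids(TRAINING)
verdict = classify(model, (0.70, 0.80))  # a sample that is not in TRAINING
print(verdict)
```

The sample at (0.70, 0.80) appears nowhere in the training data, yet it lands on the malicious side because it sits closer to the average of the known bad examples. That is the statistical inference at work, and it is only as trustworthy as the labels it was built from.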
All of that is powerful, but it is not invincible.
Machine-learning systems require a huge number of correctly labeled samples to start getting good at prediction. Even the largest cybersecurity companies are able to collate and categorize only a limited number of examples of malware, so they have little choice but to supplement their training data. Some of the data can be crowd-sourced.
“We already know that a resourceful hacker can leverage this observation to their advantage,” Giorgio Severi, a doctoral student at Northwestern University, said in a recent presentation at the USENIX Security Symposium.
Using the animal analogy, if feline-phobic hackers wanted to cause havoc, they could label a bunch of photos of sloths as cats, and feed the images into an open-source database of house pets. Since the tree-hugging mammals would appear far less often in a corpus of domesticated animals, this small sample of poisoned data has a good chance of tricking a system into spitting out sloth pics when asked to show kittens.
More malicious hackers use the same technique.
By carefully crafting malicious code, labeling these samples as good, and then adding them to a larger batch of data, a hacker can trick a neural network into surmising that a snippet of software that resembles the bad example is, in fact, harmless.
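The attack can be sketched in a few lines. This is a toy illustration under stated assumptions, not the method of any researcher cited here: the model is a simple three-nearest-neighbor vote rather than a neural network, and the feature values are made up. At this tiny scale the poisoned share of the data is also far larger than it would need to be in a real corpus.

```python
from collections import Counter
from math import dist

def knn_predict(training, features, k=3):
    """Label a sample by majority vote among its k closest training samples."""
    nearest = sorted(training, key=lambda s: dist(s[0], features))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical, correctly labeled training data.
CLEAN = [
    ((0.05, 0.30), "benign"), ((0.10, 0.35), "benign"), ((0.08, 0.25), "benign"),
    ((0.80, 0.90), "malicious"), ((0.75, 0.85), "malicious"), ((0.90, 0.95), "malicious"),
]

# The attacker's contribution: samples that resemble the planned
# malware but are deliberately labeled as good.
POISON = [
    ((0.77, 0.88), "benign"), ((0.79, 0.87), "benign"), ((0.78, 0.89), "benign"),
]

TARGET = (0.78, 0.88)  # the malware the attacker intends to deploy later

before = knn_predict(CLEAN, TARGET)
after = knn_predict(CLEAN + POISON, TARGET)
print(f"clean model: {before}, poisoned model: {after}")
```

Trained on clean data, the model flags the target as malicious. Once the three mislabeled look-alikes are mixed in, they become the target's closest neighbors and outvote the honest examples, so the same code is waved through as benign.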
Catching the miscreant samples is almost impossible. It is far harder for a human to rummage through computer code than to sort pictures of sloths from those of cats.
In a presentation at the Hacks In Taiwan security conference in Taipei last year, researchers Cheng Shin-ming (鄭欣明) and Tseng Ming-huei (曾明慧) showed that backdoor code could fully bypass defenses by poisoning less than 0.7 percent of the data submitted to the machine-learning system.
That means not only that very few malicious samples are needed, but also that a machine-learning system can be rendered vulnerable even if it uses only a small amount of unverified open-source data.
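A quick calculation shows how small that share is in absolute terms. The corpus sizes below are hypothetical; only the 0.7 percent ceiling comes from the presentation.

```python
def poison_budget(total_samples, rate_per_mille=7):
    """Ceiling on poisoned samples, with the rate in tenths of a percent.

    Integer arithmetic avoids floating-point rounding; 7 per mille = 0.7 percent.
    """
    return total_samples * rate_per_mille // 1000

# Hypothetical training-corpus sizes.
for size in (10_000, 1_000_000, 50_000_000):
    print(f"{size:>11,} samples -> fewer than {poison_budget(size):,} poisoned")
```

For a corpus of 1 million samples, that is fewer than 7,000 tainted entries hidden among more than 993,000 clean ones.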
The industry is not blind to the problem, and this weakness is forcing cybersecurity companies to take a much broader approach to bolstering defenses.
One way to help prevent data poisoning is for scientists who develop AI models to regularly check that all the labels in their training data are accurate.
OpenAI, a research company cofounded by Elon Musk, said that when its researchers curated their data sets for a new image-generating tool, they would regularly pass the data through special filters to ensure the accuracy of each label.
That “removes the large majority of images which are falsely labeled,” a spokeswoman said.
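What such a check might look like can be sketched simply. This is not OpenAI's filter, whose details are not described here; it is one common-sense heuristic, assumed for illustration, that flags any sample whose label disagrees with most of its nearest neighbors. The data and feature values are invented.

```python
from collections import Counter
from math import dist

def flag_suspect_labels(samples, k=3):
    """Return samples whose label differs from the majority of their k nearest neighbors."""
    suspects = []
    for i, (features, label) in enumerate(samples):
        others = samples[:i] + samples[i + 1:]  # compare against everything else
        nearest = sorted(others, key=lambda s: dist(s[0], features))[:k]
        majority = Counter(lbl for _, lbl in nearest).most_common(1)[0][0]
        if majority != label:
            suspects.append((features, label))
    return suspects

# Hypothetical data set with one mislabeled entry hidden among the malware.
DATASET = [
    ((0.05, 0.30), "benign"), ((0.10, 0.35), "benign"), ((0.08, 0.25), "benign"),
    ((0.80, 0.90), "malicious"), ((0.75, 0.85), "malicious"),
    ((0.90, 0.95), "malicious"), ((0.85, 0.92), "malicious"),
    ((0.82, 0.88), "benign"),  # looks like malware, labeled as good
]

suspects = flag_suspect_labels(DATASET)
print(suspects)
```

A lone mislabeled sample stands out because everything around it carries the opposite label. A coordinated batch of poisoned samples that sit close together could vouch for one another and slip past a check like this, which is part of why filtering reduces the problem rather than eliminating it.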
To stay safe, companies need to ensure their data is clean, but that means training their systems with fewer examples than they would get with open-source offerings.
In machine learning, sample size matters.
This cat-and-mouse game between attackers and defenders has been going on for decades, with AI simply the latest tool deployed to help the good side stay ahead.
Remember: Artificial intelligence is not omnipotent. Hackers are always looking for their next exploit.
Tim Culpan is a technology columnist for Bloomberg Opinion. Based in Taipei, he writes about Asian and global businesses and trends. He previously covered the beat at Bloomberg News.