Like millions worldwide, Southeast Asians have been trying out large language models such as Meta’s Llama 2 and Mistral AI — but in their native Bahasa Indonesia or Thai. The result has usually been gibberish in English.
This leaves them at a disadvantage, tech experts warn, as generative artificial intelligence (AI) transforms education, work and governance worldwide.
A Singapore government-led initiative aims to correct the imbalance with a Southeast Asian large language model (LLM), the first in a family of models named SEA-LION — Southeast Asian Languages in One Network — trained in the region’s languages and cultural norms.
Trained on data in 11 Southeast Asian languages including Vietnamese, Thai and Bahasa Indonesia, the open-sourced model is a cheaper and more efficient option for the region’s businesses, governments and academia, AI Singapore’s senior director for AI products Leslie Teo said.
“Do we want to force every person in Southeast Asia to adapt to the machine, or do we want to make it more accessible so people in the region can make full use of the technology without having to be an English speaker?” Teo said.
“We are not trying to compete with the big LLMs. We are trying to complement them, so there can be better representation of us,” he said.
There are more than 7,000 languages spoken worldwide. Yet LLMs including OpenAI’s GPT-4 and Meta’s Llama 2, used to build AI systems such as chatbots and other tools, have largely been developed for and trained on the English language.
Governments and tech firms are trying to bridge this gap, with India creating datasets in local languages, an LLM in the United Arab Emirates powering generative AI tools in Arabic, and AI models in China, Japan and Vietnam in local languages.
These models could help local populations participate more equitably in the global AI economy that is largely dominated by big tech firms, said Nuurrianti Jalli, an assistant professor at Oklahoma State University’s school of communications.
“Regional LLMs are also needed because they support technology self-reliance,” she said. “Less reliance on Western LLMs could provide better privacy for local populations, and also align better with national or regional interest.”
Multilingual language models, which are trained on text from several languages at once, can infer semantic and grammatical connections between high-resource languages that have more data and low-resource languages, researchers say.
These models can be used in a variety of applications, from translation to customer-service chatbots to content moderation on social media platforms that have struggled to identify hate speech in low-resource languages such as Burmese or Amharic.
About 13 percent of SEA-LION’s data is sourced from Southeast Asian languages, more than any other major LLM, Teo said, adding that more than 9 percent of its data is from Chinese text and about 63 percent from English.
Multilingual language models often train on translated text and other poor-quality data that might have errors, so AI Singapore is “careful” about the data used in training SEA-LION, Teo said in his office at the National University of Singapore.
“The age of pristine data has passed — a lot of the stuff on the Internet now is LLM-generated material, so we need to verify and filter,” he said.
“We cannot be perfect, but we also cannot take out everything we consider to be bad,” he added.
More governments are contributing data, and businesses are testing SEA-LION, which can be deployed faster and is cheaper to fine-tune and adopt due to its smaller size, he said.
At Indonesian e-commerce company Tokopedia, a majority of customer interactions are in Bahasa Indonesia, so models “with that local fluency will enhance our ability to connect with customers and improve their experiences,” Tokopedia associate vice president of data science Paul Condylis said.
As more countries and regions build their own LLMs, digital and human rights experts fret that they would only reproduce dominant views expressed online, which could be particularly problematic in nations with authoritarian governments or strict media censorship, or those lacking a strong civil society.
Chinese social media platforms, for example, censor references to the Tiananmen Square uprising and criticism of the government, while several Southeast Asian nations have enacted laws to curb content that authorities deem misleading.
“Training models on such data risks perpetuating biased, prejudiced, incomplete and even misleading narratives,” Jalli said.
“The models may fail to surface important socio-political issues like human rights abuse, corruption, or valid criticism of political powers,” she said.
For example, in response to a query on Indonesian former president Suharto, Llama 2 and GPT-4 mentioned his spotty human rights record, while SEA-LION’s response focused largely on his achievements.
If a model is only trained on favorable articles about a government, then the model is “likely to adopt a worldview where the government is wholly positive and leave out dissenting viewpoints,” said Aliya Bhatia, a policy analyst at the Center for Democracy & Technology, a US non-profit.
“Regional LLMs might better reflect the linguistic and cultural nuances of local language speakers, but they might also have less information about the world in general,” she added.
“There is a real risk of government-backed models instilling a revisionist view of history and undermining democratic values.”
However, the alternative — relying entirely on Western LLMs with “disproportionately large influences” from wealthy, liberal, western democracies — means perpetuating different biases related to cultural values, political beliefs and social norms, AI Singapore said.
“These LLMs have a very particular West Coast American bias — they are very woke. They do not represent us,” said Teo.
“We are not saying ours is the only perspective — we are just trying to rebalance it.”