What it means to build local AI language models - Taipei Times

Sun, Jun 29, 2025 page7

What it means to build local AI language models
By Elina Noor
Following OpenAI’s public launch of ChatGPT in November 2022, the underpinnings of artificial intelligence (AI) large language models (LLMs) seemed firmly “WIRED”: Western, industrialized, rich, educated and democratic. Everyone assumed that if LLMs spoke a particular language and reflected a particular worldview, it would be a Western one. OpenAI even acknowledged ChatGPT’s skew toward Western views and the English language.
However, even before OpenAI’s US competitors (Google and Anthropic) released their own LLMs the following year, Southeast Asian developers had recognized the need for AI tools that would speak to their own region in its many languages — no small task, given that more than 1,200 languages are spoken there.
Moreover, in a region where distant civilizational memories often collide with contemporary, post-colonial histories, language is profoundly political. Even seemingly monolingual countries belie marked diversity: Cambodians speak about 30 languages; Thais, about 70; and Vietnamese, more than 100. This is also a region where communities mix languages seamlessly, where nonverbal cues speak volumes, and where oral traditions are sometimes more prevalent than textual means of capturing the deep cultural and historical nuances that have been encoded in language.
Not surprisingly, those trying to build truly local AI models for a region with so many under-represented languages have faced many obstacles, from a paucity of high-quality, high-quantity annotated data to a lack of access to the computing power needed to build and train models from scratch. In some cases, the challenges are even more basic, reflecting a shortage of native speakers and standardized orthography or frequent electricity supply disruptions.
Given these constraints, many of the region’s AI developers have settled for fine-tuning established models built by foreign incumbents. This involves taking a pretrained model that has been fed large quantities of data and training it on a smaller dataset for a specific skill or task. Between 2020 and 2023, Southeast Asian language models such as PhoBERT (Vietnamese), IndoBERT (Indonesian), and Typhoon (Thai) were derived from much larger models such as Google’s BERT, Meta’s RoBERTa (later LLaMA), and France’s Mistral. Even the early versions of SeaLLM, a suite of models optimized for regional languages and released by Alibaba’s DAMO Academy, were built on architectures from Meta, Mistral and Google.
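For readers who want to see the idea of fine-tuning in miniature, the following is a toy sketch only, not how any of the models named above were built. It assumes a two-parameter straight-line model in place of a language model, and the datasets, learning rate and epoch counts are invented for illustration. It "pretrains" on a larger generic dataset, then continues training briefly on a small task dataset, and compares that with training on the small dataset from scratch.

```python
def train(weights, data, lr, epochs):
    """Gradient descent on squared error for the line y = w0 + w1 * x.

    The constant factor in the gradient is folded into the learning rate.
    """
    w0, w1 = weights
    n = len(data)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in data:
            err = (w0 + w1 * x) - y
            g0 += err
            g1 += err * x
        w0 -= lr * g0 / n
        w1 -= lr * g1 / n
    return (w0, w1)


def mse(weights, data):
    """Mean squared error of the line on a dataset."""
    w0, w1 = weights
    return sum(((w0 + w1 * x) - y) ** 2 for x, y in data) / len(data)


# Hypothetical "generic" data: many points from y = 2x + 1.
generic = [(i / 100, 2 * (i / 100) + 1.0) for i in range(-100, 101)]

# Hypothetical "task" data: few points from a related line, y = 2x + 1.5.
task = [(i / 10, 2 * (i / 10) + 1.5) for i in range(-9, 10, 2)]

# Pretraining: a long run on the large generic dataset.
pretrained = train((0.0, 0.0), generic, lr=0.1, epochs=500)

# Fine-tuning: a short run on the small dataset, starting from pretrained weights.
fine_tuned = train(pretrained, task, lr=0.1, epochs=20)

# Baseline: the same short run on the small dataset, starting from zero.
from_scratch = train((0.0, 0.0), task, lr=0.1, epochs=20)

print("pretrained error on task:  ", mse(pretrained, task))
print("fine-tuned error on task:  ", mse(fine_tuned, task))
print("from-scratch error on task:", mse(from_scratch, task))
```

In this toy setting, the fine-tuned line fits the small dataset better than either the untouched pretrained line or the line trained from scratch with the same small budget. The same mechanism means that whatever the pretrained starting point has already learned carries over into the fine-tuned result.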

However, last year Alibaba Cloud’s Qwen disrupted this Western dominance, offering Southeast Asia a wider set of options. A Carnegie Endowment for International Peace study found that five of the 21 regional models launched that year were built on Qwen.
Still, just as Southeast Asian developers previously had to account for a latent Western bias in the available foundation models, now they must be mindful of the ideologically filtered perspectives embedded in pretrained Chinese models. Ironically, efforts to localize AI and ensure greater agency for Southeast Asian communities could deepen developers’ dependence on much larger players, at least in the initial stages.
Nonetheless, Southeast Asian developers have begun to address this problem, too. Multiple models, including SEA-LION (which covers 11 official regional languages), PhoGPT (Vietnamese), and MaLLaM (Malay), have been pretrained from scratch on a large, generic dataset of each particular language. This key step in the machine-learning process allows these models to be further fine-tuned for specific tasks.
Although SEA-LION continues to rely on Google’s architecture for its pretraining, its use of a regional language dataset has facilitated the development of homegrown models such as Sahabat-AI, which communicates in Indonesian, Sundanese, Javanese, Balinese and Bataknese. Sahabat-AI proudly describes itself as “a testament to Indonesia’s commitment to AI sovereignty.”
However, representing native perspectives also requires a strong base of local knowledge. We cannot faithfully present Southeast Asian perspectives and values without understanding the politics of language, traditional sense-making and historical dynamics.
For example, time and space — widely understood in the modern context to be linear, divisible, and measurable for the purposes of maximizing productivity — are perceived differently in many indigenous communities. Balinese historical writings that defy conventional patterns of chronology might be viewed as myths or legends in Western terms, but they continue to shape how these communities make sense of the world.
Historians of the region have cautioned that applying a Western lens to local texts heightens the risk of misinterpreting indigenous perspectives. From the 18th to the 19th centuries, Indonesia’s colonial administrators frequently read their own understanding of Javanese chronicles into translated reproductions. As a result, many biased British and European observations of Southeast Asians have come to be treated as valid historical accounts, and ethnic categorizations and stereotypes from official documents have been internalized. If AIs are trained on these data, the biases could end up further entrenched.
Data are not knowledge. Since language is inherently social and political — reflecting the relational experiences of those who use it — asserting agency in the age of AI must go beyond the technical sufficiency of models that communicate in local languages. It requires consciously filtering legacy biases, questioning assumptions about identity and rediscovering indigenous knowledge repositories in languages. We cannot project our cultures faithfully through technology if we barely understand them in the first place.
Elina Noor is a senior fellow in the Asia Program at the Carnegie Endowment for International Peace.
Copyright: Project Syndicate