The technical foundation of ChatGPT is a large language model (LLM), which at its heart is a “next word” prediction engine: Given a preceding word sequence, it constructs a probability distribution over the word that comes next, based on a training text corpus.
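As a rough illustration of this idea, the sketch below queries a model for exactly that distribution. It assumes the open-source Hugging Face transformers library and the publicly released GPT-2 model, which serve here only as a small stand-in; ChatGPT's own model is proprietary and far larger.

```python
# A minimal sketch of next-word prediction, assuming the open-source
# Hugging Face "transformers" library and the publicly released GPT-2
# model (a small stand-in, not ChatGPT's proprietary model).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

context = "The capital of France is"
input_ids = tokenizer(context, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits  # shape: (1, sequence_length, vocab_size)

# The distribution over the next token is the softmax at the last position.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)

# Show the five most probable continuations.
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(token_id)!r}: {prob.item():.3f}")
```

Generating text is then simply a matter of repeatedly picking a word from this distribution and appending it to the context.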
The corpus used to train such a large-scale language model typically consists of documents collected from the cyber and physical worlds, including Web pages, books, periodicals, one-off publications, e-mails and instant messages.
During training, each document in the corpus is scanned, word by word, from the beginning to the end.
When word X is scanned, the words preceding X serve as the contextual sequence, X is treated as the prediction target, and a training data pair of the form “contextual text sequence, prediction target” is recorded.
As the scan proceeds, myriad such pairs are formed. From these pairs, a neural network learns a mathematical model of the correlation between contextual text sequences and the words that immediately follow them. The beauty of this approach to building language models is that its training data requires no manual labeling.
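The following toy sketch shows how such pairs fall out of a plain left-to-right scan. It uses whitespace-separated words and a made-up helper name for simplicity; production systems operate on sub-word tokens and bounded context windows.

```python
# A toy sketch of how "contextual sequence -> prediction target" training
# pairs fall out of a plain left-to-right scan over a document. Real
# systems use sub-word tokens and bounded context windows rather than
# whitespace-separated words; the helper name is hypothetical.
def make_training_pairs(document: str):
    words = document.split()
    pairs = []
    for i in range(1, len(words)):
        context = words[:i]   # everything scanned so far
        target = words[i]     # the word to be predicted
        pairs.append((context, target))
    return pairs

for context, target in make_training_pairs("the cat sat on the mat"):
    print(context, "->", target)
# ['the'] -> cat
# ['the', 'cat'] -> sat
# ['the', 'cat', 'sat'] -> on
# ... and so on to the end of the document
```

Every pair's input and answer come straight from the document itself, which is what makes manual labeling unnecessary.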
Although ChatGPT is based on word-by-word prediction, the quality of its responses to user prompts is surprisingly high: most of the sentences it produces are grammatically correct, semantically relevant and structurally fluent, and sometimes even feature fresh (though not necessarily correct) new ideas.
As far as reading comprehension is concerned, ChatGPT seems able to extract the key ideas from an individual article, to compare and contrast the ideas presented across multiple articles, and even to synthesize novel ideas for situations that are similar, but not identical, to those covered in the training corpus.
Precisely because ChatGPT generates each word of its response to a user prompt by consulting the “text sequence to word” prediction model, the narratives in the response can sometimes contain factual errors or even be completely fabricated.
For example, asked about The Eagles’ works, ChatGPT might quote a fragment of the lyrics of their famous song Hotel California, and the actual quotation included in the response might turn out to be ChatGPT’s own fabrication.
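The generation loop behind this behavior can be sketched as follows, again assuming the public GPT-2 model as a stand-in for ChatGPT's own. Each step samples the next token from the learned distribution; nothing in the loop verifies facts, so a statistically plausible but invented “lyric” is a natural outcome.

```python
# A minimal sketch of the word-by-word generation loop, using the public
# GPT-2 model as a stand-in for ChatGPT's own. Each step samples the next
# token from the predicted distribution; nothing here checks facts, so a
# plausible-sounding invented "lyric" is a natural outcome.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = 'Quote a line from the song "Hotel California":'
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

for _ in range(20):  # extend the text by 20 tokens
    with torch.no_grad():
        logits = model(input_ids).logits
    probs = torch.softmax(logits[0, -1], dim=-1)
    next_id = torch.multinomial(probs, num_samples=1)  # sample, not verify
    input_ids = torch.cat([input_ids, next_id.unsqueeze(0)], dim=1)

print(tokenizer.decode(input_ids[0]))
```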
Nevertheless, it is still quite remarkable that simply by extracting and applying the co-occurrence relationships between text sequences and words in the training corpus, ChatGPT is able to respond to a wide variety of user prompts with often jaw-dropping quality.
This seems to validate the famous saying by linguist John Firth: “You shall know a word by the company it keeps.”
ChatGPT’s underlying language model is closely tied to its training text corpus. That is why, when the question “What is the relationship between China and Taiwan?” is put to ChatGPT in traditional and then in simplified Chinese characters, the answers it produces are diametrically opposed.
This suggests that if future Taiwanese ChatGPT-based applications need a Chinese LLM, they cannot, for ideological and national security reasons, depend on one developed by China.
Nor is relying on OpenAI a complete answer: the material OpenAI uses to train its Chinese LLM might not be sufficiently comprehensive or frequently enough refreshed. For example, if one intends to use ChatGPT to create scripts for Taiwanese TV dramas, its underlying LLM must be augmented with additional training on Taiwanese dialogue data.
Similarly, if one wants to apply ChatGPT to analyzing Taiwanese court judgments to automatically identify abnormal or inconsistent ones, the underlying LLM must be further trained on a corpus of past court judgments. These cases suggest that Taiwan should own its LLM to guarantee that it is fully localized and always kept up to date.
It is expected that ChatGPT-based applications will pop up all over the place in Taiwan. If they are all built on OpenAI’s ChatGPT, the economic cost associated with the application programming interface calls to OpenAI is going to be enormous, especially when accumulated over multiple decades.
If the government develops its own Chinese LLM based on text materials of Taiwanese origin and makes it available to domestic artificial intelligence (AI) text application developers, this infrastructural investment would form the backbone of, and make a gargantuan contribution to, the effective development of Taiwan’s digital industry in the coming decades.
Chiueh Tzi-cker is a joint appointment professor in the Institute of Information Security at National Tsing Hua University.