Were you to run perhaps the most famous line in literature, the opening sentence of Anna Karenina, through Google Translate from Russian to English, this is what you would get: “All happy families resemble one another, each unhappy family is unhappy in its own way.”
The translation, which approximates to the best “human” version of the sentence, looks like a triumph for what used to be called artificial intelligence and is now called, less ambitiously, machine learning. The computer can understand language, we are invited to think. Run the subsequent lines of Anna Karenina through the system, though, and the picture, along with the grammar, is not quite so clear:
“All mixed up in a house Oblonskys. Wife found out that my husband was in connection with the former in their house, a French governess, and told my husband that he could not live with him in the same house. The situation is now lasted three days and were painfully conscious of themselves and their spouses...”
It’s just about explicable, if we know the original, but barely readable. The reason for this discrepancy lies in a feature of Google’s system that allows interested users to improve translated texts where they can. Somebody has obviously got to the first line of Tolstoy’s masterpiece and put it right. What follows is more representative of what the system is capable of.
Ever since computers became a reality, the possibility of using their computational power to break down barriers of language has been something of a holy grail in machine learning. The initial, unsuccessful attempts were based on the principle that all languages could be distilled into two components: a lexicon of words with specific meanings, and a set of rules of grammar and syntax by which those words were linked together. The Cold War prompted ambitious efforts by US intelligence agencies to understand the “code” of the Russian language on an industrial scale. They produced mostly gibberish.
The first significant breakthrough in mechanized translation came in the early 1990s, when IBM produced a model that abandoned any effort to have the computer “understand” what was being fed into it. Instead, the model was supplied with parallel versions of as much translated text as possible, and the system computed the probable meanings of words and phrases based on statistical precedent.
The approach was pioneered by Frederick Jelinek at IBM, who, distrusting models that grew from analogies with human learning of grammar, insisted: “Whenever I fire a linguist, the performance of our system improves.”
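To make the statistical idea concrete, here is a minimal sketch in Python of how word-level translation probabilities might be estimated from nothing but paired sentences. It is not IBM's actual system: the three-sentence German-English corpus is invented for illustration, the function names are hypothetical, and the procedure is a heavily simplified expectation-maximisation loop in the spirit of the simplest word-alignment models from that era.

```python
from collections import defaultdict

# Invented toy corpus (an assumption for illustration, not real training data).
PARALLEL_CORPUS = [
    ("das haus", "the house"),
    ("das buch", "the book"),
    ("ein buch", "a book"),
]


def train_translation_table(pairs, iterations=10):
    """Estimate P(target word | source word) from sentence pairs alone."""
    sentence_pairs = [(src.split(), tgt.split()) for src, tgt in pairs]
    target_vocab = {word for _, tgt in sentence_pairs for word in tgt}

    # Start with no knowledge: every target word is equally likely.
    uniform = 1.0 / len(target_vocab)
    table = defaultdict(lambda: uniform)  # key: (target_word, source_word)

    for _ in range(iterations):
        count = defaultdict(float)  # expected co-translation counts
        total = defaultdict(float)  # expected counts per source word

        # Expectation step: share each target word among the source
        # words of its sentence, in proportion to the current estimates.
        for src, tgt in sentence_pairs:
            for tgt_word in tgt:
                norm = sum(table[(tgt_word, src_word)] for src_word in src)
                for src_word in src:
                    share = table[(tgt_word, src_word)] / norm
                    count[(tgt_word, src_word)] += share
                    total[src_word] += share

        # Maximisation step: renormalise the counts into probabilities.
        for (tgt_word, src_word), value in count.items():
            table[(tgt_word, src_word)] = value / total[src_word]

    return dict(table)


def best_translation(table, source_word):
    """Return the target word with the highest estimated probability."""
    candidates = {
        tgt_word: prob
        for (tgt_word, src_word), prob in table.items()
        if src_word == source_word
    }
    return max(candidates, key=candidates.get)


if __name__ == "__main__":
    table = train_translation_table(PARALLEL_CORPUS)
    for word in ("das", "haus", "buch", "ein"):
        print(word, "->", best_translation(table, word))
```

Nothing in the sketch encodes a dictionary or a rule of grammar. The pairing of "haus" with "house" emerges only because "das" is better explained by "the" elsewhere in the corpus, which is the sense in which such a system works from statistical precedent, not understanding.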
A decade or so later, though, the statistical approach was running up against severe limits, particularly when it attempted translations from languages for which there was comparatively little text to “learn” from. It was at this point that Google entered the field in earnest.
The impetus for Google’s translation machine can be traced, corporate legend has it, to a particular meeting at the company’s California headquarters in 2004. One of the search engine’s founders, Sergey Brin, had received a fan letter from a user in South Korea.
He understood that the message was in praise of the innovative scope of his company, but when Brin ran it through the machine translation service that Google had then licensed, it read: “The sliced raw fish shoes it wishes. Google green onion thing!”