How do 87 million records scraped from Facebook become an advertising campaign that could help swing an election? What does gathering that much data actually involve? And what does that data tell us about ourselves?
The Cambridge Analytica scandal has raised question after question, but for many, the technological unique selling point of the company, which last week announced that it was closing its operations, remains a mystery.
For those 87 million people probably wondering what was actually done with their data, I went back to Christopher Wylie, the ex-Cambridge Analytica employee who blew the whistle on the company’s problematic operations in the Observer.
Illustration: Mountain people
According to Wylie, all you need to know is a little bit about data science, a little bit about bored rich women and a little bit about human psychology.
Step one, he said over the phone as he scrambled to catch a train: “When you’re building an algorithm, you first need to create a training set.”
That is: No matter what you want to use fancy data science to discover, you first need to gather it the old-fashioned way. Before you can use Facebook likes to predict a person’s psychological profile, you need to get a few hundred thousand people to do a 120-question personality quiz.
The “training set” refers to that data in its entirety: the Facebook likes, the personality tests and everything else you want to learn from.
Most important, it needs to contain your “feature set,” which is the “underlying data that you want to make predictions on,” Wylie said. “In this case, it’s Facebook data, but it could be, for example, text, like natural language, or it could be clickstream data,” the complete record of your browsing activity on the Web.
“Those are all the features that you want to [use to] predict,” he added.
At the other end, you need your “target variables” — in Wylie’s words: “The things that you’re trying to predict for. So in this case, personality traits or political orientation, or what have you.”
If you are trying to use one thing to predict another, it helps if you can look at both at the same time.
“If you want to know the relationships between Facebook likes in your feature set and personality traits as your target variables, you need to see both,” Wylie said.
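As a rough illustration of the terms Wylie uses, a training set can be pictured as a list of survey respondents, each carrying both a feature set (their likes) and target variables (their survey scores). The sketch below is only that: the `Respondent` type, the page names and the scores are invented for the example and do not come from Cambridge Analytica's data.

```python
from dataclasses import dataclass


@dataclass
class Respondent:
    """One survey taker (hypothetical structure, for illustration only)."""
    likes: set    # feature set: pages this person has liked
    traits: dict  # target variables: scores from the personality survey


def split_training_set(respondents):
    """Separate the features from the targets, keeping them aligned by index."""
    features = [r.likes for r in respondents]
    targets = [r.traits for r in respondents]
    return features, targets


# Invented example data: two respondents with made-up likes and scores.
training_set = [
    Respondent({"jazz", "hiking"}, {"openness": 0.8, "neuroticism": 0.3}),
    Respondent({"football"}, {"openness": 0.4, "neuroticism": 0.6}),
]

features, targets = split_training_set(training_set)
```

Because both lists share the same order, the likes at position 0 belong to the same person as the scores at position 0, which is what allows a model to learn the relationship between the two.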
THE ‘OCEAN’ MODEL
Facebook data, which lies at the heart of the Cambridge Analytica story, is a fairly plentiful resource in the data science world, and certainly was back in 2014, when Wylie first started working in this area.
Personality traits are much harder to get hold of: Despite what the proliferation of BuzzFeed quizzes might suggest, it takes quite a lot to persuade someone to fill in a 120-question survey — the length of the short version of one of the standard psychological surveys, the International Personality Item Pool-NEO.
However, Wylie said that “quite a lot” is relative.
“For some people, the incentive to take a survey is financial. If you’re a student or looking for work, or just want to make US$5, that’s an incentive,” he said.
The actual money handed over “ranged from US$2 to US$4,” with the larger payments going to “groups that were harder to get,” he added.
The group least likely to take a survey, and so earning the most from it, were African-American men.
“Other people take surveys just because they find it interesting, or they are bored. So we over-sampled wealthy white women, because if you live in the Hamptons and have nothing to do in the afternoon, you fill out consumer research surveys,” Wylie said.
The personality surveys use those 120 questions to profile people along five discrete axes — the “five factors” model, popularly called the “OCEAN” model after one common breakdown of the factors: openness to experience, conscientiousness, extraversion, agreeableness and neuroticism.
That model clusters personality traits into distinctions that seem to hold across cultures and across time. For instance, those who describe themselves as “loud” are likely to also describe themselves as “gregarious.” If they agree with that description this year, they are likely to agree with it next year.
That cluster is likely to show up in responses in every language. And if a person responds to it negatively, there are likely to be real, noticeable differences between them and people who answer it positively.
Those features of the model are what make it actually useful for profiling individuals, Wylie said — in contrast to some other popular psychological profiles, such as the Myers-Briggs system.
In the testing phase of the research, Facebook was barely involved. The surveys were offered on commercial data research sites — first Amazon’s Mechanical Turk platform, then a specialist operator called Qualtrics.
The switch was made because Amazon has the issue that “people are overfamiliar with filling out surveys” — so much so that it starts to affect results, Wylie said.
FACEBOOK INFILTRATION
It was only at the very end that Facebook came into play. To be paid for their survey, users were required to log in to the site and approve access to the survey app developed by Aleksandr Kogan, the Cambridge University academic whose research into personality profiling using Facebook likes gave the Robert Mercer-funded Cambridge Analytica a quick way into the field.
Kogan has maintained that Cambridge Analytica assured him that it was using data appropriately, saying that he has been “used as a scapegoat by both Facebook and Cambridge Analytica.”
To a survey user, the process was quick: “You click the app, you go on and then it gives you the payment code,” Wylie said.
However, two very important things happened in those few seconds. First, the app harvested as much data as it could about the user who just logged on. Where the psychological profile is the target variable, the Facebook data is the “feature set”: the information a data scientist has on everyone else, which they need to use in order to accurately predict the features they really want to know.
It also provided personally identifiable information, such as real name, location and contact details — something that was not discoverable through the survey sites themselves.
“That meant you could take the inventory and relate it to a natural person [who is] matchable to the electoral register,” he said.
Second, the app did the same thing for all the friends of the user who installed it. Suddenly the hundreds of thousands of people who you have paid a couple of dollars to fill out a survey, whose personalities are a mystery, become millions of people whose Facebook profiles are an open book.
TRIAL AND ERROR
That is where the final transformation comes in. How do you turn a few hundred thousand personality profiles into a few million? With a lot of computing power and a massive matrix of possibilities.
“Even though your sample size is 300,000 people, give or take, your feature set is like 100 million across,” Wylie said.
Every single Facebook “like” found in the data set becomes its own column in this enormous matrix.
“Even if there is only one instance in the entire set, it’s still a feature,” he said.
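One way to picture a matrix in which every like is its own column is sketched below. It assumes a simple sparse layout, where each person's row stores only the columns they have liked; that layout, the function name and the sample likes are all assumptions made for illustration, not a description of the system Wylie worked on.

```python
def build_like_matrix(like_sets):
    """Give every distinct like its own column index.

    Returns (columns, rows): `columns` maps like -> column index, and each
    entry of `rows` is the set of column indices switched on for one person.
    """
    columns = {}
    rows = []
    for likes in like_sets:
        row = set()
        for like in sorted(likes):  # sorted so column numbering is repeatable
            index = columns.setdefault(like, len(columns))
            row.add(index)
        rows.append(row)
    return columns, rows


# Invented example: "football" appears only once, yet still gets a column.
like_sets = [{"jazz", "hiking"}, {"football"}, {"jazz"}]
columns, rows = build_like_matrix(like_sets)
```

Storing only the filled-in cells is what makes such a wide table workable: a row records a handful of column numbers rather than 100 million mostly empty entries.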
“All that data was then put into an ensemble model,” Wylie said. “This is when you use different families or approaches of machine learning, because each of them will have their own strengths and weaknesses... And then they sort of vote, and then you amalgamate the results and come up with a conclusion.”
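The voting Wylie describes can be sketched with a simple majority vote. The three "models" below are toy rules invented for this example; a real ensemble would combine trained models, and for numeric scores it might average the outputs rather than count votes.

```python
from collections import Counter


def majority_vote(models, likes):
    """Ask every model for a label and return the most common answer."""
    votes = [model(likes) for model in models]
    label, _count = Counter(votes).most_common(1)[0]
    return label


# Three toy stand-ins for different machine learning approaches (hypothetical).
def model_a(likes):
    return "high" if "jazz" in likes else "low"


def model_b(likes):
    return "high" if len(likes) >= 2 else "low"


def model_c(likes):
    return "high" if "football" in likes else "low"


openness_label = majority_vote([model_a, model_b, model_c], {"jazz", "poetry"})
```

Here two of the three rules answer "high" and one answers "low," so the ensemble returns "high."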
This is where data science becomes more of a data art: The exact input of each approach to the overall model is not set in stone, and there is no right way to do it. In the academic world, it is sometimes called “training by grad student” — the point where the only thing to do is move forward through laborious trial and error.
Still, it worked well enough, and in the end, “we built 253 algorithms, which meant there were 253 predictions per profiled record,” Wylie said.
The goal was achieved: A model that could effectively take the Facebook likes of its subjects and work backward, filling in the rest of the columns in the spreadsheet to arrive at guesses as to their personalities, political affiliations and more.
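The idea of working backward from likes to a guess can be shown with a much simpler technique than the one Wylie describes. The sketch below uses a nearest-neighbor lookup, which is a stand-in chosen for brevity and not Cambridge Analytica's method: it copies the score of the surveyed person whose likes overlap most with the unsurveyed one. All names and numbers are invented.

```python
def jaccard(a, b):
    """Overlap between two sets of likes, from 0.0 (none) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def predict_trait(likes, labelled):
    """Guess a trait score by borrowing it from the most similar respondent.

    `labelled` is a list of (likes, score) pairs from people who took the survey.
    """
    _nearest_likes, nearest_score = max(
        labelled, key=lambda pair: jaccard(likes, pair[0])
    )
    return nearest_score


# Invented survey results: (likes, openness score).
labelled = [
    ({"jazz", "poetry", "hiking"}, 0.9),
    ({"football", "trucks"}, 0.3),
]

# A friend who never took the survey, known only by their likes.
friend_likes = {"jazz", "hiking", "cooking"}
predicted_openness = predict_trait(friend_likes, labelled)
```

The friend shares two likes with the first respondent and none with the second, so the guess is borrowed from the first.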
By the end of August 2014, Wylie had the first successful outputs: 2.1 million profiled records, from 11 target US states, the plan being that they would be used to communicate and refine messages in Mercer and Steve Bannon-backed US Republican campaigns leading up to the 2016 primaries.
Wylie left before these.
“What that number represents is people who not only have their Facebook data, voter data and consumer data [which was all matched up], but also had an additional 253 predictions or scores that were then appended to their profile,” he said.
‘SECRET SAUCE’
Those 253 predictions were the “secret sauce” that Cambridge Analytica claimed it could offer its customers. Using Facebook, advertisers are limited to broad demographic strokes and a few narrower algorithmically determined categories — whether you like jazz music, or what your favorite sports team is.
However, with 253 further predictions, Cambridge Analytica could craft adverts no one else could: a neurotic, extroverted and agreeable Democrat could be targeted with a radically different message than an emotionally stable, introverted, intellectual one, each designed to suppress their voting intention — even if the same messages, swapped around, would have the opposite effect.
Wylie cited the anodyne political statement that a candidate is in favor of jobs.
“Jobs in the economy is a good example, because it’s a meaningless message. Everyone’s pro-jobs in the economy. So in that sense, using just the message of ‘I am in favor of jobs in the economy,’ or ‘I have a plan to fix jobs in the economy,’ you cannot differentiate yourself from your opponent,” he said.
“But one of the things that we found was that actually when you unpack what is a job for different people, different people engage with constructs with different motivations and value sets that are interrelated with their dispositions,” he added.
What that means in practice is that the same blandishment can be dressed up in different language for different personalities, creating the impression of a candidate who connects with voters on an emotional level.
“If you’re talking to a conscientious person [one who ranks highly on the ‘C’ part of the OCEAN model], you talk about the opportunity to succeed and the responsibility that a job gives you. If it’s an open person, you talk about the opportunity to grow as a person. Talk to a neurotic person, and you emphasize the security that it gives to my family,” Wylie said.
Thanks to the networked nature of modern campaigning, in theory all these messages can be delivered simultaneously to different groups.
Toward the end of the campaign, once the messaging has settled in, they can even be automated, Mad Libs-style, with an algorithm churning through a thesaurus to find the perfect combination of words to win over different subgroups.
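A Mad Libs-style approach might look something like the sketch below, which slots a different phrase into the same sentence depending on a person's strongest trait. The framings paraphrase Wylie's examples; the template wording, the function name and the scores are assumptions made up for illustration.

```python
# Framings paraphrased from Wylie's examples, keyed by personality trait.
FRAMINGS = {
    "conscientiousness": "the opportunity to succeed and the responsibility a job gives you",
    "openness": "the opportunity to grow as a person",
    "neuroticism": "the security a job gives your family",
}

# Hypothetical template: the same bland message with one swappable slot.
TEMPLATE = "My jobs plan is about {framing}."


def tailor_message(scores):
    """Fill the template with the framing for the person's highest-scoring trait."""
    trait = max(FRAMINGS, key=lambda name: scores.get(name, 0.0))
    return TEMPLATE.format(framing=FRAMINGS[trait])


# Invented scores for one profiled record.
message = tailor_message({"openness": 0.9, "neuroticism": 0.2})
```

The underlying claim about jobs never changes; only the wording in the slot does.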
‘PLAIN WHITE TOAST’
Of course, it is not all blandishments. One message used to boost right-wing turnout attacked same-sex marriage.
“It’s funny, because this is so offensive and implicitly homophobic, but it’s a team of gays that created it,” Wylie said. “It was targeting conscientious people. It was a picture of a dictionary and it said: ‘Look up marriage and get back to me.’ For someone who is conscientious, it is a compelling message: a dictionary is a source of order, and a conscientious person is more deferential to structure.”
At a certain point, psychometric targeting moves into the realm of dog-whistle campaigning. Images of walls proved to be really effective in campaigning around immigration, for instance.
“Conscientious people like structure, so for them, a solution to immigration should be orderly, and a wall embodied that. You can create messaging that doesn’t make sense to some people, but makes so much sense to other people. If you show that image, some people wouldn’t get that that’s about immigration, and others immediately would get that,” Wylie said.
The actual issues are simply the “plain white toast” of politics, waiting for the actual flavor to be loaded on, he said.
“No one wants plain white toast,” Wylie said.
The job of the data is to “learn the particular flavor or spice” that will make that toast appealing, he said.
While this was undoubtedly a highly sophisticated targeting machine, questions remain about Cambridge Analytica’s psychometric model, ones that Wylie is perhaps not best placed to answer.
When Kogan gave evidence to the British parliament last month, he suggested that it was barely better than chance at applying the right OCEAN scores to individuals.
Maybe that edge is enough to matter — or maybe Cambridge Analytica was selling snake oil. And even if individuals were correctly labeled with the five factors, is advertising to them based on that really as simple as slightly hokey-sounding appeals to love of order, or fear of the other?
That said, there is clearly something in it. Take a look instead at a patent filed in 2012 on “determining user personality characteristics from social networking system communications.”
“Stored personality characteristics may be used as targeting criteria for advertisers ... to increase the likelihood that the user positively interacts with a selected advertisement,” the patent suggests.
Its author? Facebook itself.