Privacy concerns could derail unprecedented plan to use Facebook data to study elections

Facebook employees work to reduce the spread of misinformation that could influence elections. (Credit: NOAH BERGER/AFP/Getty Images)

Gary King benefited from perfect timing in selling Facebook on the idea of sharing a treasure trove of its data with academics. But now, the clock is working against efforts by King and others to keep the innovative project—which aims to better understand how information spread on Facebook influences elections and political institutions around the world—from falling apart. The key sticking point: protecting the privacy of Facebook users.

In March 2018, King, a quantitative social scientist at Harvard University, made a visit to Facebook’s headquarters in Menlo Park, California. The media had just broken the news that a U.K. firm, Cambridge Analytica, had been selling voter profiles to candidates based on personal information provided unwittingly by millions of Facebook users. The resulting scandal was a sobering lesson for Facebook on how not to share its data with outsiders.

King was pitching a better way for Facebook to share data. His plan was designed to meet high ethical and intellectual standards while achieving three important goals: preserving the privacy of Facebook users, protecting the company’s trade secrets on how its data were managed, and imposing no restrictions on what researchers could publish from the data.

The novel arrangement, King believes, could transform quantitative social science by providing researchers with access to truly big data rather than the surveys and small samples that have traditionally been their staple diet. It would also confront a major challenge facing the field: Private companies now possess vastly more information on how humans behave than governments do, and researchers need better access to those data.

Facebook officials listened politely to King’s pitch but made no promises. He figured he had struck out.

Then, he recalled recently, “I was in my hotel room, packing to go home, when I got an email from the people I had just been meeting with.” It posed the question: “What can we do about this?” referring to the Cambridge Analytica scandal. Company officials, who had watched in dismay as the price of Facebook shares plummeted in the wake of the revelations, were clearly worried about how the scandal might damage the company’s reputation.

A few days later King got a follow-up phone call. “Hey, could you do a study of the 2016 election and tell everybody that we didn’t change the outcome?” a Facebook official asked him. “And if we did something wrong, tell us what to do and we will do it, like, right away.” King says his first reaction was, “I guess losing $100 billion in market cap focuses the mind.”

“An important new model”

The call sent King and Nate Persily, a law professor at Stanford University in Palo Alto, California, into overdrive on their plan to stand up Social Science One, a nonprofit entity that would be the online site for researchers to access any data that Facebook released. Its first project would give researchers a look at how Facebook’s 2 billion users had shared websites discussing the 2016 U.S. presidential election, as well as democratic institutions around the world.

The data sets would contain the web addresses, or URLs, that Facebook users had publicly shared, some characteristics of those URLs, and aggregate information about the sharers, including their age, gender, location, and political leanings. It promised to be a gold mine for researchers studying under what conditions, and by whom, fake news is spread over the internet.

On 9 April 2018, Elliot Schrage, a senior Facebook executive, announced the new initiative, which he wrote would “help provide independent, credible research on the role of social media in elections.” In a blog, Schrage called it “an important new model for partnerships between industry and academia.” And although he didn’t mention Cambridge Analytica, the scandal was clearly on his mind. “The same Facebook tools that help politicians connect with their constituents … can also be misused to manipulate and deceive,” he wrote.

Foundations climb aboard

The April 2018 announcement also listed seven charities that would be funding the initiative. The consortium had been assembled by Larry Kramer, president of the William and Flora Hewlett Foundation, located just a few miles from Facebook’s headquarters in Menlo Park. The foundation had recently expanded a major democracy effort, the Madison Initiative, which had focused on studying Congress, to pay more attention to digital misinformation.

“I remember our program manager trembling with excitement” when she heard about the new partnership, Kramer recalls. “We had just identified lack of access to data as our core problem for the Madison Initiative, and then, boom, here comes this treasure trove that will let us do what we think needs to be done.”

Kramer was able to get the Alfred P. Sloan Foundation, the Laura and John Arnold Foundation, the Charles Koch Foundation, the John S. and James L. Knight Foundation, the Democracy Fund, and the Omidyar Network to sign on. All share an interest in how democracies function, he says. Their ideological diversity—Koch backs several conservative causes, whereas the Omidyar Network is avowedly liberal—was also important.

“We agreed that we needed outside funding to make this work,” Kramer recalls. “Because if it were funded by Facebook, people would distrust the results. That’s just how things are today.” The organizations agreed to provide a total of $11 million for a 1-year pilot project, to be managed by the Social Science Research Council (SSRC), a New York City–based nonprofit that would also run the grant-making process.

“This structure made sense, and the people running it were top-notch,” Kramer says. “And it got off to a great start.”

In July 2018, SSRC put out a call for proposals, and the following April awarded $50,000 grants to each of a dozen teams of scientists. (A second cohort of 13 teams has been selected but not yet announced.) The first round of projects includes studies of how activity on Facebook might have influenced civic engagement and recent elections in Taiwan, Chile, Brazil, and Germany, as well as how users respond differently to mainstream and misleading online sources of news.

Hurry up and wait

But as much as Kramer hoped the unique collaboration among Facebook, Social Science One, and the funders would flourish, he thinks it may have been a mistake to move so quickly at the outset. “This all unfolded very fast,” he says. It is now clear, he says, that everyone involved underestimated the time it would take to come up with an acceptable way to protect the privacy of Facebook users. “Almost all of the issues [around privacy] that have arisen occurred because we really didn’t have the time to cross all the Ts and dot the Is as we normally would have done,” Kramer says.

Grantees like Joshua Tucker, a professor of political science and data science at New York University in New York City, have paid a price for that haste. In January, his team reported on a study finding that older people shared seven times as much misinformation as millennials did. The results suggest digital literacy could be an important factor in how well people can determine the veracity of what they read online.

But that project relied on traditional survey research with people who had consented to share their online behavior. And Tucker wanted to go further, by linking publicly available data he had obtained from Reddit and Twitter to the nonpublic user data provided by Facebook. The Facebook data, he says, would allow the team “to test some of our hypotheses” about how news, including misinformation, is spread across different social media platforms.

The shared-links data were regarded as low-hanging fruit in terms of privacy protection, he adds, because they contained only aggregate information.

“It could tell you that males aged 25 to 35 living in New York state shared a particular link 1000 times, while North Dakota females over the age of 65 shared the data six times,” he explains about the promised data set. “But it wouldn’t contain your Facebook ID, or hashtag, followed by a bunch of things about you.”
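To make that concrete, here is a minimal Python sketch of what such an aggregate table might look like; the column names, URLs, and counts are invented for illustration and are not Social Science One’s actual schema.

```python
# Hypothetical sketch of an aggregated shared-links table: every row is a
# demographic cell with a share count, never an individual account.
import pandas as pd

shares = pd.DataFrame(
    [
        ("example.com/story-a", "25-34", "M", "NY", 1000),
        ("example.com/story-a", "65+",   "F", "ND",    6),
        ("example.com/story-b", "35-44", "F", "CA",  412),
    ],
    columns=["url", "age_bracket", "gender", "state", "share_count"],
)

# A researcher could ask which demographic cells shared a given URL most often,
# without ever seeing a Facebook ID.
print(
    shares[shares.url == "example.com/story-a"]
    .sort_values("share_count", ascending=False)
)
```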

For the moment, however, Tucker—who also leads one of four advisory committees that have helped spread the word about Social Science One—can’t get access to those data. That’s because Facebook hasn’t yet figured out how to ensure privacy before releasing the data.

The privacy challenge became clear almost immediately, King and Facebook officials say. In particular, they realized traditional techniques for ensuring privacy, based on anonymization, were no longer adequate. Computer scientists have shown they can identify individuals included in anonymized data sets by using massive computing power to mesh the masked data with other personal information that is already publicly available online.
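As a rough illustration of that risk, the toy sketch below (with invented names and attributes, not any real data set) shows how joining an “anonymized” table to public records on a handful of quasi-identifiers can re-identify a person whenever the combination is unique.

```python
# Toy example of a linkage attack: an "anonymized" table (no names or IDs)
# is joined to publicly available records on quasi-identifiers.
# All values here are invented.
import pandas as pd

anonymized = pd.DataFrame([
    {"age": 29, "gender": "F", "zip": "10001", "shared_url": "example.com/story-a"},
    {"age": 71, "gender": "M", "zip": "58501", "shared_url": "example.com/story-b"},
])

public_records = pd.DataFrame([
    {"name": "Jane Roe", "age": 29, "gender": "F", "zip": "10001"},
    {"name": "John Doe", "age": 71, "gender": "M", "zip": "58501"},
])

# If an (age, gender, zip) combination is unique in both tables, the join
# re-identifies the supposedly anonymous row.
reidentified = anonymized.merge(public_records, on=["age", "gender", "zip"])
print(reidentified[["name", "shared_url"]])
```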

Given such capabilities, privacy experts told Facebook that it had “sliced the data too thin in terms of the demographic groups and the amount of times [the web addresses were shared],” one Facebook official explains. To ensure privacy, the company would have had to add so much statistical “noise” to the data that the results would have been too distorted to be useful to researchers, the official says.

The answer, Facebook decided, was to use differential privacy. It’s a mathematical approach for adding noise that makes it impossible for an outsider to know whether an individual’s personal information is contained within a particular data set, thus ensuring their privacy. On an operational level, the Facebook official explained, it meant “we needed a new set of computer servers, with new types of security, and with differential privacy applied to the data sets.”
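For readers unfamiliar with the idea, the sketch below shows the textbook Laplace mechanism, one standard way to make a count query differentially private; the epsilon value and the counts are illustrative assumptions, not Facebook’s actual parameters or pipeline.

```python
# Minimal sketch of the Laplace mechanism for a single count query.
import numpy as np

rng = np.random.default_rng(0)

def dp_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    """Return a noisy count. Adding or removing one user changes a count by at
    most `sensitivity`, so Laplace noise with scale sensitivity/epsilon gives
    epsilon-differential privacy for this single query."""
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# e.g., how many users in one demographic cell shared a given URL
print(dp_count(true_count=1000, epsilon=0.5))  # large count: noise is relatively small
print(dp_count(true_count=6,    epsilon=0.5))  # small count: noise can swamp the signal
```

The second call hints at the trade-off the Facebook official describes: for small counts, the noise needed to guarantee privacy can overwhelm the signal researchers care about.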

But achieving that goal takes time. “Differential privacy is a bleeding-edge technology,” King says. “It’s a very important development, but it’s not like there is software available that absolutely works and that has been adapted to all statistical methods. So we realized that we had a year or more of work that we hadn’t planned on.”

“Remember, this is research,” King adds. “If it were easy, it would just be called search.”

Tucker says the collaborators knew they were entering “pretty uncharted waters” when the Facebook deal was struck. “The original plan was to let researchers work on the aggregated data and then tackle the thornier question of differential privacy later,” he says. “But that became untenable.”

The quest for differential privacy has come “with a cost of slowing down the availability of the data,” he says. But doing so “in exchange for a mathematical assurance of privacy,” he adds, is a price he’s willing to pay.

A “revolution” on hold

Late last month, the funding consortium and SSRC decided that the clock had run out. In an open letter to SSRC, the funders wrote that they “recommend pausing the grants process unless and until more data become available. … Some or all of us may be willing to consider extending or reinitiating support if new data of sufficient import and value become available.”

Simultaneously, SSRC issued a statement concurring with that recommendation and describing how it would “wind down the project by the end of 2019.” Researchers already funded would get to keep their grants, and those in the second round would be funded if they could complete the project “with the presently available data.”

Some media reports about those announcements cast the delay as another example of Facebook going back on a promise. Facebook officials reject that assessment, saying the company made clear from the outset that privacy was its highest consideration.

Tucker, Kramer, and King say they believe Facebook is doing all it can to pave the way for researchers to get access. “I don’t think they are stalling,” Tucker says. “Everybody wants this research to be done. But it’s just very complicated.”

Kramer says he’s not defending Facebook’s actions. “I don’t care if [the delay] helps or hurts them,” he says about the fate of the project. But he thinks Facebook does deserve some credit for trying.

How it all turns out could affect whether other digital giants, such as Google, also join such data-sharing efforts. “When we started,” Kramer said, “we hoped to make it happen with Facebook and then invite other social media companies sitting on similar data to join in and help us get a comprehensive view. But not one of them was interested.”

A Google official confirms that the company declined to participate when approached by Social Science One. “We decided to wait and see what happens with Facebook,” says Clement Wolf, global public policy lead in San Francisco, California, for the internet giant. “And we’re very interested in how it plays out.”

“If Facebook succeeds,” Tucker says, “it could revolutionize the types of online data researchers can get access to and the questions that people who are not employees of that platform can ask. Facebook employees can do that now, but we can’t.”

Some two dozen Facebook staffers have spent the past year chipping away at the problem and have made considerable progress. Last week, for example, Facebook made available differentially private data on about 32 million website addresses that Facebook users shared publicly more than 100 times in the past 2 years. The data include information on whether the address was reported as containing fake news, spam, or hate speech, and how many times it was shared without being clicked on.

That release bodes well for the project, says King, who sees the funding suspension as merely a bump in the road toward more collaborations between big internet companies and academics. “Data supercharges a field,” he says. “And social science has way more data than ever before. But most of the data are inside companies, and they use it for their own purposes. So we, as scientists, have no choice but to make some sort of agreement with private industry.”

source: sciencemag.org