‘Numbers you can tell stories with’: a decade of Guardian data journalism

“It’s a new way of doing journalism which, like punk, anyone can do”. So said the Guardian’s then data editor, Simon Rogers, in a TEDx talk entitled “Data journalists are the new punk rockers” in 2012.

As far as career-changing sentiments go, it’s admittedly a strange one. But, looking back now in my role as the Guardian’s acting data projects editor, the idea that, with the right tools, anyone could be a data journalist was a real eye-opener.

Numbers weren’t attractive in the slightest. I had spent my entire school life dreading maths class and clinging on to the knowledge that my relative strength in English subjects would go some way to countering my low maths grades.

Pamela Duncan: ‘You don’t have to be a mathematician: you just have to be a storyteller.’ Photograph: Graeme Robertson/The Guardian

Yet here was the Guardian’s data editor, a journalist who worked with numbers every day of his life, saying maths made him anxious too, and that data was really just a way of understanding and interpreting the events of the day.

It made me think there was, indeed, a different way of doing journalism. And you didn’t have to be a mathematician: you just had to be a storyteller.

By the time the talk went out, in December 2012, the Guardian Datablog was already an established part of the Guardian’s news landscape. But it hadn’t always been that way.

Prior to 2009, we and other publications had published “data journalism” before such a term existed. In fact, a data story – a leak of statistics relating to deprivation in Manchester – featured in the pages of the first ever Guardian newspaper 200 years ago. Computer-assisted reporting emerged in the US in the 1960s and later played a pivotal role in the Boston Globe’s Spotlight team’s investigation of child sexual abuse. A still-active bunch of journalists, the “Nordic pioneers”, have been digging into data since the 1990s.

But data was far from a core competency: newsrooms were predominantly staffed by people like me, who at best were numbed by numbers and at worst declared themselves allergic to them.

Talking from his home in San Francisco, Rogers – appropriately framed by the albums which decorate his walls – told me how, having joined the newsdesk the day before 9/11, he was thrust into a hugely important news story told not just through words, but also graphics. He later became a news editor in that department, working with the paper’s graphics team to tell visual stories.

As part of this role, he began “collecting data” (a concept I am now well familiar with: half the battle is knowing where to look for the relevant information). The Datablog began its life, then, as a repository for interesting datasets and, as time passed, a home for journalism built upon analysis of that data.

It attracted a “nascent data journalism community” including other journalists, developers, data scientists, advocates of open data and others who wanted not simply to consume stories, but to engage with the underlying facts. A Flickr group was created where this community discussed potential leads and built their own homemade graphics from the published data.

But although Rogers recalls senior editors of the time being supportive of the idea, the data team was not always pushing an open door.

In the blog’s early days he persuaded the newsdesk to publish a story centred on “extremely granular” government spending data: “I can remember being in the news meeting where the editor in charge said, ‘so we’ve got that “data” story’ – like it was a slight joke. But they still devoted a lot of space to it. The corner was being turned”.

Simon Rogers: ‘All of a sudden you had this story that was super data-based … and the data complemented the reporting.’ Photograph: David Levene/The Guardian

Then, in 2010, WikiLeaks published the Afghanistan war logs. “All of a sudden you had this story that was super data-based … and the data complemented the on-the-ground reporting. That felt like the first time where we started to really integrate into the journalism process.”

When the London riots occurred in 2011, the data team found itself in a position to treat the story in a way none of its competitors could, scrabbling together a team of journalists and journalism interns to extract the related data manually from PDFs containing every single crime recorded on those days, as part of the Reading the Riots project.

“The London riots was a big deal. We started collecting data on everybody who was up on trial for involvement in the riots. We wanted to look at things like: where were people from? What were their backgrounds? Were they from the poorest parts of the city? And we really had to fight hard to get that data out of the Ministry of Justice.”

When Rogers pitched the project to the news editor, the editor immediately saw the potential and requested that they make the story the Saturday splash: “On Monday we had no records in the database; by Friday we had this database we could tell stories with.” This, to Rogers’s mind, was evidence that data had indeed been integrated into the newsroom.

With nine years and four London-based editors (James Ball, Alberto Nardelli, Helena Bengtsson and Caelainn Barr) between Rogers’s tenure and mine, and a slew of talented journalists (Mona Chalabi, Amy Sedghi, John Burn-Murdoch, George Arnett, Niamh McIntyre, Tobi Thomas), data remains integral to the Guardian, although the offering has changed.

The Datablog as a repository for datasets no longer exists: we simply do not have the available resources to clean, assess, publish and maintain the sheer number of datasets we use day by day. The technologies available have become more sophisticated and we have broadened our skills, embracing statistics and coding to deepen our analysis.

What’s more, in the past decade, data has become ever more freely available. This democratisation of information has been key to the evolution of our discipline. Gone, for the most part, are datasets trapped in PDFs. Instead, we are more likely to pull information from ready-made interfaces such as APIs; write code to scrape data from websites; and use advanced RegEx searches to pull information cleanly from multiple documents and build a spreadsheet from unstructured data.
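For readers curious what that last step can look like, here is a minimal Python sketch of using a regular expression to pull fields from several text documents into spreadsheet-style rows. It is an illustration only, not the Guardian’s actual tooling: the pattern, the function names (`extract_rows`, `to_csv`), the file names and the sample sentences are all invented for this example.

```python
import csv
import io
import re

# Hypothetical pattern: the sentence wording it matches is invented for this sketch.
RECORD = re.compile(
    r"aged\s+(?P<age>\d+)\s+from\s+(?P<area>[A-Z][A-Za-z ]+?)\s+"
    r"charged with\s+(?P<offence>[a-z ]+)"
)

# Made-up sample documents, standing in for text extracted from real files.
SAMPLE_DOCUMENTS = {
    "doc_a.txt": "A man aged 19 from Hackney charged with burglary. "
                 "A woman aged 32 from Croydon charged with theft.",
    "doc_b.txt": "No matching records here. "
                 "A youth aged 17 from Ealing charged with violent disorder.",
}


def extract_rows(documents):
    """Return one dict per regex match found across all documents."""
    rows = []
    for name, text in documents.items():
        for match in RECORD.finditer(text):
            rows.append({
                "source": name,
                "age": int(match.group("age")),
                "area": match.group("area").strip(),
                "offence": match.group("offence").strip(),
            })
    return rows


def to_csv(rows):
    """Serialise the extracted rows as CSV text, ready to open in a spreadsheet."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["source", "age", "area", "offence"]
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract_rows(SAMPLE_DOCUMENTS)
print(to_csv(rows))
```

Named groups keep each captured field tied to a column heading, so the same pattern can be run over any number of documents. Real source material is rarely this tidy, and a pattern like this would usually need several rounds of checking against the original documents.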

More than a decade on from the blog’s 2009 launch, data remains a core part of the Guardian’s news offering. Now the Data Projects team, we are a constituent part of the newsroom. We sit metaphorically (and, on our imminent return to the office, physically) between news and investigations. We are generalists: our skills can be turned to any area, from health to housing, from education to environment, from sport to international investigations, from asylum to government U-turns.

No longer a “new” way of doing journalism, data as news is here to stay. And, take it from me, you don’t necessarily need to be a numbers person to do it.

source: theguardian.com