Online only

From geeks to rock stars in one grim leap

On the freezing night of 8 February, 77 London Freelance Branch members stayed at home to hear about data journalism and statistics in this time of covid, from Pamela Duncan of the Guardian data journalism team.

Pamela Duncan addresses the February meeting

Pamela analyses and decodes data to produce stories, including those on Russian trolls' tweets being carried by UK media outlets; asylum seekers being sent to the poorest parts of the UK; and half a million pieces of data being lost by the NHS while being sent between GPs and hospitals in the five years up to 2016. She also contributed to the Guardian gender pay gap investigation.

Our speaker has been a visiting lecturer in data journalism at City, University of London and a trainer with organisations including the Thomson Foundation and the Centre for Investigative Journalism. She earlier launched the Irish Times data project.

Pamela joined us from Ireland - "a very cold part of the world but not as cold, apparently, as parts of England" that night. "We breed very good data journalists over here in Donegal" - her colleague Caelainn Barr is from the same county and the third in the team is Niamh McIntyre. "All three of us have trained as journalists, through the National Council for the Training of Journalists (NCTJ) and in-house at the Guardian."

The Guardian data projects team

She told us she has practically "forgotten that I used to do all those things - because for the past year my entire focus has been the covid pandemic".

Back in May, "my boss told BBC Radio 4 that 'three months ago, if someone at a party introduced themselves as a data journalist, you might have moved swiftly on. But now they're the rock stars.'"

"We thought we always were rock stars," Pamela notes: "It was just that other people in the newsroom didn't know how cool we are - until the start of the pandemic."

Data journalism is more collaborative than other forms of journalism, because of the wide range of skill sets required. Pamela is "a big believer that it is best done in cooperation with specialists" such as statisticians.

The team works in close collaboration with other journalists, too, "either with our own colleagues at the Guardian or internationally when that's required" - for example through the International Consortium of Investigative Journalists (ICIJ), whose members together broke the story of the Panama Papers, a trove of 11.5 million documents leaked from a law firm with interesting clients.

Before All This, Pamela's work was a mixture of deep investigation of large datasets and quick-turnaround stories for colleagues. And then there were colleagues "coming to us for day-to-day run-of-the-mill stuff" such as "how do you calculate a percentage change?"

Lots of journalists say they're "allergic to numbers". "I understand, because I hated maths. But I'm good at logic." In the newsroom some people get it, some will learn and some really won't. But now "they know more about what kind of stories we can do."

"When covid came along we had to completely change tack," Pamela recalled. "Instead of working on medium- and long-term projects we had to shift to doing daily stories. And those were stories that, before, we would have taken a couple of weeks to do. Now we had to drop everything else to get them done."

As a data journalist, Pamela checked the database. She found that she has had just over 400 stories in the Guardian in three years - and two-thirds of those are since March 2020, and 97 per cent of those are covid-related.

Reader response has changed, too: "Our audience have developed an appetite for statistics in a way we'd never have thought possible. We get called out on methodology." They have always been ready to defend the methodologies they use, but now they have to do so more, in detail. "And we increasingly have to explain our workings to the news desk, as well as doing everything else under such pressure."

The call for hard, clean stories

Pamela will talk with statisticians about a story, and they'll explain all the different things one has to take into account in interpreting the numbers, and the reasons why the probable interpretation may possibly not be the right one. After that, "it's very difficult to get the statisticians' conclusions through to the news desk - who want hard, clean stories. They want to hear that it's this or it's that... not 'maybe, might be, possibly, the confounding factors include...'"

Back in the day, the team used to do some coding - writing small computer programs, for example to process the screeds of data. They've had to drop that under the pressure to write more English. "It might have been helpful to do some coding, to automate updating data-sets." But those data sets change so often - a particular example being when the records of deaths were handed over to Public Health England (PHE). They'd have to rewrite at least a bit of each program each time a data set changed format, for example.

Pamela Duncan's slide: The trouble with covid data

"No-one was prepared to report on a pandemic," Pamela recalls, "including NHS England. They didn't have the processes to collect and distribute data in place." At the start of the pandemic not enough data was being produced. Then when they made more available it was full of errors, and that defined Pamela's work in the early phase of the pandemic.

The most glaring example was that, in the beginning of the pandemic, the government's count of deaths included only people who had died in hospital. In the real world "there were many people dying in care homes," Pamela reminded us: "what was omitted or done in error was sometimes what we were leading with."

Those stories did result in more and better data being available. But "even to this day the daily counts of deaths you're getting are only part of the true story." The Freelance suggests, for example, that as well as the headline figures you also look critically at the total of "excess deaths" compared with previous years. Then, we hint, you could consider that influenza deaths are way down because people are distancing and wearing masks - and ask what that says about the causes of the excess deaths.

Some of the errors that government made have been exploited by those who want to deny that there is a pandemic. An example is the miscounting of deaths that weren't necessarily covid deaths. For convenience, the headline PHE report is of deaths within 28 days of a positive test. Some of those will have been treated and discharged cured and then hit by a car... "If people are listed in error as covid deaths," Pamela noted, "this allows some to pick up on our coverage and portray it as evidence that the pandemic is being exaggerated - even though the actual total death toll from covid has always been higher than that presented by the government."

"The claim of exaggeration is very disrespectful to those who have died and to those they leave behind," Pamela points out.

The slides at the daily government press conferences were less than useful. In particular, their international comparisons were misleading because different countries were counting deaths in different ways "and we pointed that out".

A beginner's blunder

Then there were some stupid blunders. "In October," Pamela recalled, "we discovered that PHE had been overwriting a data sheet in the Excel spreadsheet program each day." This was a problem that a rookie data journalist could have warned them about: "they'd have told you early on that this wasn't the way to do it." Microsoft Excel has a published limit of about a million rows (1,048,576 exactly) and if you paste in data that would make a spreadsheet larger than that, some rows simply disappear. That's why whenever you are processing large amounts of data you want to use a "proper database" - typically one that you interact with through the SQL language - not a spreadsheet. (SQL stands for "Structured Query Language".)
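To make the point concrete, here is a minimal Python sketch, not a description of what PHE actually did. It assumes a made-up list of case records, and the table and column names are invented for illustration. It checks a row count against Excel's published worksheet limit and shows the same data going into a small SQL database (SQLite, from Python's standard library), where no such ceiling applies.

```python
import sqlite3

# Excel's published worksheet limit for .xlsx files.
EXCEL_MAX_ROWS = 1_048_576


def fits_in_excel(row_count, header_rows=1):
    """Return True if the data plus its header rows fit on one worksheet."""
    return row_count + header_rows <= EXCEL_MAX_ROWS


def load_cases(records):
    """Load (specimen_date, area, cases) tuples into an in-memory database.

    The table layout is hypothetical, chosen only for this example.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cases (specimen_date TEXT, area TEXT, cases INTEGER)"
    )
    conn.executemany("INSERT INTO cases VALUES (?, ?, ?)", records)
    conn.commit()
    return conn


def total_cases_by_area(conn):
    """Add up the cases for each area with a SQL query."""
    query = "SELECT area, SUM(cases) FROM cases GROUP BY area ORDER BY area"
    return dict(conn.execute(query).fetchall())
```

A check like `fits_in_excel(len(records))` before any export would at least raise the alarm rather than letting rows vanish silently.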

The result was that 16,000 cases were missed. That meant that information on an estimated 50,000 contacts for those people was lost. Pamela mentioned that a colleague has suggested that this error alone may have caused 1500 deaths - since those contacts could not be told to self-isolate and thus some of them will have passed on the infection when they shouldn't have.

"The best counts of deaths come from the statistical agencies," Pamela noted, referring to the likes of the Office for National Statistics (ONS). "But the problem is that these can only be published after a time lag - so the other government data does have a place, because we need speedy turnaround to take decisions," for example on re-imposing lockdowns.

A disunited kingdom of data

"This is not a United Kingdom of data," Pamela observed. "In the beginning we found that we had to follow at least six sources to keep an accurate account, and we still have to look at five data-sets each week." As mentioned, we need to look at the total of "excess deaths" as well as deaths directly attributed to covid one way or another: that "might include people missing routine hospital appointments for heart disease," for example.

Pamela regrets that "covid has fed into the culture war scene". Everyone involved "has to be extremely careful in reacting to comments on social media". In October the Guardian had a formal complaint from a reader on the team's methodology. The reader claimed that "someone dying with covid on their death certificate is a discredited method of counting." Pamela is "sure that the ONS would disagree with that".

On the positive side, "ONS and its equivalents in Scotland and so on have been very, very helpful." The government data dashboard and its "application program interface" for fetching data from it ("API") have improved.

"The scientific community has been incredible," Pamela is pleased to report. "They have been so generous with their time. As journalists we often complain about scientists working to a slower timescale" than getting the news out today - but they've stepped up to the mark so often with this. The Science Media Centre especially has been "an incredible resource".

With their help, "data journalism and even statisticians have gone from being just geeks to being rock stars."

The first of three pages of attendees at the February meeting, preserving some privacy

Questions

Several members wanted to know whether they could get a piece of the journalistic action. The answer is "probably you can't - most newspapers have their data ducks in a row by now." But, in the longer run, "every journalist and in particular every freelance journalist could benefit from knowing more about data." Pamela mentioned courses from the Centre for Investigative Journalism.

And if you go to a paper saying: "I have this exclusive data that I've built a story out of," Pamela advised, "and if you have real data competency, it's a real feather in your cap".

One member asked "what is coding?" How did Pamela get involved in that?

"When I started as a journalist I was at first freelancing for the Irish Times - and once a freelance, always a freelance. Back in the day, as a reporter straight out of college, I wanted to break through and get on the front page of the Irish Times. My first win was a story about councillors' expenses in Ireland, and the next was on the pension pots that members of the much-hated then Irish government were getting."

While she was working on these her sister said: "you could use Excel for this, you know" and "I was, like, 'Excel, eh?' and frantically looking it up." It "turns out that it's not a massive jump from using that program to SQL and coding." And "my boyfriend is a software developer - our career paths started so far apart - never ever did I think that [software development] and journalism would come together."

One example of "coding" is that "you might want to look at every page on a website, maybe to see whether the data on it has been updated. I used to visit each page 'by hand' and copy information from it. Now I can write a script - a small program - and it will hit every page and copy a certain piece of information and drop it in a spreadsheet for me."
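As a rough illustration of that kind of script, here is a minimal Python sketch using only the standard library. It is not Pamela's code: the choice of each page's title as the "certain piece of information" and all the function names are assumptions made for the example. A CSV file stands in for the spreadsheet, since any spreadsheet program can open one.

```python
import csv
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleParser(HTMLParser):
    """Collect the text inside a page's <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch_html(url):
    """Download one page and return its HTML as text."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_title(html):
    """Return the page title with surrounding whitespace removed."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


def scrape_to_csv(urls, out_file, fetch=fetch_html):
    """Visit each URL, copy its title and write one CSV row per page."""
    writer = csv.writer(out_file)
    writer.writerow(["url", "title"])
    for url in urls:
        writer.writerow([url, extract_title(fetch(url))])
```

Calling `scrape_to_csv(list_of_urls, open("pages.csv", "w", newline=""))` would produce the file. The `fetch` argument is there so the script can be tried out on saved pages before it is pointed at a live site.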

Another member asked about people who are offered covid vaccinations, but when the time comes none is available. "Are they counted in the government's figure for people 'offered' a vaccination?"

Pamela is "fascinated with that question of the government's insistence that they will tell us who has been offered a vaccine, as opposed to being given one". But she understands "that they count only actual possible vaccinations". That said, "things are happening so fast that perhaps this has changed since I was at work on Friday. If any of you know better, do let me know."

The member found this especially important given vaccine hesitancy in some groups. Pamela "agrees entirely: in a sense the more data comes out, the harder it is to keep the government accountable".

Is data political?

"A year ago," Pamela said, "I would have answered, 'yeah, in general terms...' But since then we've had a lot of pushback about our methodology and we see our reports quoted to argue the opposite of what the data shows. We're told we're lying through the data - when we're the ones trying to show it in its truest form and show the failings in it... so it's deeply political."

Member Dapo Ladimeji is interested in data visualisation. He recalled the Fraud Office giving a presentation to demonstrate that it had started using geodata. A data visualisation popped up - and a further suspect in the case leapt off the screen, as it were. (The data also showed, though, that he'd promptly left the country.) So: why is so much data visualisation on covid rather poor?

Pamela answered that "we're limited by the data-sets we have. Lots of the geographical data in them is in forms that don't speak to each other." For example: "on vaccination we get data for jabs nationally and at NHS region level. And we have lots of really good data for infection figures - but that's listed by local authority area." NHS regions don't simply map onto local authorities, so they can't draw out any effects by either kind of area breakdown.

On the other hand, "I think the most important thing is that the NHS is getting the vaccine roll-out done. It'd be a nice extra if in the weeks to come we could map it in a more accessible way."

In general, newspapers "have been overwhelmed with data analysis at a time when we're also trying to come to terms with working from home. I haven't been in the office since almost exactly a year ago, apart from once to clear my desk."

Another member sees data being "weaponised" by people who don't understand it - in particular those who don't understand "data artifacts": "one radio talk show host comes to mind..." The Freelance offers a simple and relevant example of a "data artifact": you may see an apparent seven-day cycle in death rates. That is in fact caused by delayed registration over the weekend.
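One common way to stop a weekly registration cycle from misleading the eye is a seven-day rolling average. The Python sketch below is illustrative only: the daily figures used to try it out would be invented, and the function name is ours, not that of any official method.

```python
def rolling_mean(values, window=7):
    """Trailing moving average: one value for each complete window.

    With window=7, every average covers exactly one of each weekday,
    so a dip caused by weekend registration delays cancels out.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]
```

Fed a made-up series that repeats the same weekly pattern, with low counts at the weekend and a catch-up early in the week, the function returns a flat line: the cycle was an artifact of registration, not a change in the underlying rate.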

Pamela would have liked to "just go to the pub and talk about this afterwards..." All sorts of people are using data without understanding. She notes the "common maths and data errors that creep into reporting," too, and suggests we look at the corrections and clarifications page for examples close to home. Some are as simple as the confusion between a percentage difference (the ratio of two numbers) and a difference expressed in percentage points (one percentage figure minus the other).
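The distinction is easy to show with a short Python sketch. The function names and the example figures are ours, chosen purely for illustration.

```python
def percentage_change(old, new):
    """Relative change from old to new, expressed as a percentage of old."""
    if old == 0:
        raise ValueError("percentage change from zero is undefined")
    return (new - old) / old * 100


def percentage_point_difference(old_pct, new_pct):
    """Plain subtraction of one percentage figure from another."""
    return new_pct - old_pct
```

If a rate rises from 10 per cent to 15 per cent, `percentage_point_difference(10, 15)` gives 5 percentage points, while `percentage_change(10, 15)` gives a 50 per cent increase. Reporting the first as "a 5 per cent rise" is the classic error.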

"There is a lot of talk among the data journalism community and among the statisticians to whom we've become ever closer," Pamela reports, "about data literacy" - but we're not in a position to make sudden change.

Returning to people who actually abuse data: "If they are determined to spread misinformation they're going to do it and they're going to abuse our data - but all we can do about it is to tell the truth."

Member Julio Etchart asked whether numbers are available for people in BAME communities who are reluctant to take vaccines, given recent reports that, for example, Black over-80s in England are half as likely as white people to have had the jab.

Pamela responded that two weeks before the meeting NHS England started providing some breakdown of people going for vaccines. She didn't work on the story Julio mentioned - she had done some work on the numbers it was based on, and that work wasn't published because it "would have clashed". She suggests looking at the NHS England vaccination site to get the latest weekly file and cross-referencing that with the latest estimates from ONS of BAME population by area. She recalls finding that as of the Thursday before the meeting white people were 2.4 times more likely than mixed-race people and 2.2 times more likely than Black people to have had the vaccine.
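The cross-referencing Pamela describes boils down to dividing one table by another and comparing the results. Here is a minimal Python sketch of that arithmetic. The group labels and function names are hypothetical, and real work would use the published NHS England and ONS files rather than anything typed in by hand.

```python
def vaccination_rates(vaccinated, population):
    """Share of each group vaccinated, for groups present in both tables."""
    return {
        group: vaccinated[group] / population[group]
        for group in vaccinated
        if population.get(group, 0) > 0
    }


def relative_likelihood(rates, reference, group):
    """How many times more likely the reference group is to be vaccinated."""
    return rates[reference] / rates[group]
```

With purely invented numbers, say 600 of 1,000 people vaccinated in one group and 125 of 500 in another, the rates are 0.6 and 0.25, and `relative_likelihood` reports that the first group is 2.4 times as likely to have had the jab. The result is only as good as the population estimates in the denominator.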

Julio added that he had had a vaccine but there was no questionnaire asking for his ethnic origin.

Beyond covid

A member asked whether Pamela had ever worked on environmental data - in the context of the forthcoming UN COP26 Climate Change Conference from 1 - 12 November in Glasgow: "do you have any feeling about digging into huge amounts of data - where there may be topical stories waiting to be discovered?"

"Yes." Pamela believes that this "is the story of our time and one that's taken a back seat as we have so many other things on during covid". She's concerned too about Universal Credit and growing inequality in the country - "and these things are probably all going to be exacerbated by covid."

For the moment "we don't have the capacity to do all of them - covid has been all-consuming for the past year. In 'normal' times we could do with a fourth person and with five we'd be just about managing with covid."

Before covid "we did an investigation into the amounts of cash that Conservative politicians got from climate contrarians. But now we're being thrown around by the covid tide. I hope to get back to that story - and more on contrarians trying to influence policy. I'm absolutely dying to get my teeth into these things as soon as I can."