“Most of the time we used existing tools that we improved ourselves, according to our need,” says Romera.
Essentially every story the ICIJ tells with its leaked documents includes an interactive network graph, built with Neo4J and the visualization tool Linkurious, that lets readers see how the companies and individuals in the story interconnect. The graph database is also useful to reporters looking to uncover stories and fact-checkers working to verify them, Romera says.
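To make the idea of tracing interconnections concrete, here is a minimal sketch in plain Python, not Neo4J or Linkurious, and not the ICIJ's actual code. It assumes a tiny invented graph stored as a dictionary and uses a breadth-first search to find the shortest chain of links between two entities. All names in it, including `shortest_connection`, are hypothetical.

```python
from collections import deque


def shortest_connection(graph, start, goal):
    """Return the shortest chain of linked entities from start to goal, or None."""
    if start == goal:
        return [start]
    visited = {start}
    queue = deque([[start]])  # each queue entry is a path explored so far
    while queue:
        path = queue.popleft()
        for neighbor in graph.get(path[-1], ()):
            if neighbor in visited:
                continue
            if neighbor == goal:
                return path + [neighbor]
            visited.add(neighbor)
            queue.append(path + [neighbor])
    return None  # the two entities are not connected


# Invented example data: people, companies and a law firm linked to one another.
graph = {
    "Person A": ["Company X"],
    "Company X": ["Person A", "Law Firm Z"],
    "Law Firm Z": ["Company X", "Company Y"],
    "Company Y": ["Law Firm Z", "Person B"],
    "Person B": ["Company Y"],
}

path = shortest_connection(graph, "Person A", "Person B")
print(" -> ".join(path))
```

A graph database answers the same kind of question at far larger scale with its own query language; the sketch only illustrates what "how the companies and individuals interconnect" means as a computation.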
To protect the consortium’s sources, Romera can’t go into much detail about how the Paradise documents were received, but the materials included emails, spreadsheets and plenty of PDF files.
“One of the biggest challenges was the PDFs, because some of them were only images without text in it,” he says.
That meant that before the files could be loaded into any kind of search engine, they’d have to be processed with optical character recognition software that could turn images of words and numbers into actual data a computer could understand. The ICIJ has built an open-source tool called Extract for efficiently pulling useful data from these kinds of file dumps, and it made some improvements this go-round to handle some new document formats, Romera says.
Emails, too, present their own challenges for analysis, because there’s so much duplicated data across files. Earlier messages are reproduced in replies and forwards, and the same message can be included multiple times in the data repository: once for its sender and once for each of its recipients.
“If we were not deduplicating the database, the number of the total documents would be much bigger,” says Romera.
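One simple way to picture this kind of deduplication is to fingerprint each message by its content and keep only the first copy seen. The sketch below is an illustration under stated assumptions, not the ICIJ's method: the field names (`sender`, `subject`, `body`, `mailbox`) and the sample messages are invented, and it only catches copies of the same message stored in several mailboxes, not earlier text quoted inside replies and forwards.

```python
import hashlib


def message_fingerprint(sender, subject, body):
    """Hash the fields that identify a message, ignoring which mailbox held it."""
    # Lowercase and collapse whitespace so trivially different copies match.
    normalized = "\n".join(
        " ".join(field.lower().split()) for field in (sender, subject, body)
    )
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(messages):
    """Keep the first copy of each distinct message (hypothetical dict layout)."""
    seen = set()
    unique = []
    for msg in messages:
        key = message_fingerprint(msg["sender"], msg["subject"], msg["body"])
        if key not in seen:
            seen.add(key)
            unique.append(msg)
    return unique


# Invented sample: the first message appears in both the sender's and a recipient's mailbox.
sample = [
    {"mailbox": "alice", "sender": "alice@example.com",
     "subject": "Q3 filing", "body": "Draft attached."},
    {"mailbox": "bob", "sender": "alice@example.com",
     "subject": "Q3 filing", "body": "Draft  attached."},
    {"mailbox": "bob", "sender": "bob@example.com",
     "subject": "Re: Q3 filing", "body": "Thanks."},
]

unique_messages = deduplicate(sample)
print(len(sample), "messages reduced to", len(unique_messages))
```

Hashing normalized content keeps the check cheap even for millions of files, since each message is compared against a set of fingerprints rather than against every other message.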
Then, the data can be loaded into the graph database, a tool that Neo4J CEO Emil Eifrem says is becoming more prevalent in journalism overall, presumably as datasets outlining social and commercial networks become that much more common. ICIJ’s success using Neo4J to analyze the Panama Papers has helped stir interest among other organizations working with similar financial data, like banks and tax agencies around the world, Eifrem says.
On the journalistic side, the company counts BuzzFeed, The Guardian and The New York Times as users. It also offers training and support to news organizations and recently sponsored a connected-data fellowship at ICIJ.
Recent releases of Neo4J have included more support for data visualization without the need for developers or external tools, software for loading information from other common databases, and algorithms that can help find interesting patterns in the data.
“Clustering algorithms are very interesting for finding, for example, clusters of transactions,” says Eifrem.
The ICIJ plans to release additional stories and data in the coming weeks, Romera says, and it’s likely the Paradise Papers will continue to have an impact for some time to come. Nawaz Sharif, the former prime minister of Pakistan, continues to face corruption charges relating to revelations from the Panama Papers.
Given the high stakes involved for so many powerful people, the ICIJ and its partner organizations naturally emphasize digital security: Romera says ICIJ helps its partners install security software and build secure computing environments where they can analyze the data without leaks of their own.
Phishing emails are a regular hazard of work on Romera’s team, he says, with some even directed at newer employees whose email addresses aren’t widely circulated. The team uses two-factor authentication for its own network logins and takes other security measures as well, Romera says—including carefully validating the leaked data it receives.
“You should never trust your data,” he says. “You have to check everything.”