Behind the scenes of Aftenposten’s Wikileaks visualization

Eirik Wallem Fossan, Flash developer for the Norwegian media organization Aftenposten, recently told me about a newly launched project titled “What caught the attention of US diplomats?” that “visualizes a word count of more than 200,000 messages between the US State Department and its embassies and consulates from 1966 through February 2010.” Impressed by the sheer scope of this project, I wanted to learn more.

Below is Eirik’s account of how they obtained the data from Wikileaks, parsed it into a usable format, and wrestled with the ethical considerations of publishing sensitive information. Read on and let me know what you think!

Handling the raw data

Christmas came early to Aftenposten, and one of our IT guys basically didn’t get to celebrate it at all when he was asked to handle the massive data file from Wikileaks. Security was the top priority, and after a night of tweaking, they managed to squeeze it into a database and move it to a separate, secure server with no access to the web.

The database was then made searchable, and a select group of journalists was granted access to do their research (all stories: http://www.aftenposten.no/nyheter/uriks/wikileaks/).

Since then they’ve tightened security even more, and we won’t go into the details, but there have been attempts at breaking in.

Concept

We quickly decided to aim for content analysis of the material. There are several ways to visualize word counts, but word clouds seem to be a fairly established format. Besides the correspondence itself, the data was also structured according to time (months) and geography (origin of cable).

Refining the data

To work with the data and generate datasets for our project, we had to request a full copy of the database. This was of course a nightmare for our security-minded IT department, but nevertheless we got a full SQL dump on a two-year-old MacBook Pro with a broken keyboard, an encrypted hard drive and a freshly installed OS.

We started off with MySQL to generate one text file for each month of correspondence from all embassies and consulates. This resulted in more than 100 text files. We then applied the following process to all text files:

* Numbers, special characters and punctuation were removed (except 9/11 and some apostrophes). Tool: sed (a scriptable editor)
* All text was converted to uppercase. Tool: sed
* Word count. Tool: awk (an awkward – but fast – programming language)
* A long list of more than 400 stop words was stripped from the word count. Tool: grep (Unix command)
* The top 100 remaining lines of each month were written to a new file. Tool: head (Unix command)
* Text files were finally converted to XML. Tool: sed
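
The steps above fit into a fairly short script. What follows is a minimal Python sketch of the same pipeline, not the sed, awk and grep commands actually used in the project. The stop-word set, the function names and the XML layout are all assumptions made for illustration, and the sketch only recognizes straight apostrophes.

```python
import re
from collections import Counter
from xml.sax.saxutils import escape, quoteattr

# Hypothetical stop-word list; the project used more than 400 entries.
STOP_WORDS = {"THE", "AND", "OF", "TO", "IN", "A", "ON"}


def top_words(text, stop_words=STOP_WORDS, limit=100):
    """Return the `limit` most frequent non-stop words as (word, count) pairs."""
    # Uppercase everything, then keep "9/11" and runs of letters that may
    # contain inner apostrophes. Digits and punctuation are dropped.
    tokens = re.findall(r"9/11|[A-Z]+(?:'[A-Z]+)*", text.upper())
    counts = Counter(t for t in tokens if t not in stop_words)
    return counts.most_common(limit)


def to_xml(month, pairs):
    """Serialize one month's word counts as XML (element names are assumed)."""
    lines = [f"<month id={quoteattr(month)}>"]
    for word, count in pairs:
        lines.append(f'  <word count="{count}">{escape(word)}</word>')
    lines.append("</month>")
    return "\n".join(lines)
```

Calling `to_xml("2001-09", top_words(text))` on the text of one monthly file would give the XML for that month. The month label format is also just an example.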

This was a time-consuming process, and all the time we were googling and hacking at the data, we were painfully aware that lots of programmers could do the same job with an embarrassingly short shell script.

The process was repeated and improved in several iterations, and finally we ran the same queries and scripts on a selection of embassies and consulates.

Visualising the data

Generating a simple word cloud, where words are placed next to each other on structured lines and scaled according to their individual values, is straightforward. Placing the words in a more visually appealing form, like those seen on www.wordle.net, is a completely different matter.

After extensive googling, we finally found a hint from the Wordle creator himself, Jonathan Feinberg.

The solution was to let each word follow a spiral’s motion and continuously do bitmap hit tests until it found a vacant slot outside the growing word cloud. The process was quite CPU intensive, so optimization was key here, and after a bit of fine-tuning we now average around 400 to 700 ms for a word cloud consisting of 100 words.
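
The spiral idea can be sketched in a few lines of Python. This is an illustration only, not the ActionScript class used in the graphic: it tests plain bounding rectangles rather than bitmaps, so words are packed less tightly, and the function names, step size and spacing are assumptions.

```python
import math


def overlaps(a, b):
    """True if two (x, y, width, height) rectangles intersect."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def place_words(sizes, step=0.1, spacing=1.0, max_steps=100_000):
    """Place rectangles of the given (width, height) sizes along a spiral.

    Returns one (x, y, width, height) box per input size, in input order.
    """
    placed = []
    for w, h in sizes:
        for i in range(max_steps):
            t = i * step
            # Archimedean spiral: the radius grows linearly with the angle.
            cx = spacing * t * math.cos(t)
            cy = spacing * t * math.sin(t)
            box = (cx - w / 2, cy - h / 2, w, h)
            # Hit test against every word placed so far.
            if not any(overlaps(box, other) for other in placed):
                placed.append(box)
                break
        else:
            raise RuntimeError("no free slot found; raise max_steps")
    return placed
```

Each word starts at the center and walks outward until its box is free, so passing the largest words first tends to put them in the middle. Every candidate position is checked against all placed words, which gives a feel for why the real thing needed optimizing.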

Btw, Arthropod is a great tool for monitoring general performance levels.

Usability was a major issue in this project. We made sure to make all interactive choices blue and analysis results red, to place the search field in the upper right-hand corner, and to keep the overall look sober. Despite our measures, the first version of the graphic was close to slaughtered. Computer-literate colleagues looked like 80-year-old newbies when they tested it. They simply didn’t get it, and important elements of the graphic were obviously invisible.

The next and final version of the graphic was somewhat simplified. We had to kill a few of our beloved darlings, like annotations of analysis results and an option to “keep” a selected word when navigating the timeline. The lesson is that you can’t make it simple enough.

The last touch was to integrate a direct Wikipedia search, which we reckon adds significant value to the whole piece. This actually involved setting up a proxy due to cross-domain issues, and contacting Wikipedia to get their permission to do so.

Having a library of ActionScript utilities simplifies the production process, but this project involved quite a bit of coding from scratch. The bonus is that we now have a generic class to present word clouds in Flash.

Considerations

Looking back, we see several things we perhaps should have done differently. There are still a few disturbing elements in the graphic, like the ordered list on the right-hand side. Also, the search result from Wikipedia is sometimes a bit bonkers, and a country’s name should be removed from its dataset to get more interesting results.

There was also a serious ethical issue to consider. Did we risk publishing the names of persons mentioned as sources or informers in the cables? Yes, but no. Every single word is removed from its original context, and being named in a word cloud does not imply that you are an American spy or informer. That goes for every Laden, Smith or Chang.

And what about its information value? Well, in some ways the graphic shows how enormous and important the raw data is for understanding US diplomacy over the last 30 years, but the word clouds themselves seldom do more than hint at trends in the underlying material. This is of course the common criticism of word counts as a methodology, but as an exploration or research tool, a backgrounder or an illustration for other stories, it does the job.

Involved in the project

* The journalist who got hold of the data
* The trolls in our IT department, Media Norge IT, who have chosen to remain anonymous: Initial database import, export and security.
* Atle Brunvoll (@m_mnemonic): MySQL queries, data manipulation, CMS and Facebook integration
* Eirik Wallem Fossan (@eirikwfossan): Interface design and Flash programming

This work, unless otherwise expressly stated, is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 Unported License.
