A Tale of Two Wars:
Distantly Viewing the Iraq & Afghan 'War Diaries' 
David J. Thomas | thePortus.com | davidjthomas@usf.edu 
In the summer of 2010, an unknown source within the Defense Department provided WikiLeaks a highly classified database containing over 492,000 files somewhat misleadingly called the 'War Diaries.' The files concern what the military calls 'kinetic events,' meaning any time a situation became potentially lethal.
 The Source:
Full Description

Date & Time 

Latitude & Longitude

Type of Event

Initiating Faction

Unit Involved

Casualty Totals

More
The entire course of the war, as seen by U.S. Central Command, can be plotted in time and space.

Two modern wars can now be studied on scales never before possible, allowing new orders of questions to be proposed.

Yet making this material readily available to the public is still not so easy...
The Records 
Altogether, the database appears to contain every single event from both Iraq and Afghanistan, as known to U.S. Central Command, from January 1, 2004 to December 31, 2009.
WikiLeaks subsequently published the database on their site and so it is now a matter of public record and thus legal to view.
Obstacles:
• The online database lacks any analytical tools; records are accessed individually
• HTML files are unsuitable; data must be extracted for analysis
• The data contains errors requiring cleaning and preparation
• Reliability of the data is uncertain
Goals:
• Provide an excellent, large data set for students to learn the basics of visualization
• Spur discussion about privacy, data transparency, and mass analysis
• Drive interest in the research potentials of Digital Humanities
• Demonstrate how digital methods can be combined to ask new questions of existing information
Methods & Solutions
Working against the website remotely would have been painstakingly slow, so I needed a local copy of the files. I used Wget, an automated downloader, which starts at the first page of the database, follows the links to every file and page, and downloads all the HTML files along with images and other files (20 gigabytes) to my local drive. Here is the starting page for the Iraq files, and you will find the Afghan page here.
Raw HTML files are poorly suited to data analysis, so I wrote a Python script using the BeautifulSoup module to extract the information. The script locates and parses each HTML file, extracts the data, and finally writes it to a spreadsheet (.csv) file. You can find the script I wrote here; it must be run in the same directory as the downloaded files.
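For readers who want to see the shape of such an extraction, here is a minimal sketch. It is not the script linked above. It swaps BeautifulSoup for the standard library's `html.parser` so that it runs without extra installs, and it assumes a hypothetical page layout in which each field is a `<th>` label followed by a `<td>` value. The names `FieldTableParser`, `parse_report`, `reports_to_csv`, and the contents of `SAMPLE_PAGE` are invented for illustration; the real pages may be structured quite differently.

```python
import csv
import io
from html.parser import HTMLParser


class FieldTableParser(HTMLParser):
    """Collect label/value pairs from <th>/<td> cells (hypothetical layout)."""

    def __init__(self):
        super().__init__()
        self.fields = {}     # label -> value for one report page
        self._cell = None    # tag of the cell currently being read, if any
        self._text = []      # text fragments collected inside that cell
        self._label = None   # most recent <th> text, waiting for its <td>

    def handle_starttag(self, tag, attrs):
        if tag in ("th", "td"):
            self._cell = tag
            self._text = []

    def handle_data(self, data):
        if self._cell is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != self._cell:
            return
        # Collapse runs of whitespace left over from the HTML formatting.
        text = " ".join("".join(self._text).split())
        if tag == "th":
            self._label = text
        elif self._label is not None:
            self.fields[self._label] = text
            self._label = None
        self._cell = None


def parse_report(html):
    """Return the label/value pairs found in one HTML page."""
    parser = FieldTableParser()
    parser.feed(html)
    parser.close()
    return parser.fields


def reports_to_csv(html_pages, columns):
    """Return CSV text with one row per page; missing fields are left empty."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for html in html_pages:
        writer.writerow(parse_report(html))
    return out.getvalue()


# Invented sample page, used only to exercise the sketch.
SAMPLE_PAGE = """
<table>
  <tr><th>Type</th><td>Explosive Hazard</td></tr>
  <tr><th>Date</th><td>2006-03-14 09:30</td></tr>
  <tr><th>Latitude</th><td>33.31</td></tr>
</table>
"""

print(reports_to_csv([SAMPLE_PAGE], ["Date", "Type", "Latitude", "Longitude"]))
```

To process a downloaded directory, the same functions could be fed the text of each file found with `pathlib.Path(".").rglob("*.html")`. Writing an empty cell for a missing field, rather than skipping the row, keeps incomplete reports visible for the cleaning stage.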
The data contained many errors. To clean hundreds of thousands of entries efficiently, I used OpenRefine. Employing its clustering algorithms, I was able to quickly identify entries that differed only by slight variations in spelling. Using text facets, I instantly reclassified redundant event types.
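The idea behind this kind of clustering can be sketched in a few lines. The example below imitates the "fingerprint" key-collision method, one of several clustering methods OpenRefine offers; which method was actually applied to this data is not stated here, so treat the choice as an assumption. The function names and the sample event types are hypothetical.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Normalize a string so that trivial spelling variants share one key."""
    # Trim, lowercase, and strip accents by dropping non-ASCII remnants.
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = text.encode("ascii", "ignore").decode("ascii")
    # Remove punctuation, then sort and de-duplicate the remaining tokens.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group distinct values by fingerprint; keep only groups with variants."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {
        key: sorted(members)
        for key, members in groups.items()
        if len(members) > 1
    }


# Invented event-type labels, used only to exercise the sketch.
SAMPLE_TYPES = [
    "IED Explosion",
    "ied explosion",
    "Explosion, IED",
    "Direct Fire",
    "Direct  Fire ",
    "Raid",
]

for key, variants in cluster(SAMPLE_TYPES).items():
    print(key, "<-", variants)
```

Each group that comes back is a candidate for merging under a single label. A human still has to confirm each merge, since two entries can share a fingerprint and yet mean different things.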
I loaded the data into several different packages for visualization. I used the PowerMap plugin for Excel to generate the heat maps. For most visuals, however, I used the fantastic visualization package, Tableau Public.
Here is the cleaned data, so that anyone can visualize and analyze the information themselves.



In addition, I created two KML files, allowing every event in each theater to be seen on Google Earth (note: they are taxing on most systems).
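For anyone curious how a spreadsheet of events becomes a KML file, the following is a rough sketch using only Python's standard library. It is not how the files above were necessarily produced. The dictionary keys (`type`, `lat`, `lon`), the function name `events_to_kml`, and the two sample events are assumptions made for illustration. Note that KML writes coordinates as longitude first, then latitude.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def events_to_kml(events):
    """Return a KML document string with one Placemark per event."""
    # Serialize KML elements without a namespace prefix.
    ET.register_namespace("", KML_NS)
    kml = ET.Element(f"{{{KML_NS}}}kml")
    doc = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    for event in events:
        placemark = ET.SubElement(doc, f"{{{KML_NS}}}Placemark")
        ET.SubElement(placemark, f"{{{KML_NS}}}name").text = event["type"]
        point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
        # KML coordinate order is longitude,latitude.
        coordinates = ET.SubElement(point, f"{{{KML_NS}}}coordinates")
        coordinates.text = f'{event["lon"]},{event["lat"]}'
    return ET.tostring(kml, encoding="unicode")


# Invented events with approximate city coordinates, for illustration only.
SAMPLE_EVENTS = [
    {"type": "Direct Fire", "lat": 33.31, "lon": 44.36},
    {"type": "Explosive Hazard", "lat": 31.61, "lon": 65.71},
]

print(events_to_kml(SAMPLE_EVENTS))
```

A file with hundreds of thousands of placemarks is heavy to render, which is consistent with the warning above; splitting the output by year or region would be one way to lighten the load.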


Step 1: Getting Files 
Step 2: Extracting Information
Step 3: Scrubbing Data
Step 4: Visualization
 Following the initial database publication, several sites put the data online in the form of .csv files. Technically, this makes steps 1 & 2 unnecessary. However, this is intended as a demonstration of digital methods. Above all, I wanted to show students that online databases meant for browsing can be transformed into objects of analysis for which they were not originally intended.
The Visuals
*These results are only demonstration visualizations. They are not intended as a part of any sustained argument or larger narrative. Accordingly, the visualizations deal with disparate topics. Aside from a short note at the end, I will largely abstain from interpretation, which is not the aim of this exercise.
Iraq 
Afghanistan 
Baghdad 
Kandahar 
First Glances: Mapping the Casualties of War
Snapshots: Comparing Conflicts
Timeline and Bubbles: Events and Casualties Comparison by Region within Theaters
Faction Initiative and Casualty Comparison by Theater
Timeline: Total and Average Casualties by Most Common Event Types
Timeline: Total and Average Casualties by Faction Initiative
Timeline: Total Enemy Captured by Theater and Event Type
Total Number of Event Types by Regions within Theaters
Comparing Total Number of Enemy, Friendly, and Civilian Casualties by Event Type and Region within Theater
Scatter Plot Friend v. Foe Lethality (per event and average) by Event Types and Theater
Timeline: Deadliest Months. Total Killed in Action by Action and Theater
Timeline: Deadliest Days. Total Killed in Action by Action and Theater
Timeline: Deadliest Hours. Total Killed in Action by Action and Theater
Snapshots: Iraq
 Timeline: Faction Initiative and Event Types
 Bubbles: Friendly and Enemy Killed in Action by Event Types
Casualties by Event Type and by Most Common Friendly and Enemy Actions
Timeline: Number of Events and the Dates of Ramadan
*Note: This chart is in no way intended to suggest a causal relationship between these events and Islam (results are inconclusive anyway).
HexBin Map: Iraq, Concentration of Casualties and Highly Pre-planned Attacks
HexBin Map: Baghdad, Concentration of Casualties and Highly Pre-planned Attacks
Timeline: Friendly Casualties and Deadliest Months, Weeks, and Years
Snapshots: Afghanistan 
 Timeline: Number of Events and Casualties by Event Type
 Timeline: Casualties by Faction Initiative and Level of Enemy Planning
HexBin Map: Afghanistan, Concentration of Events and Casualties
Scatterplot: Annual Lethality Fluctuation (Friend v. Enemy) by Event Type and Faction Initiative
HexBin Map: Deadliest Events, Concentration of Events and Casualties
Timeline: Most Active Unit, Combined Joint Task Force 82, Number of Events and Casualties
HexBin Map: Most Active Unit, Combined Joint Task Force 82, Concentration of Events and Casualties
 Concluding Thoughts*
 * For now
The visuals above demonstrate the potential of combined digital techniques. They also barely scratch the surface of the nearly endless questions that can be asked of the information. Posing meaningful questions of the information and interpreting the results sensitively are the domain of humans. All signs suggest that the U.S. Government never thought these files would be seen. Yet, journalists have already discovered incidents where numerous eyewitness accounts contradict the 'War Diaries.'

In the end, this database reflects the sum of entries written by soldiers on the ground. Any number of factors might affect how events were reported, such as the fog of war, cultural differences, or the principle of CYA. In addition, by determining the fields (e.g. Event Type) as well as the terms (e.g. Friend, Enemy) available, the defense planners define the range and manner of possible expression.

Data must be read as sensitively as any textual source. Whatever the difficulties of reliability and interpretation, this source unquestionably gives the public insight into the events of two wars on a scale that was scarcely conceivable as recently as the Gulf War.

Nevertheless, this source remains critical even if we choose to assume a cautious stance towards the reliability of reporting, for it shows us how the war appeared to defense planners. Beyond events themselves, this data can illuminate the decision-making process of the U.S. political and military command structure.

The results found here are just a beginning. They were a random sample of individual questions which lacked a sustained narrative. Because this was a demonstration, such a narrative was not my intent here. But only in analysis and interpretation does this data start to have value. As many voices enter the debate about these and other digital issues, the contribution of scholars of the Humanities and Social Sciences to the discussion is more important than ever before. As for this source, I will leave it up to others (including some of my students) to undertake that analysis and interpretation for themselves.

Should you want to do so for yourself, below you will find links to the spreadsheets and Google Earth files. Best of luck!