Subscribe to EagerEyes.org feed

What is Data Journalism?

Wed, 07/30/2014 - 05:03



Is a data journalist one who unearths the data, who finds the insights in the data, who finds the right way to visually communicate the data? The answer is, of course, all three. But let’s tease them apart and look at each separately.

Unearthing the Data

First, the data has to be found. And finding, in the journalism context, doesn’t always mean just scouring the web. There may be sources nobody else has access to. Data may have to be wrangled out of the government’s hands with Freedom of Information Act (FOIA) or similar requests. That data then might need to be cleaned, processed, sometimes even made machine-readable in the first place.

Cleaning data is not easy, and can be incredibly time-consuming and error-prone. It requires good knowledge of data cleaning tools, scripting languages, optical character recognition (OCR), and the common pitfalls of different data formats and types. It can be difficult to verify that cleaning the data has not inadvertently destroyed or skewed it in some way.

One example I am familiar with is the work Sarah Cohen and her colleagues did for their Careless Detention series in The Washington Post in 2008. They collected data on the deaths of immigration detainees (using FOIA requests, etc.) and, on a simple map, were able to see regional patterns in what they classified as questionable deaths. The resulting story was based on the people rather than the data, but the data led to the story.

Finding the Insights in the Data

Once the data has been found, it needs to be analyzed to find out what it actually contains. Most of the time, there is nothing interesting in it. But in those cases where a discovery is made, it can make for a great story.

The skills required here are quite different from the data digging. The key skill is data analysis: statistics, data exploration, hypothesis testing, etc. It can also require domain knowledge, e.g., about economics when the story is about unemployment, etc.

One of my favorite examples is Hannah Fairfield’s Driving Safety in Fits and Starts in The New York Times two years ago. The data had to be collected, but was publicly available. The key part that made for the story, however, was finding explanations for the patterns that emerged (and also a very compelling visual representation).

Communicating the Insight

Given the insight, getting that across should be easy, right? Well, no. That’s the big mistake many academics make, and this is where you can see the most impressive work.

Perhaps my favorite example of that is Jonathan Corum’s whale/lunge feeding graphic (discussed in his phenomenal Tapestry talk last year). The data was collected by scientists, and they had already created a chart for a paper. Corum’s insight was to put something back into the graph that the scientists had left out: the depth the whale was diving to. Perhaps this was obvious to the scientists, but certainly not to the readers of The New York Times.

A more recent example was on the behavior of dogs in different settings. The data again came from a paper, which even included almost the exact same chart that was used in the NY Times piece. But only almost. The key differences are what turn a boring bar chart into an interesting, readable one: color, spacing, and some cute drawings. Enough for Kaiser Fung to use it as an argument that visualization can be worth paying for.

However, Greg McInerny argues that the NY Times version loses some important elements of the chart, namely the statistical significance of the differences. He also proposes some alternative designs that retain most of the stylistic improvements, while adding a bit more information.

Either way, the key part here is finding an interesting story like a gemstone, pulling it out of the surrounding material, and making it shine. It doesn’t always have to start from the raw data, though.

What Makes A Data Journalist?

All these examples are pieces of data journalism. Not all involve visualization. Not all entailed digging for data. Not all even require finding the insight yourself.

What data journalism requires, then, is a broad mix of skills and instincts. Not all are necessarily needed in all cases. But you never know which ones a story will require. Many of the technical and math skills are still rather unusual among the people typically working in journalism. That makes this new direction so interesting but also so problematic: how do we know if we can trust the work produced? Alberto Cairo is skeptical, and wants data journalism to up its standards.

But in a way, data journalism is the logical extension of what journalism has been all along: collecting facts and data, understanding the implications, finding the story, and reporting it. The tools and materials are changing. But soon, all journalism will be data journalism in one form or another.

Putting Data Into Context

Thu, 07/24/2014 - 04:59



Raw numbers are easy to report and analyze, but without the proper context, they can be misleading. Is the effect you’re seeing real, or a simple result of the underlying, obvious distribution? Too many analyses and news stories end up reporting things we already know.

This is a particular issue with data that has a spatial component. When the data is shown on the inevitable map, you often just see a distribution of people. Where there are more people, there are more tweets, there’s more crime, there are more customers, there are more coffee shops, etc. Many maps in fact show nothing but the underlying distribution of people. As usual, xkcd has captured the issue beautifully.

Crime on New York City Subways

An example of this is a map recently published by the New York Daily News showing the amount of crime in subway stations in New York City. Each bubble on the map shows one station, with the size of the bubble representing the number of crimes. Mousing over the bubbles soon reveals that many of the higher-crime stations are the ones with more than one line passing through (and thus likely more people).

While it is possible to switch the map to a view that shows the number of incidents per 100,000 people passing through the station, the story leads with the raw numbers. The data was actually collected over a period of five years (July 2008 to June 2013), which increases their magnitude. Reporting crimes per year seems like a more honest way of looking at these numbers.

The station with the most crime (Times Square/Port Authority) had 1791 incidents during that time; that’s less than one per day over five years. Incidentally, that is also New York’s busiest station, with more than 166,000 people passing through every single day. So that seemingly scary number of crimes comes out to one for every 170,000 people, which puts it in the safer half of stations (rank 219 out of 420 in the dataset).
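The arithmetic behind those Times Square numbers can be sketched in a few lines of Python. This is only a rough reconstruction: the exact start and end dates of the collection period are my assumption (the text only gives months), and the daily ridership is treated as exactly 166,000 even though the actual figure is somewhat higher.

```python
from datetime import date

# Figures quoted in the text for Times Square/Port Authority.
incidents = 1791
daily_riders = 166_000  # the text says "more than", so this is a lower bound

# Collection period: July 2008 to June 2013. The exact days are an
# assumption; only the months are given.
period_start = date(2008, 7, 1)
period_end = date(2013, 7, 1)  # exclusive end of the period
days = (period_end - period_start).days

# Raw counts put on a per-day basis.
incidents_per_day = incidents / days

# Normalize by the number of people passing through over the whole period.
rider_passages = daily_riders * days
riders_per_incident = rider_passages / incidents
incidents_per_100k = incidents / rider_passages * 100_000

print(f"{incidents_per_day:.2f} incidents per day")
print(f"one incident per {riders_per_incident:,.0f} riders")
print(f"{incidents_per_100k:.2f} incidents per 100,000 riders")
```

Under these assumptions, the count comes out to a bit under one incident per day and roughly one incident per 170,000 riders, which matches the figures above. The map's own per-100,000 numbers may be computed somewhat differently.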

The number of people passing through each station is actually included in a little visualization in that article, though it’s not shown directly. Since they use Tableau Public for that, I was able to tease it out and create a pair of maps comparing the total number and the per–100,000 numbers side by side. Click on a bubble in one of the maps below to highlight it in both. There is also a tooltip listing the numbers.

The difference between the total and the per–100,000 number is important not just because it changes the picture. It’s the much more interesting number. The total number of crimes means nothing for the individual. What is much more relevant is how many crimes happen per person, because that provides a much closer estimate of each individual’s risk of becoming a victim. Large total numbers make for great headlines (Over 191,000 crimes total!), but they’re largely meaningless otherwise.

The per–100,000 number varies widely even within the top few stations based on total number of incidents. Number 2, 125th St, has over six times the rate per 100,000, and number 5, Broadway Junction, has almost ten times the rate (but less than half the total number) compared to Times Square. The distribution of crime per 100,000 on the map is also clearly different, with the Bronx, Brooklyn, and the Rockaways more dangerous than Manhattan.

There is also an interesting outlier, Broad Channel, which is discussed in the article. It is less of an outlier than it appears here, though, because the number of daily riders per station in the dataset is people going through turnstiles, which does not count people transferring between lines. Those transfers make up a large portion of the people in some of the stations, like Broad Channel.

So even this first step of putting the data into context is not complete, but would benefit from better data about the actual number of people passing through each station per day. And there is clearly a lot more data that could be collected here: time of day, how busy each station is during those times, what kinds of crimes occur during what times of day, etc.

Comparing Cities

Another example is this New York Times story from over a year ago, which still gets my blood boiling when I read it. It compares New York City to Seattle in terms of their “geek appeal,” and talks about data science, among other things. The irony is completely lost on the author, though, who uses absolute numbers to compare two cities whose populations are vastly different.

Did you know that Seattle had only 139 Starbucks in 2013, but New York had 271? Or that there were a puny 85,000 IT workers in Seattle, compared to 168,400 in NYC? What a provincial little town that Seattle place is!

But when you account for the fact that New York is almost nine times as large, things look a bit different. Seattle has the highest density of Starbucks per capita in the world (second largest? Las Vegas!). All those numbers where Seattle has half of what New York has really mean that on a per-capita basis, Seattle has roughly 4.5 times as much!

Even in terms of venture capital, things look pretty good when you consider the difference in population: Seattle’s $671 million ends up being more than 3.5 times the per-capita amount of New York’s $1.6 billion.
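Here is a minimal sketch of that per-capita comparison in Python. The population ratio of 9 is a rounded assumption taken from the "almost nine times as large" figure above, not an exact census number, and the helper name per_capita_ratio is made up for illustration.

```python
# "Almost nine times as large" per the text; a rounded assumption.
POPULATION_RATIO = 9.0


def per_capita_ratio(seattle_total, new_york_total,
                     population_ratio=POPULATION_RATIO):
    """Return how many times more Seattle has per resident than New York.

    Dividing each total by its population and taking the ratio simplifies
    to (seattle_total / new_york_total) * (NY population / Seattle population).
    """
    return seattle_total / new_york_total * population_ratio


# (Seattle, New York) totals as quoted in the text.
figures = {
    "Starbucks stores": (139, 271),
    "IT workers": (85_000, 168_400),
    "Venture capital (USD)": (671e6, 1.6e9),
}

ratios = {name: per_capita_ratio(s, n) for name, (s, n) in figures.items()}

for name, ratio in ratios.items():
    print(f"{name}: Seattle has about {ratio:.1f}x New York's per-capita figure")
```

With these inputs, the Starbucks and IT worker ratios land around 4.5, and venture capital comes out above 3.5. A more careful version would use actual population figures for the same year as each statistic.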

What to Compare

Comparing numbers requires understanding, and controlling for, context. Just throwing around raw numbers often leads to wrong conclusions, or shows patterns that are already known (New York is bigger than Seattle, more crime occurs where there are more people). Not accounting for those is not just a small lapse, it’s wrong.

The more complicated question is, what to compare to? Daily riders are a good first step when looking at the subway crime data, but there are caveats. The analysis would be much better with more, and more fine-grained, data. How should two cities be compared? Is population the right metric? What about GDP? How do we meaningfully compare bike lanes between New York and Seattle, for example? What can we use to make them comparable? Street miles? Number of bicycles? Number of bicycle miles ridden? Where do we get that data from?

But even when the comparison data is not perfect, some normalization is better than none at all. And it is important to understand the limitations and uncertainties of the analysis, even when perfect comparison data is not available.

Teaser image: Heatmap, by Randall Munroe/xkcd

Review: Kraak, Mapping Time

Thu, 06/26/2014 - 05:01



Can you write an entire book about a single chart? Even if that chart is supposedly the best one ever? Menno-Jan Kraak’s new book, Mapping Time: Illustrated by Minard’s Map of Napoleon’s Russian Campaign of 1812, discusses the historical context of Minard’s work and his life, and walks through a number of design exercises to show the same or similar data in different ways.

The graphic, which Tufte says “may well be the best statistical graphic ever drawn,” is also often referred to as Napoleon’s March. It depicts the size of Napoleon’s army during the Russian Campaign of 1812, which Napoleon began with over 420,000 men, and ended with only 10,000.

Personal Connection

Kraak has something of a personal connection to the graphic, because a brother of his great-great-grandfather, Gerrit Janz Kraak, fought in the Russian Campaign and died in the Battle of Berezina. That gives some structure to the book and provides a focus on that particular battle. That makes sense too, since it was the one where Napoleon lost half his remaining men during a two-day battle.

The writing is generally clear and concise. The format of the book is of the coffee-table variety, which is a bit odd for its content. It does give Kraak the room for a very well-designed and laid-out book, though: text generally refers to images on the same or facing page, and few paragraphs break across pages.

There is a lot of additional historical context that goes beyond anything I have seen so far. That is true both for Minard’s background and earlier work, as well as the Russian Campaign. Kraak also clears up a few misunderstandings, including the losses at Berezina. Many accounts have Napoleon’s men fall through the ice, when the problem actually was that it wasn’t nearly cold enough for ice cover and bridges had to be built.

After the description of the graphic and its background, Kraak broadens the scope to talk about how to depict time and space in maps. That discussion always uses the data points from Minard’s chart as the example, and thus stays very well focused. Some of the techniques are more interesting and illuminating than others. In particular, I find the space-time cubes to be largely ineffective, and the same goes for cartograms. But some of the other maps help see some aspects of the data more clearly, like the speed of the army and the time spent in different places.

Presentation vs. Analysis

While I appreciate a self-confident writer, I find him to be a bit too certain that Minard would have found his analysis useful (p. 94). It’s certainly possible, but all signs point to Minard having a very specific message in mind, and all the added data and complexity would have taken away from that. Kraak acknowledges the message at one point, but never seems to wonder if his exercises would have actually furthered Minard’s goals.

What is striking is that none of the reworkings and variations approach the elegance and effectiveness of the original. That is also true of Michael Friendly’s collection of Re-Visions of Minard, which are all terrible (that is not Friendly’s fault, to be clear, he only collects them). Some of Kraak’s exercises also provide context that is irrelevant or misleading, like putting the diagram on top of a contemporary map (too much information), a recent satellite photo (irrelevant), or a recent map (misleading). In many cases, there is simply too much information, like lots of troop movements of the different corps of Napoleon’s army, which don’t help illuminate the story without a lot of additional context (i.e., why did they occur?).

Kraak is simply too focused on the mapmaking part, and doesn’t consider the importance of making a point (though he acknowledges Minard’s intent). That is understandable, since it’s the typical approach in visualization and very common in cartography to go after analysis. But that doesn’t seem to be what Minard was trying to do. This is also clear in his final reflections on whether the map really is the best one ever, which get tangled up in the discussion of other techniques and additional data, when it’s really a matter of expressing a point through data.


While it is easy to criticize each individual approach to depicting the data in a different way, they all illustrate what makes that particular graphic such a classic. The book provides an appreciation of the graphic and a world of context around it. Sometimes, it’s important to see many bad variations to understand why a particular original is so good.

All of that does not take away from the amazing amount of context that Kraak adds to the discussion, and alternatives that he points out. The book also works as an introduction to cartography and visualization, based on Bertin’s principles, when it builds up maps from layers and discusses different encodings for the data. It even goes beyond the usual approach by showing bad examples and why they are not successful.

That makes this book successful not just as a description of one of the most famous graphics, but also useful as a learning tool. Now somebody needs to write a similar book about Nightingale’s chart.

Menno-Jan Kraak, Mapping Time: Illustrated by Minard’s Map of Napoleon’s Russian Campaign of 1812. Esri Press, 2014.

When Bars Point Down

Mon, 06/16/2014 - 03:35



It’s so simple it feels entirely trivial: bars in a bar chart pointing down instead of up. But the effect can be striking. And it’s not as obvious when to show downward-pointing bars as it might seem. The pure visualization point of view is that bars point up for positive numbers and down for negative ones (or right and left, respectively, for horizontal bar charts).

But there is more to it than that when we think about what the graphic conveys beyond just the numbers. We are used to bars pointing up; that’s what we usually see. Downward-pointing bars are unusual and surprising – in particular, when all or most of them point down, it feels odd and you sense that something’s wrong. There are also some interesting metaphorical implications.

Simon Scarr, Death Toll Chart

The first time I saw this effect was at Malofiej 2012, where one of the entries was Simon Scarr’s Iraq’s Bloody Toll, which he had created for the South China Morning Post. I was shocked. It was such a simple effect, and incredibly striking. Seeing it in print, on a full newspaper page, made it even stronger. The graphic ended up winning a silver medal, and I’m really sorry that it didn’t make gold (I was on the jury and argued for gold, but didn’t fight hard enough for it).

What makes this so special? It’s the striking effect of the bars, which look like blood running down the page. It sends a clear message about what Scarr wants you to think. Those bars aren’t pointing down because they depict a negative number (while you could argue they do, nobody else depicts casualties like that), they point down because of a very deliberate and unusual design choice.

Obama Unemployment Chart

An older example uses the same idea to great effect: The Obama Administration Job Loss chart, also known as The Bikini Chart. It’s not quite so striking, but I like how it implicitly tells you how to read it.

The bars here represent the number of jobs lost in the U.S. during the recession, and who was in power. The message is clear: the rate of job loss was increasing under Bush, and decreased under Obama. Or, if you’re not clear what the chart even represents: things were getting worse under Bush and are now getting better under Obama.

More Bars Pointing Down

The Economist has created a number of charts with downward-pointing bars. Some of them are more obvious, like this one on the severity of financial crises.

But with others, the choice is much more deliberate and unusual. This stark depiction of the number of executions uses the same metaphor of bars pointing down for positive numbers (of deaths), and was undoubtedly inspired by Scarr. In addition to the direction, the vertical scale was also clearly chosen to make a point.

Florida Gun Deaths

When looking at such a chart idea, it is not only instructive to look at examples that do it well, but also at ones where it fails. A chart made the rounds in early April that used the same idea for gun deaths in Florida. It was widely derided as bad, misleading, and politically motivated (see the great summaries of the controversy by Andy Kirk and Alberto Cairo). While I agree that it didn’t succeed, the amount of hate poured on it (and poor Christine Chan) was unwarranted.

The problem with this chart is that it appears that the data is shown in the white part, when it is actually the red area (measured from the top down). Most readers will only realize that when they look at the scale on the left, or try to square the chart with the article it illustrates. What the chart seems to show is that the number of deaths fell dramatically after the Stand Your Ground law was enacted in Florida, when the opposite was the case.

What made this chart fail? Part of it is that it’s an area chart. The difference between a line and a bar is how each encodes information: the bar encodes it in its length, which is anchored on a baseline. It’s therefore quite easy to see where it originates. The same isn’t true for a line chart, which encodes the data in position. There is no clear indication of what the position is relative to, and in practically all cases it’s from the bottom of the chart up.

The other part is visual design. The line at the bottom appears like an axis, and makes it seem obvious to read the white area as the foreground. The darker red area easily turns into the chart background. That is helped by the annotation, which sits on top of what is the chart area, rather than in the white background. Just moving the annotation and changing that bottom line would likely have made it much easier to read the chart correctly. A more heavy-handed way would have been to add arrows pointing down at the beginning and end of the time axis.

Visualization Beyond Showing Numbers

Showing data isn’t very difficult, but there are many clever and subtle ideas that can change the message and the way a chart is read. What do you want to stress? What do you want people to take away? What is your intent in showing these numbers?

The simple decision to have bars point down instead of up draws attention and communicates a message beyond the pure numbers: something is wrong. It’s amazing how loudly and clearly such a simple change can speak.