
Putting Data Into Context

Thu, 07/24/2014 - 04:59



Raw numbers are easy to report and analyze, but without the proper context, they can be misleading. Is the effect you’re seeing real, or a simple result of the underlying, obvious distribution? Too many analyses and news stories end up reporting things we already know.

This is a particular issue with data that has a spatial component. When the data is shown on the inevitable map, you often just see a distribution of people. Where there are more people, there are more tweets, there’s more crime, there are more customers, there are more coffee shops, etc. Many maps in fact show nothing but the underlying distribution of people. As usual, xkcd has captured the issue beautifully.

Crime on New York City Subways

An example of this is a map recently published by the New York Daily News showing the amount of crime in subway stations in New York City. Each bubble on the map shows one station, with the size of the bubble representing the number of crimes. Mousing over the bubbles soon reveals that many of the higher-crime stations are the ones with more than one line passing through (and thus likely more people).

While it is possible to switch the map to a view that shows the number of incidents per 100,000 people passing through the station, the story leads with the raw numbers. The data was actually collected over a period of five years (July 2008 to June 2013), which inflates the totals. Reporting crimes per year seems like a more honest way of looking at these numbers.

The station with the most crime (Times Square/Port Authority) had 1791 incidents during that time, which is less than one per day over five years. Incidentally, that is also New York’s busiest station, with more than 166,000 people passing through every single day. So that seemingly scary number of crimes comes out to one for every 170,000 people, which puts it in the safer half of stations (rank 219 out of 420 in the dataset).
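The arithmetic behind those figures can be checked with a short sketch. It uses only the numbers quoted above; the exact start and end dates (July 1, 2008 and June 30, 2013) are an assumption, since the article only gives months, and the variable names are made up for illustration.

```python
from datetime import date

# Figures quoted in the text for Times Square/Port Authority.
incidents = 1791
daily_riders = 166_000  # the text says "more than 166,000", so rates are slight overestimates

# Assumed exact bounds for "July 2008 to June 2013" (inclusive of both end days).
days = (date(2013, 6, 30) - date(2008, 7, 1)).days + 1

incidents_per_day = incidents / days
rider_passages = daily_riders * days           # total passages over the whole period
passages_per_incident = rider_passages / incidents
incidents_per_100k = incidents / rider_passages * 100_000

print(f"{days} days, {incidents_per_day:.2f} incidents per day")
print(f"one incident per {passages_per_incident:,.0f} rider passages")
print(f"{incidents_per_100k:.2f} incidents per 100,000 rider passages")
```

Dividing by total passages rather than by the daily rider count alone is what turns the five-year total into a number that roughly reflects an individual rider's exposure.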

The number of people passing through each station is actually included in a little visualization in that article, though it’s not shown directly. Since they use Tableau Public for that, I was able to tease it out and create a pair of maps comparing the total number and the per–100,000 numbers side by side. Click on a bubble in one of the maps below to highlight it in both. There is also a tooltip listing the numbers.

The difference between the total and the per–100,000 number is important not just because it changes the picture. It’s the much more interesting number. The total number of crimes means nothing for the individual. What is much more relevant is how many crimes happen per person, because that provides a much closer estimate of each individual’s risk of becoming a victim. Large total numbers make for great headlines (Over 191,000 crimes total!), but they’re largely meaningless otherwise.

The per–100,000 number varies widely even within the top few stations based on total number of incidents. Number 2, 125th St, has over six times the rate per 100,000, and number 5, Broadway Junction, has almost ten times the rate (but less than half the total number) compared to Times Square. The distribution of crime per 100,000 on the map is also clearly different, with the Bronx, Brooklyn, and the Rockaways more dangerous than Manhattan.

There is also an interesting outlier, Broad Channel, which is discussed in the article. It is less of an outlier than it appears here, though, because the number of daily riders per station in the dataset counts people going through turnstiles, which does not include people transferring between lines. Those make up a large portion of the people in some of the stations, like Broad Channel.

So even this first step of putting the data into context is not complete, but would benefit from better data about the actual number of people passing through each station per day. And there is clearly a lot more data that could be collected here: time of day, how busy each station is during those times, what kinds of crimes occur during what times of day, etc.

Comparing Cities

Another example is this New York Times story from over a year ago, which still gets my blood boiling when I read it. It compares New York City to Seattle in terms of their “geek appeal,” and talks about data science, among other things. The irony is completely lost on the author, though, who uses absolute numbers to compare two cities whose populations are vastly different.

Did you know that Seattle had only 139 Starbucks in 2013, but New York had 271? Or that there were a puny 85,000 IT workers in Seattle, compared to 168,400 in NYC? What a provincial little town that Seattle place is!

But when you account for the fact that New York is almost nine times as large, things look a bit different. Seattle has the highest density of Starbucks per capita in the world (second highest? Las Vegas!). All those numbers where Seattle has half of what New York has really mean that on a per-capita basis, Seattle has roughly 4.5 times as much!

Even in terms of venture capital, things look pretty good when you consider the difference in population: Seattle’s $671 million ends up being more than 3.5 times New York’s $1.6 billion on a per-capita basis.
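The per-capita comparisons above follow from one simple normalization, sketched here. The raw figures are the ones quoted in the text; the population ratio of 9.0 is an assumed round value standing in for "almost nine times as large," not a census figure, and the function and variable names are hypothetical.

```python
# Assumed round figure for "almost nine times as large"; not a census value.
POPULATION_RATIO_NYC_TO_SEATTLE = 9.0

# (Seattle, New York) raw figures as quoted in the text.
figures = {
    "Starbucks stores": (139, 271),
    "IT workers": (85_000, 168_400),
    "venture capital (million $)": (671, 1_600),
}

def per_capita_ratio(seattle, new_york,
                     population_ratio=POPULATION_RATIO_NYC_TO_SEATTLE):
    """Return how many times larger Seattle's per-capita value is than New York's."""
    return (seattle / new_york) * population_ratio

ratios = {name: per_capita_ratio(s, ny) for name, (s, ny) in figures.items()}
for name, ratio in ratios.items():
    print(f"{name}: Seattle has {ratio:.1f}x New York's per-capita value")
```

Because the population ratio is only approximate, the results are best read as rough multiples rather than precise values.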

What to Compare

Comparing numbers requires understanding, and controlling for, context. Just throwing around raw numbers often leads to wrong conclusions, or shows patterns that are already known (New York is bigger than Seattle, more crime occurs where there are more people). Not accounting for those is not just a small lapse, it’s wrong.

The more complicated question is, what to compare to? Daily riders are a good first step when looking at the subway crime data, but there are caveats. The analysis would be much better with more, and more fine-grained, data. How should two cities be compared? Is population the right metric? What about GDP? How do we meaningfully compare bike lanes between New York and Seattle, for example? What can we use to make them comparable? Street miles? Number of bicycles? Number of bicycle miles ridden? Where do we get that data from?

But even when the comparison data is not perfect, some normalization is better than none at all. And it is important to understand the limitations and uncertainties of the analysis, even when perfect comparison data is not available.

Teaser image: Heatmap, by Randall Munroe/xkcd

Review: Kraak, Mapping Time

Thu, 06/26/2014 - 05:01



Can you write an entire book about a single chart? Even if that chart is supposedly the best one ever? Menno-Jan Kraak’s new book, Mapping Time: Illustrated by Minard’s Map of Napoleon’s Russian Campaign of 1812, discusses the historical context of Minard’s work and his life, and walks through a number of design exercises to show the same or similar data in different ways.

The graphic, which Tufte says “may well be the best statistical graphic ever drawn,” is also often referred to as Napoleon’s March. It depicts the size of Napoleon’s army during the Russian Campaign of 1812, which Napoleon began with over 420,000 men, and ended with only 10,000.

Personal Connection

Kraak has something of a personal connection to the graphic, because a brother of his great-great-grandfather, Gerrit Janz Kraak, fought in the Russian Campaign and died in the Battle of Berezina. That gives some structure to the book and provides a focus on that particular battle. That makes sense too, since it was the one where Napoleon lost half of his remaining men during a two-day battle.

The writing is generally clear and concise. The format of the book is of the coffee-table variety, which is a bit odd for its content. It does give Kraak the room for a very well-designed and laid-out book, though: text generally refers to images on the same or facing page, and few paragraphs break across pages.

There is a lot of additional historical context that goes beyond anything I have seen so far. That is true both for Minard’s background and earlier work, as well as the Russian Campaign. Kraak also clears up a few misunderstandings, including the losses at Berezina. Many accounts have his men fall through the ice, when the problem actually was that it wasn’t nearly cold enough for ice cover and bridges had to be built.

After the description of the graphic and its background, Kraak broadens the scope to talk about how to depict time and space in maps. That discussion always uses the data points from Minard’s chart as the example, and thus stays very well focused. Some of the techniques are more interesting and illuminating than others. In particular, I find the space-time cubes to be largely ineffective, and the same goes for cartograms. But some of the other maps help see some aspects of the data more clearly, like the speed of the army and the time spent in different places.

Presentation vs. Analysis

While I appreciate a self-confident writer, I find him to be a bit too certain that Minard would have found his analysis useful (p. 94). It’s certainly possible, but all signs point to Minard having a very specific message in mind, and all the added data and complexity would have taken away from that. Kraak acknowledges the message at one point, but never seems to wonder if his exercises would have actually furthered Minard’s goals.

What is striking is that none of the reworkings and variations approach the elegance and effectiveness of the original. That is also true of Michael Friendly’s collection of Re-Visions of Minard, which are all terrible (that is not Friendly’s fault, to be clear, he only collects them). Some of Kraak’s exercises also provide context that is irrelevant or misleading, like putting the diagram on top of a contemporary map (too much information), a recent satellite photo (irrelevant), or a recent map (misleading). In many cases, there is simply too much information, like lots of troop movements of the different corps of Napoleon’s army, which don’t help illuminate the story without a lot of additional context (i.e., why did they occur?).

Kraak is simply too focused on the mapmaking part, and doesn’t consider the importance of making a point (though he acknowledges Minard’s intent). That is understandable, since it’s the typical approach in visualization and very common in cartography to go after analysis. But that doesn’t seem to be what Minard was trying to do. This is also clear in his final reflections on whether the map really is the best one ever, which get tangled up in the discussion of other techniques and additional data, when it’s really a matter of expressing a point through data.


While it is easy to criticize each individual approach to depicting the data in a different way, they all illustrate what makes that particular graphic such a classic. The book provides an appreciation of the graphic and a world of context around it. Sometimes, it’s important to see many bad variations to understand why a particular original is so good.

All of that does not take away from the amazing amount of context that Kraak adds to the discussion, and alternatives that he points out. The book also works as an introduction to cartography and visualization, based on Bertin’s principles, when it builds up maps from layers and discusses different encodings for the data. It even goes beyond the usual approach by showing bad examples and why they are not successful.

That makes this book successful not just as a description of one of the most famous graphics, but also useful as a learning tool. Now somebody needs to write a similar book about Nightingale’s chart.

Menno-Jan Kraak, Mapping Time: Illustrated by Minard’s Map of Napoleon’s Russian Campaign of 1812. Esri Press, 2014.

When Bars Point Down

Mon, 06/16/2014 - 03:35



It’s so simple it feels entirely trivial: bars in a bar chart pointing down instead of up. But the effect can be striking. And it’s not as obvious when to show downward-pointing bars as it might seem. The pure visualization point of view is that bars point up for positive numbers and down for negative ones (or right and left, respectively, for horizontal bar charts).

But there is more to it than that when we think about what the graphic conveys beyond just the numbers. We are used to bars pointing up, that’s what we usually see. Downward-pointing bars are unusual and surprising – in particular, when all or most of them point down, it feels odd and you sense that something’s wrong. There are also some interesting metaphorical implications.

Simon Scarr, Death Toll Chart

The first time I saw this effect was at Malofiej 2012, where one of the entries was Simon Scarr’s Iraq’s Bloody Toll, which he had created for the South China Morning Post. I was shocked. It was such a simple effect, and incredibly striking. Seeing it in print, on a full newspaper page, made it even stronger. The graphic ended up winning a silver medal, and I’m really sorry that it didn’t make gold (I was on the jury and argued for gold, but didn’t fight hard enough for it).

What makes this so special? It’s the striking effect of the bars, which look like blood running down the page. It sends a clear message about what Scarr wants you to think. Those bars aren’t pointing down because they depict a negative number (while you could argue they do, nobody else depicts casualties like that); they point down because of a very deliberate and unusual design choice.

Obama Unemployment Chart

An older example uses the same idea to great effect: The Obama Administration Job Loss chart, also known as The Bikini Chart. It’s not quite so striking, but I like how it implicitly tells you how to read it.

The bars here represent the number of jobs lost in the U.S. during the recession, and who was in power. The message is clear: the rate of job loss was increasing under Bush, and decreased under Obama. Or, if you’re not clear what the chart even represents: things were getting worse under Bush and are now getting better under Obama.

More Bars Pointing Down

The Economist has created a number of charts with downward-pointing bars. Some of them are more obvious, like this one on the severity of financial crises.

But with others, the choice is much more deliberate and unusual. This stark depiction of the number of executions uses the same metaphor of bars pointing down for positive numbers (of deaths), and was undoubtedly inspired by Scarr. In addition to the direction, the vertical scale was also clearly chosen to make a point.

Florida Gun Deaths

When looking at such a chart idea, it is not only instructive to look at examples that do it well, but also at ones where it fails. A chart made the rounds in early April that used the same idea for gun deaths in Florida. It was widely derided as bad, misleading, and politically motivated (see the great summaries of the controversy by Andy Kirk and Alberto Cairo). While I agree that it didn’t succeed, the amount of hate poured on it (and poor Christine Chan) was unwarranted.

The problem with this chart is that it appears that the data is shown in the white part, when it is actually the red area (measured from the top down). Most readers will only realize that when they look at the scale on the left, or try to square the chart with the article it illustrates. What the chart seems to show is that the number of deaths fell dramatically after the Stand Your Ground law was enacted in Florida, when the opposite was the case.

What made this chart fail? Part of it is that it’s an area chart. The difference between a line and a bar is how each encodes information: the bar encodes it in its length, which is anchored on a baseline. It’s therefore quite easy to see where it originates. The same isn’t true for a line chart, which encodes the data in position. There is no clear indication of what the position is relative to, and in practically all cases it’s from the bottom of the chart up.

The other part is visual design. The line at the bottom appears like an axis, and makes it seem obvious to read the white area as the foreground. The darker red area easily turns into the chart background. That is helped by the annotation, which sits on top of what is the chart area, rather than in the white background. Just moving the annotation and changing that bottom line would likely have made it much easier to read the chart correctly. A more heavy-handed way would have been to add arrows pointing down at the beginning and end of the time axis.

Visualization Beyond Showing Numbers

Showing data isn’t very difficult, but there are many clever and subtle ideas that can change the message and the way a chart is read. What do you want to stress? What do you want people to take away? What is your intent in showing these numbers?

The simple decision to have bars point down instead of up draws attention and communicates a message beyond the pure numbers: something is wrong. It’s amazing how loudly and clearly such a simple change can speak.

Data Stories Episode About Data Storytelling

Thu, 04/17/2014 - 04:35



How is it possible that it has taken a podcast called Data Stories 35 episodes to get to the topic of data storytelling? Alberto Cairo and I helped get the topic straightened out, and I think we even convinced Moritz that stories are not the enemy of exploration. It was a fun episode to record, and it touches on many interesting topics.

It all started with Moritz trolling the Tapestry conference hashtag.

#tapestryconf Honest question – any attempts to define what storytelling is in context of vis, or are we still dodging that q?

— Moritz Stefaner (@moritz_stefaner) February 26, 2014

Then he wrote a blog posting arguing that we should build worlds rather than tell stories. At the same time, he and Enrico were already talking to Alberto and me about doing an episode of the podcast about the topic.

Episode 35 of the podcast, featuring Alberto Cairo and myself as guests, was published today. Alberto has written up an entire posting about his thoughts (and stolen my amazing screenshot). I also wrote my two recent storytelling posts (worlds and stories, definition of story) in preparation for this podcast.

The resulting discussion touched on a number of topics, and we covered a lot of ground. There is certainly more to be said, but this is a great starting point. We certainly had lots of fun recording it, and I think it will be interesting to listen to.

Some may be wondering why I’m on the podcast again, since I’ve been a guest a few times before. I won’t name names here, though.

@albertocairo @eagereyes @datastories @filwd @moritz_stefaner What, is Robert co-host now w/ how many times he's been on.

— T.J. Jankun-Kelly (@dr_tj) April 7, 2014

What can I say? First, Andy Kirk has been on the podcast more often than I have, so there. Also, if those people who complain blogged more, perhaps they would be asked to be on podcasts, too! Just sayin’.

Now go and listen to Data Stories about data storytelling.

Review: Manuel Lima, The Book of Trees

Mon, 04/14/2014 - 03:57



Trees. They’re everywhere. And not just in the physical world, but in data visualization and knowledge representation as well. This is not a new phenomenon, it goes back thousands of years. Manuel Lima’s new book, The Book of Trees, gives an overview.

Setting Expectations

This review is an example of priming. The first time I learned of the book was when Ben Shneiderman mentioned it to me as we talked at IEEE VIS in Atlanta last year. In our conversation, he referred to it as “a coffee-table book.” I don’t think he did this on purpose, but that did set my expectations.

There are many similarities between The Book of Trees and Lima’s previous book, Visual Complexity, which I reviewed for Science two years ago. The major difference is that Lima doesn’t attempt the same kind of taxonomy he did in Visual Complexity, which ended up being mostly disappointing. There are also no over-the-top endorsements on the back of the book that promise way too much. The result is a book that feels more coherent and complete.

Beyond The Coffee Table

Having been primed to think of it as a coffee-table book, I did not expect a deep theoretical treatment, but lots of pictures. And that is what I got. In addition, the book has a very nice introduction that describes the importance of trees throughout all cultures and religions, both in terms of their physical uses and as metaphors for knowledge, life, etc.

There is also a short chapter titled Timeline of Significant Characters, which consists of 12 short bios, starting with Aristotle and ending with Ben Shneiderman (Ben also wrote the foreword, and the book includes many examples of treemaps). It seems a bit misplaced early in the book, and might have made more sense as an appendix.

In the introduction, Lima argues that we need to look at a much longer history of visual representations than just information visualization (and “not be overly infatuated by the work created in the last decade alone”). I agree with that. However, a clearer line could have been drawn between actual data visualizations and trees that depict ideas of structure (like Darwin’s illustration for On the Origin of Species, which did not describe the evolutionary history of any particular species, but the general idea of evolution).


In addition to the introduction, there are eleven chapters talking about different kinds of tree diagrams:

  • Figurative Trees. These are the most tree-like in the way they are drawn, and many of the oldest examples are in this chapter. This is also the longest chapter.
  • Vertical Trees. Upside-down trees, the way they are commonly drawn in computer science. It turns out that there is quite a bit of precedent for these, going back many hundreds of years.
  • Horizontal Trees. All but one of these are drawn left-to-right, and there are a few that grow in both directions. This chapter also includes, at the very end, tree-browser concepts similar to the Mac Finder and Windows Explorer. Lima credits these to himself, which seems an odd choice.
  • Multidirectional Trees. Trees drawn in different directions are included here. The most obvious examples are the result of force-directed layouts, but there are also historical examples and more modern hand-drawn ones (like Stefanie Posavec’s Writing Without Words).
  • Radial Trees. Trees laid out on concentric circles are a common idea in visualization for a variety of reasons. This chapter seems like a bit more of a mish-mash, because the layout within the circles can be very different, affording different ways of reading, interaction, etc.
  • Hyperbolic Trees. Giving these their own chapter is an interesting choice, because they are really a subset of multidirectional trees. This is a nod to interaction, which is otherwise missing in the book. It’s a short chapter, since hyperbolic trees never really took off (partly because they were patented), and never really proved to be all that useful.
  • Rectangular Treemaps. This is the first of three treemaps chapters. It starts with some historical precedents, though I doubt anybody would have recognized them as part of one class before Shneiderman’s paper. Then it’s treemaps: the original slice-and-dice treemap, squarified treemaps, the Map of the Market, cushion treemaps, and many examples of using treemaps for different kinds of data.
  • Voronoi Treemaps. It was a surprise to see a whole chapter on this niche treemap type, but they are of course very attractive. Surprising is also the number of examples Lima has managed to dig up.
  • Circular Treemaps. Calling these treemaps is at least a stretch, since they are not actually space-filling. In the introduction to the chapter, Lima first refers to them as space-filling, but then complains about their waste of space. I’d rather have seen a chapter on interaction than these mostly useless visualizations.
  • Sunbursts. Speaking of useless: sunburst diagrams are one of those neat ideas that don’t really work out in practice. The examples are all weak, in particular the 3D Sunburst.
  • Icicle Trees. Icicle trees are clearly more useful than sunbursts, since they are easier to label and navigate than the circular sunbursts. The latter are also arguably just icicles laid out in a circle. It’s kind of difficult to compete with the treemap, so these last two chapters feel a bit forlorn.

Each chapter has a little diagram showing how the type of visualization is constructed for a tree with one, two, and three levels. This is surprisingly effective, and similar to some of the illustrations in Isabel Meirelles’ book.

What’s Missing

Is this a visualization book? Not really. It doesn’t go into any detail on the actual techniques, doesn’t compare them, and more than half of its pages are devoted to tree diagrams that aren’t useful for visualizing data today.

It also entirely ignores interaction. The only time Lima talks about it is in the introduction to the Hyperbolic Trees chapter, where he says that these don’t appear much in print because they are useful when there is interaction (and only then, I might add), and are thus confined to “their natural digital domain.”

It’s too bad Lima didn’t venture a bit further into that domain to illustrate more of the really interesting interactive tree visualization tools. Tamara Munzner’s TreeJuxtaposer is never mentioned, and neither is the SpaceTree, etc. There are many other examples of work that is missing, and I don’t think that Lima was going for completeness here. But ignoring interaction entirely seems like a big gap, even if it doesn’t lend itself that well to a printed book.


The book provides plenty of good material. Lima has unearthed many examples that most people likely have never seen before, both ancient and relatively recent. His introduction has also given me a new appreciation of trees as a structural metaphor. I hope that somebody will use all the examples Lima has collected for both of his books to develop a deeper understanding of the design space, beyond a list of examples.

The book succeeds as a coffee-table book, and I mean that in the best possible way. It provides a beautiful, visual overview of a large and important part of our cultural and intellectual heritage, and thus is a fantastic resource to draw inspiration from. The visualization examples are not complete, but there are many lesser-known ones that can be great starting points when researching tree visualization work – or when simply wanting to understand the history and context of tree metaphors when depicting information.

Manuel Lima, The Book of Trees. Princeton Architectural Press, 2014.

The publisher sent me a free copy of the book for this review.