
The VIS Sports Authority

Sun, 10/19/2014 - 23:35



When you think of a conference, does sitting around a lot come to mind? Lots of food? Bad coffee? No time to work out? For the first time in VIS history, there will be a way to exercise your body, not just your mind. The VIS Sports Authority, which is totally an official thing that I didn’t just make up, will kick your ass at VIS 2014.

There will be two disciplines: cycling and running. Jason Dykes is running the cycling team, and I will be driving the runners.

Le Tour de VIS

Jason is way more organized than I am, having put together not just a real website with a logo, but actually ordered bike jerseys. Cycling has somewhat more complicated logistics though, so that is certainly a good thing. I hear Jason has even picked out the soundtrack for the race already.

The Vélo Club de VIS will embark on Le Tour de VIS (this is apparently named after some sort of bike race) on the Saturday after the conference, November 15.

Go to one of the pages linked above to get more information, like a map of the planned route, and to sign up.

VIS Runners

The running will be a bit more low-key. I couldn’t think of a better name than VIS Runners, so let’s just run with that (unless you want me to call us Eager Runners).

However, running will not happen after the conference, but during. Since the receptions and parties are in the evenings, it makes the most sense to go out in the mornings. My current plan is to meet at the conference hotel at about 6:30am, then run for about an hour, so we’ll be back by 7:30.

For the distance, I’m thinking no more than 6 miles/10 kilometers, but that can be adjusted. We probably won’t do more than three runs, and in particular will likely skip Thursday (after the reception Wednesday night).

The course should be different every day to get some variety, and will depend on the distance people want to go. If you’re a local or just know your way around Paris, I’d appreciate your input in the route planning, too!

I’m embedding a form below (also available here) to collect some information about when and how far people want to go, and to get people’s names so I can follow up later.


Large Multiples

Mon, 10/13/2014 - 03:43



Getting a sense of scale can be difficult, and the usual chart types like bars and lines don’t help. Showing scale requires a different approach, one that makes the multiplier directly visible.


In the U.S., CEOs on average make 354 times as much as workers, according to this recent posting on the Washington Post’s Wonkblog. That is an astounding number. Put differently, a CEO makes in one day almost as much as the worker makes in an entire year. How do we show this enormous difference?

Roberto A. Ferdman at Wonkblog shows the numbers as a bar chart.

The bars compare between countries, but I was interested in the comparison between the worker and the CEO. Just how much more is 354 times more? This chart doesn’t tell me that.


An article on Quartz from late last year looks at similar data, and translates it into how many months workers at different companies would have to work to make the same as the CEO does in one hour. The disparities in these examples are even more staggering, since while the Wonkblog chart above looked at averages, Quartz used specific – extreme – examples. For example, McDonald’s CEO makes 1120 times what a McDonald’s worker makes. This is shown as a sort of calendar that has months marked for how long the worker needs to work to make that much.

While that illustrates the time, it kind of misses the point. Showing days when the comparison is hours understates the true magnitude by a factor of eight (assuming an eight-hour work day). Why not show the same units?
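
To make the unit mismatch concrete, here is a minimal sketch of the arithmetic. It uses the 1120 ratio and the eight-hour work day quoted above; the function name and the returned dictionary are made up for illustration and are not taken from Quartz or from the chart described below.

```python
HOURS_PER_WORKDAY = 8  # the eight-hour work day assumed in the text


def worker_time_for_one_ceo_hour(pay_ratio: float) -> dict:
    """Time a worker needs to earn what the CEO earns in one hour."""
    hours = float(pay_ratio)               # same unit as the CEO's time: hours
    workdays = hours / HOURS_PER_WORKDAY   # the same span expressed in work days
    return {"hours": hours, "workdays": workdays}


mcdonalds = worker_time_for_one_ceo_hour(1120)  # ratio quoted from Quartz
print(mcdonalds)
# The factor that disappears when worker days are shown against a CEO hour:
print(mcdonalds["hours"] / mcdonalds["workdays"])
```

Keeping both sides in hours preserves the full multiplier, while switching the worker's side to days divides it by the length of the work day.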

Large Multiples

The idea of showing the number of days is good, however, and I wanted to apply it to the Wonkblog numbers. So I built a little unit or multiples chart for this purpose.

I originally had included a bar chart as well as the unit chart, but based on Twitter feedback, decided to remove it. This focuses the chart on its main message, even if it makes comparing between countries more difficult. That comparison is not really all that interesting anyway, but rather the enormous disparity in and of itself.

While I was building an interactive chart, I added a bit of animation. The bubbles building up are meant to make the number a bit more tangible by also translating it into time: the larger the number, the longer you have to wait to get the full value. This makes you feel the difference a bit more than a simple chart. I stole this idea from the UK Office for National Statistics Neighbourhood Quiz.

Click the image below to go to the interactive version of the chart. Let me know what you think!

Eight Years of eagereyes

Thu, 10/02/2014 - 05:20



What is the purpose of blogging about visualization? Is it to make fun of the bad stuff? Is it to point to pretty things? Is it to explain why things are good or bad? Is it to expand the landscape of ideas and break new ground? Or is it to discuss matters at great length that ultimately don’t matter all that much?

I criticize things, and I think it’s important to do that. I don’t regret any of my postings, however strong they may have been, and however mean they may have sounded. It was all done in good faith and with the intent to point out issues and get people to pay attention.

But increasingly, I’m questioning the thinking that some of that criticism is coming from. I’m not arguing against any particular issue people like to bring up, but I am starting to wonder how much of it is simply coming out of narrow-mindedness and stubbornness. How much of it would be obviated by sitting back, taking a deep breath, and trying to see things from a different angle?

This is not just a question of tone and intensity, but one that goes much deeper: how much do we really know? When you start to ask that question in visualization, it becomes clear very quickly how shockingly little we actually really understand. Going on and on about pie charts? Point to a paper that’s actually showing that they’re bad! Yes, such a paper exists. But how many studies have shown the same thing? Not that many. And it gets much worse for things like 3D bar charts, etc. There is very little support for the religious zealotry with which we like to damn these things.

Then there is the question of different goals. There isn’t just one use for visualization, and things created for different purposes need to be judged against different standards. It’s all about trade-offs and making decisions. An audience of readers on the web is going to need a different approach than an audience of experts who know the data really well and have a vested interest in digging deeper. An interactive piece on a news media website will need to be much more compelling than a corporate dashboard if anybody’s going to actually bother doing something with it. There is not just one purpose, or one audience, or one way to do things.

It’s encouraging to see the huge interest in visualization. And it’s even more encouraging to see some of the recent and upcoming work on rhetoric, persuasion, and related questions. Because it matters. Communication matters. Data matters. Visualization matters.

Discussing visualization needs to matter too. But it can only do so if it comes from a place of understanding, respect, and an open mind.

Beyond the Knee-Jerk Reaction

Tue, 09/16/2014 - 03:12



There is a tendency to just reflexively make fun of certain types of charts, in particular pie charts and 3D charts. While that is often justified, there are also exceptions. Not all pie charts are bad, and not all 3D charts are terrible. But to spot those outliers, we have to suppress the knee-jerk reflex and give them a moment of thought before ripping them apart.

The Chart

About two weeks ago, I posted this chart on Twitter after seeing it in the Wired iPad app (September 2014 issue).

Yes, it is a 3D area chart. The vertical axis is the average salary paid in a number of sectors over time. The one horizontal axis on the left is the time axis, showing 30 years from 1983 to 2013. The other horizontal axis divides the chart up into four elements for four sectors: technology, white-collar, manufacturing, and sales. That axis has a second encoding in the width of the “mountains,” which represent the fraction of the workforce in each of those sectors.

The Good

So there’s a lot of data here. You can see that the tech sector pays a lot more than the others, roughly twice what sales pays, and a good 50% more than manufacturing or white-collar jobs. You can also see the effect of the recession in the ripples along the tops of the mountains, with an interesting lag between white-collar/manufacturing and sales.

I also have to admit to being quite surprised to see how small the tech sector really is: only 7% of the workforce, up from 4% 30 years ago. It’s sometimes hard to remember that there’s a world beyond technology when you’re working at a software company and spend your days on Twitter. White-collar jobs have grown to roughly make up for the loss in manufacturing, but not quite, while the percentage of people in sales has not changed.

Not all of that comes from the chart; it certainly requires some reading of the numbers, in particular for the width of the mountains. But the information is there and it’s not hard to read. The reason for posting this was my surprise at finding myself spending several minutes with this chart and finding it quite informative and fun to explore. There is a bit of interaction too, when you tap on the plus signs, but those don’t give you much additional information.

The Bad

What is wrong with this chart? Sure, it’s 3D. You can’t precisely read the numbers. What was the average salary for manufacturing jobs in 1992? You can’t read that with any sort of precision. 3D is wasteful, you could show more data in that space. But who cares? That is not the point of this chart. You can see the development over time, that’s what matters. And the chart does not seem to wildly distort the reading of those values that are readable (which is a common issue with 3D charts).

I also think that this is a good way to present what are basically eight time series (salaries and workforce percentages for four sectors) in a very concise way that works well in a static image. Of course this could be broken up into two or even three charts, but you would lose some of the cohesion the 3D gives you here. And it would be a lot less fun to explore. The lines for workforce percentage would also look extremely boring (they seem to be changing at a fairly constant pace, and certainly don’t change direction). If you care not just about representing the data but also capturing readers’ interest, this is the better chart. It certainly worked in my case.

A Smarter Discussion

But beyond all those reasons, I just want to see a more nuanced and informed discussion of these things. It doesn’t take much intelligence to sneer at every 3D chart and every pie chart that floats by on Twitter. But things are a bit more complicated than that, and these things do have their place. And just throwing some supposed absolute rules around doesn’t do anybody any favors.

Perhaps Christopher Ingraham was right.

@eagereyes Twitter is only for saying mean things about charts, Robert

— Christopher Ingraham (@_cingraham) September 6, 2014

But I hope that we can get to a point where we can have a more intelligent, nuanced, and respectful discussion. We’re not going to make much progress if we just keep rehashing the same old ideas without putting any new thoughts into them.

The Semantics of the Y Axis

Mon, 09/08/2014 - 03:53



The vertical axis is not just important because it embodies one of the most important visual properties, but also because it is much more semantically loaded than the horizontal. Not only does the right choice of mapping help with reading a chart, it can also confuse people if done wrong.

It’s not a coincidence that the vertical is so important for us. An animal that is lying on the ground is dead or sleeping, and that’s important to know. Vertical movement is also much harder than moving in the other two dimensions, and fast vertical movements can kill us. That is why we overestimate heights: better to be scared of a jump that isn’t all that dangerous than to take it lightly and get injured or killed.

We also have some very strong ideas about the vertical direction. Things moving up are generally good, things moving down less so. Being up (standing, walking, moving) is good, being down (lying, sick, dead) is not. We derive many of our metaphors from this fundamental difference too: being down meaning being sad, things looking up or moving up meaning they are good or getting better. Up also means more: more things being stacked or heaped up means more vertical space being used, and more is usually better, so more is up.

Jawbone UP’s Sleep Tracking

Jawbone wrote a blog posting about when people slept during the soccer world cup according to the data they were gathering from users of their activity tracker armband. The tracker is called UP, which causes some interesting issues parsing the axis labels in these charts.

Parsing “% of UP wearers asleep” has you going back and forth between two interpretations: UP meaning people being up/awake, but then you read “asleep.” The number is encoded on the vertical axis as more people meaning the line going higher. So more up meaning more people asleep, fewer people being up. I remember some confused tweets from people struggling with this when this made the rounds.

Jawbone also seems to have noticed, since in their recent posting on the Napa earthquake, they flipped the axis to make the semantic connection easier to follow. Now it’s “% UP wearers awake,” which makes a lot more sense. More up, more people are awake or, well, up.

The archetype of these visualizations, the New York Times’ How Different Groups Spend Their Day, also works like this: the bottom-most layer, and thus the baseline of sorts, is sleep. As it should be.

Which Quadrant

This chart of men’s vs. women’s earnings that I wrote about recently also uses the vertical axis in a simple, yet smart way. It has men’s earnings on the horizontal axis, and women’s on the vertical. That is the only way this makes sense, even if technically the other way around would be just as correct.

The difference is the message the majority of the points send. If women’s earnings were on the horizontal axis, those points would be in the upper left quadrant. Up is good, right? So where’s the problem? Placing them in the lower right makes this much more obvious to read. The lines representing women making 10%, 20%, and 30% less also would be quite strange if they were to the top left of the main diagonal.

Bar Charts

I already wrote about this topic in the specific case of bar charts, but it bears repeating. Bars pointing down are unusual, and they grab the viewer’s attention. They can help get a point across and help people read the chart more easily.

Larger numbers being up in line charts, bar charts, scatterplots, etc., may be the default in practically all visualization tools (and that makes sense), but it should not just be accepted without thinking about it. The vertical direction should be chosen with care, because it communicates a lot about how to read a chart. And getting it wrong can cause considerable confusion.

My Favorite Charts

Thu, 09/04/2014 - 04:38



There are many charts I hate, because they’re badly done, sloppy, meaningless, deceiving, ugly, or for any number of other reasons. But then there are the ones I keep coming back to because they’re just so clear, well-designed, and effective.

All of these are a few years old. Like a fine wine analogy that I could insert here, it probably takes a while for a chart to come up again and again in conversation and when looking for examples to realize how good it is.

Scatterplot


My favorite scatterplot, and perhaps my favorite chart ever, is Why Is Her Paycheck Smaller? by Hannah Fairfield and Graham Roberts. It shows men’s versus women’s weekly earnings, with men on the horizontal axis and women on the vertical. A heavy black diagonal line shows equal wages, three additional lines show where women make 10%, 20%, and 30% less. Any point to the bottom right of the line means that women make less money than men.

The diagonal lines are a stroke of genius (pun fully intended). When you see a line in a scatterplot, it’s usually a regression line that models the data; i.e., a line that follows the points. But such a line only helps reinforce the difficulty of judging the differences between the two axes, which is something we’re not good at, and which is not typically something you do in a scatterplot anyway.

But the diagonal line, as simple as it is, makes it not just possible, but effortless. It’s such a simple device and yet so clear and effective. All the points on the line indicate occupations where men and women make the same amount of money. To the top left of the line is the area where women make more money than men, and to the bottom right where women make less.

The additional lines show 10%, 20%, and 30% less for women. If it’s hard to tell if a point is lying on the main diagonal of a scatterplot, it becomes impossible to guess the percentage it is off. The additional lines make it possible to guesstimate that number to within a few percent. That is a remarkable level of precision, and it is achieved with three simple lines.
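
What the reference lines encode can be written down in a few lines. This is only a sketch of the geometry described above, not the code behind the New York Times chart: each line is women’s pay = (1 - gap) × men’s pay, and a point is read by finding the lowest line it has reached. The function names and the sample earnings are invented for illustration.

```python
# Gaps (in percent) marked by the three extra lines described in the text.
REFERENCE_GAPS_PERCENT = (10, 20, 30)


def pay_gap_percent(men: float, women: float) -> float:
    """Percent by which women's earnings fall short of men's (negative if women earn more)."""
    return 100 * (men - women) / men


def describe_point(men: float, women: float) -> str:
    """Place a point (men on the horizontal axis, women on the vertical) relative to the lines."""
    gap = pay_gap_percent(men, women)
    if gap == 0:
        return "on the equal-pay diagonal"
    if gap < 0:
        return "top left of the diagonal: women earn more"
    reached = [g for g in REFERENCE_GAPS_PERCENT if gap >= g]
    if not reached:
        return "bottom right of the diagonal: women earn less, gap under 10%"
    return f"bottom right of the diagonal: at or below the {max(reached)}% line"


# Hypothetical weekly earnings, purely for illustration.
for men, women in [(1000, 1000), (1000, 1100), (1000, 950), (1000, 750)]:
    print(men, women, "->", describe_point(men, women))
```

A reader does the same thing visually: instead of estimating a ratio from two axis positions, they only have to see which pair of lines a point sits between.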

There is some interactivity: mousing over points brings up a tooltip that shows the occupation the point represents and how much more one gender makes than the other. Filters in the top left corner let you focus on just a small number of occupations, which include annotations for a few select jobs.

But the key element is the inclusion of the reference lines that help people make sense of the scatterplot and read it with a high level of precision. Simple but effective, and powerful.

Line Chart

My favorite line chart is The Jobless Rate for People Like You by Shan Carter, Amanda Cox, and Kevin Quealy. This chart is somewhat ancient, having been created in Flash and showing unemployment data from January 2007 to September 2009. But its brilliant design and interaction make it timeless.

It’s a line chart, but with a twist. The first thing you see is the heavy blue line, showing the overall unemployment rate. But there are more lines in the background, what are those? So you mouse over and they respond: they light up and there’s a tooltip telling you what they represent. Each is the unemployment rate for a subset of the population, defined as the combination of race, gender, age group, and education. How are Hispanic men over 45 with only a high school diploma doing compared to the overall rate? What about black women 15–24? Or white college grads of any age and gender?

Clicking on a line moves the blue line there so it’s easier to see, but the overall rate stays easily visible. The y axis also rescales nicely when the values go above what it can currently display.

In addition, the filters at the top also respond to the selection to show who is selected. Clicking around inside the chart updates them. Hm, so maybe I can use those to explore too? And of course you can, broadening or narrowing your selection, or clicking through different age groups of the same subset of the population, etc.

The Human-Computer Interaction field has a nice term for an indication of more data and interaction: information scent. This is usually used with widgets, which indicate where more information can be found (like the little tick marks on the scrollbar in Chrome when you search within the page). What makes this chart so good is its clever use of information scent to entice viewers to dig deeper, explore, and ask questions.

It also brilliantly and clearly demonstrates the fact that the overall unemployment rate is a rather meaningless number. The actual rate in your demographic is likely to look very different, and the range is huge. This was the inspiration for my What Means Mean piece, though I don’t think that was nearly as clear as this.

The chart shows interesting data, explains a somewhat advanced concept, and invites people to interact with it. This comes in a package that is understated and elegant in its design. Best line chart ever.

Bar Chart

I have already written about the Bikini Chart, and it remains my favorite bar chart. It’s an incredibly effective piece of communication, and it’s all just based on a simple time series. The fact that the bars point down clearly communicates how it is supposed to be read: down is bad, less down is better than more down.

Bar charts are not exactly a common medium for artistic expression, but the designers of this chart managed to subtly but clearly get a message across.

Bubble Chart/Animated Scatterplot

Animated scatterplots may not have been invented by Hans Rosling and gapminder, but they certainly were not a common thing until his TED talk in 2007. And while it may seem a bit facetious to point to the only reasonably well-known example of a particular chart type as my favorite one, this is clearly one of my favorite charts, no matter what type.

The animation may seem like a bit of a gimmick (and it has been criticized as not being terribly effective), but it works to communicate a number of important pieces of information.

The main piece of information, of course, is change over time. How have different countries changed in terms of their wealth, healthcare, etc.? This is reasonably effective, because there are trends, and many countries follow them. The outliers are reasonably easy to spot, especially when you can turn on trails and replay the animation. It’s not always immediately possible to see everything, but it does invite people to play and explore.

But then, there are the explanations. There is the clever animation that constructs the two-dimensional scatterplot from a one-dimensional distribution. There is the clever drill-down animation that breaks continents down into countries, and countries down into quintiles, to show the enormous range of values covered by each. This is not just a simple data display, but a way to introduce people to statistical concepts and data operations they may have heard of but don’t understand (drill-down), or never have heard of in the first place (quintiles).

Rosling’s video, and the gapminder software, not only introduced millions of people to data they knew nothing about (the video has over 8.5 million views!), it also demonstrated how a compelling story can be told without a single photograph or other image, just with data. That is an incredible achievement that opened our eyes to the possibilities of data visualization for communication.

Appreciating Good Work

It’s easy to find, and make fun of, bad charts. But between all the pie chart bashing and general criticism of bad charts, it is equally important to find the good examples and try to figure out what makes them work so well. Even if it may be more fun to beat up the bad examples, we will ultimately learn more from understanding the design choices and ideas that went into the good ones.

What is Data Journalism?

Wed, 07/30/2014 - 05:03



Is a data journalist one who unearths the data, who finds the insights in the data, who finds the right way to visually communicate the data? The answer is, of course, all three. But let’s tease them apart and look at each separately.

Unearthing the Data

First, the data has to be found. And finding, in the journalism context, doesn’t always mean just scouring the web. There may be sources nobody else has access to. Data may have to be wrangled out of the government’s hands with Freedom of Information Act (FOIA) or similar requests. That data then might need to be cleaned, processed, sometimes even made machine-readable in the first place.

Cleaning data is not easy, and can be incredibly time-consuming and error-prone. It requires good knowledge of data cleaning tools, scripting languages, optical character recognition (OCR), and the common pitfalls of different data formats and types. It can be difficult to verify that cleaning the data has not inadvertently destroyed or skewed it in some way.

One example I am familiar with is the work Sarah Cohen and her colleagues did for their Careless Detention series in The Washington Post in 2008. They collected data on the deaths of immigration detainees (using FOIA requests, etc.), and were able to see regional patterns in the resulting data of what they classified as questionable deaths on a simple map. The resulting story was based on the people rather than the data, but the data led to the story.

Finding the Insights in the Data

Once the data has been found, it needs to be analyzed to find out what it actually contains. Most of the time, there is nothing interesting in it. But in those cases where a discovery is made, it can make for a great story.

The skills required here are quite different from the data digging. The key skill is data analysis: statistics, data exploration, hypothesis testing, etc. It can also require domain knowledge, e.g., about economics when the story is about unemployment, etc.

One of my favorite examples is Hannah Fairfield’s Driving Safety in Fits and Starts in The New York Times two years ago. The data had to be collected, but was publicly available. The key part that made for the story, however, was finding explanations for the patterns that emerged (and also a very compelling visual representation).

Communicating the Insight

Given the insight, getting that across should be easy, right? Well, no. That’s the big mistake many academics make, and this is where you can see the most impressive work.

Perhaps my favorite example of that is Jonathan Corum’s whale/lunge feeding graphic (discussed in his phenomenal Tapestry talk last year). The data was collected by scientists, and they had already created a chart for a paper. Corum’s insight was to put something back into the graph that the scientists had left out: the depth the whale was diving to. Perhaps this was obvious to the scientists, but certainly not to the readers of The New York Times.

A more recent example was on the behavior of dogs in different settings. The data again came from a paper, which even included almost the exact same chart that was used in the NY Times piece. But only almost. The key differences are what turn a boring bar chart into an interesting, readable one: color, spacing, and some cute drawings. Enough for Kaiser Fung to use it as an argument that visualization can be worth paying for.

However, Greg McInerny argues that the NY Times version loses some important elements of the chart, namely the statistical significance of the differences. He also proposes some alternative designs that retain most of the stylistic improvements, while adding a bit more information.

Either way, the key part here is finding an interesting story like a gemstone, pulling it out of the surrounding material, and making it shine. It doesn’t always have to start from the raw data, though.

What Makes A Data Journalist?

All these examples are pieces of data journalism. Not all involve visualization. Not all entailed digging for data. Not all even require finding the insight yourself.

What data journalism requires, then, is a broad mix of skills and instincts. Not all are necessarily needed in all cases. But you never know which ones a story will require. Many of the technical and math skills are still rather unusual among the people typically working in journalism. That makes this new direction so interesting but also so problematic: how do we know if we can trust the work produced? Alberto Cairo is skeptical, and wants data journalism to up its standards.

But in a way, data journalism is the logical extension of what journalism has been all along: collecting facts and data, understanding the implications, finding the story, and reporting it. The tools and materials are changing. But soon, all journalism will be data journalism in one form or another.

Putting Data Into Context

Thu, 07/24/2014 - 04:59



Raw numbers are easy to report and analyze, but without the proper context, they can be misleading. Is the effect you’re seeing real, or a simple result of the underlying, obvious distribution? Too many analyses and news stories end up reporting things we already know.

This is a particular issue with data that has a spatial component. When the data is shown on the inevitable map, you often just see a distribution of people. Where there are more people, there are more tweets, there’s more crime, there are more customers, there are more coffee shops, etc. Many maps in fact show nothing but the underlying distribution of people. As usual, xkcd has captured the issue beautifully.

Crime on New York City Subways

An example of this is a map recently published by the New York Daily News showing the amount of crime in subway stations in New York City. Each bubble on the map shows one station, with the size of the bubble representing the number of crimes. Mousing over the bubbles soon reveals that many of the higher-crime stations are the ones with more than one line passing through (and thus likely more people).

While it is possible to switch the map to a view that shows the number of incidents per 100,000 people passing through the station, the story leads with the raw numbers. The data was actually collected over a period of five years (July 2008 to June 2013), which increases their magnitude. Reporting crimes per year seems like a more honest way of looking at these numbers.

The station with the most crime (Times Square/Port Authority) had 1791 incidents during that time, that’s less than one per day over five years. Incidentally, that is also New York’s busiest station, with more than 166,000 people passing through every single day. So that seemingly scary number of crimes comes out to one for every 170,000 people, which puts it in the safer half of stations (rank 219 out of 420 in the dataset).
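
The arithmetic behind those figures is short enough to spell out. The sketch below uses only the numbers quoted above; treating five years as 5 × 365 days is a simplifying assumption, and the variable names are made up here. How the Daily News computed its own per–100,000 figure is not known, so the last line is just one plausible reading.

```python
INCIDENTS = 1791        # Times Square/Port Authority over the whole period, from the text
DAILY_RIDERS = 166_000  # "more than 166,000" people per day, from the text
DAYS = 5 * 365          # assumption: five years, ignoring the leap day

incidents_per_day = INCIDENTS / DAYS
riders_per_incident = DAILY_RIDERS / incidents_per_day
incidents_per_100k_riders = 100_000 / riders_per_incident  # one plausible per-100,000 reading

print(f"incidents per day:        {incidents_per_day:.2f}")
print(f"riders per incident:      {riders_per_incident:,.0f}")
print(f"incidents per 100,000:    {incidents_per_100k_riders:.2f}")
```

The raw total only looks alarming until it is divided by the number of days and the number of people it is spread across.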

The number of people passing through each station is actually included in a little visualization in that article, though it’s not shown directly. Since they use Tableau Public for that, I was able to tease it out and create a pair of maps comparing the total number and the per–100,000 numbers side by side. Click on a bubble in one of the maps below to highlight it in both. There is also a tooltip listing the numbers.

The difference between the total and the per–100,000 number is important not just because it changes the picture. It’s the much more interesting number. The total number of crimes means nothing for the individual. What is much more relevant is how many crimes happen per person, because that provides a much closer estimate of each individual’s risk of becoming a victim. Large total numbers make for great headlines (Over 191,000 crimes total!), but they’re largely meaningless otherwise.

The per–100,000 number varies widely even within the top few stations based on total number of incidents. Number 2, 125th St, has over six times the rate per 100,000, and number 5, Broadway Junction, has almost ten times the rate (but less than half the total number) compared to Times Square. The distribution of crime per 100,000 on the map is also clearly different, with the Bronx, Brooklyn, and the Rockaways more dangerous than Manhattan.

There is also an interesting outlier, Broad Channel, which is discussed in the article. It is less of an outlier than it appears here, though, because the number of daily riders per station in the dataset is people going through turnstiles, which does not count people transferring between lines. Those make up a large portion of the people in some of the stations, like Broad Channel.

So even this first step of putting the data into context is not complete, but would benefit from better data about the actual number of people passing through each station per day. And there is clearly a lot more data that could be collected here: time of day, how busy each station is during those times, what kinds of crimes occur during what times of day, etc.

Comparing Cities

Another example is this New York Times story from over a year ago, which still gets my blood boiling when I read it. It compares New York City to Seattle in terms of their “geek appeal,” and talks about data science, among other things. The irony is completely lost on the author, though, who uses absolute numbers to compare two cities whose populations are vastly different.

Did you know that Seattle had only 139 Starbucks in 2013, but New York had 271? Or that there were a puny 85,000 IT workers in Seattle, compared to 168,400 in NYC? What a provincial little town that Seattle place is!

But when you account for the fact that New York is almost nine times as large, things look a bit different. Seattle has the highest density of Starbucks per capita in the world (second largest? Las Vegas!). All those numbers where Seattle has half of what New York has really mean that on a per-capita basis, Seattle has roughly 4.5 times as much!

Even in terms of venture capital, things look pretty good when you consider the difference in population: Seattle’s $671 million ends up being more than 3.5 times as much per capita as New York’s $1.6 billion.
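
The normalization is a one-liner once the population ratio is fixed. This sketch takes the factor of nine and the raw figures straight from the text above; the function name and the dictionary layout are assumptions for illustration, and a more careful version would use actual population counts instead of a rounded ratio.

```python
POPULATION_RATIO = 9  # New York / Seattle, "almost nine times as large" per the text

# (Seattle, New York) raw figures quoted in the text.
figures = {
    "Starbucks stores": (139, 271),
    "IT workers": (85_000, 168_400),
    "venture capital (million $)": (671, 1_600),
}


def per_capita_multiple(seattle: float, new_york: float,
                        population_ratio: float = POPULATION_RATIO) -> float:
    """How many times larger Seattle's per-capita figure is than New York's."""
    return seattle / new_york * population_ratio


multiples = {name: per_capita_multiple(s, n) for name, (s, n) in figures.items()}
for name, multiple in multiples.items():
    print(f"{name}: Seattle has about {multiple:.1f}x New York's per-capita figure")
```

Every raw comparison in which Seattle looks half as big flips once the same denominator is applied to both cities.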

What to Compare

Comparing numbers requires understanding, and controlling for, context. Just throwing around raw numbers often leads to wrong conclusions, or shows patterns that are already known (New York is bigger than Seattle, more crime occurs where there are more people). Not accounting for those is not just a small lapse, it’s wrong.

The more complicated question is, what to compare to? Daily riders are a good first step when looking at the subway crime data, but there are caveats. The analysis would be much better with more, and more fine-grained, data. How should two cities be compared? Is population the right metric? What about GDP? How do we meaningfully compare bike lanes between New York and Seattle, for example? What can we use to make them comparable? Street miles? Number of bicycles? Number of bicycle miles ridden? Where do we get that data from?

But even when the comparison data is not perfect, some normalization is better than none at all. And it is important to understand the limitations and uncertainties of the analysis, even when perfect comparison data is not available.

Teaser image: Heatmap, by Randall Munroe/xkcd