Data Mining: Text Mining, Visualization and Social Media


Hopper - new in the travel space

Sun, 01/19/2014 - 19:24



Briefly - Hopper is something new in the travel / local space. In their own words:

What if you could plan an amazing trip based on a vague idea — like “spring surfing in California” or “Mediterranean cruise”? What if logistical information popped up right when you needed it, so you wouldn't have to spend hours on research? This is our vision: to make planning a trip an effortless extension of discovering and exploring new places.

We spent several years experimenting with different tools, technology and algorithms to collect, organize and manage massive amounts of travel data. The result is a new kind of trip planning engine, powered by the world's largest structured database of travel information.

I've barely explored the site, but I see it as part of a trend towards rich exploration experiences: plenty of imagery, the social aspect of local (and specifically travel), and smarts around itinerary planning and travel booking. There are some similarities with the recently acquired RouteSet demo from PerceptLabs and also with the geo-microblog site Findery.

Visually, the exploration of a place on Hopper looks like this:

Which is to say - visually very rich, with images provided (I assume) by the community. This wave of modern location products makes one ask the question - how important is the map for engagement in local search?

Right now, the site has some issues. As a signed-in user I'm told to browse others' experiences and 'save' what I find interesting. However, there is no mention of a 'save' action on any of the posts on the site. Consequently, it is a little hard right now to give a write-up of how the site works. I do note that posts have a reference to a source. Does Hopper crawl these sources? Or do the users cross-post?

Update: regarding saving - a search on Google for ' save -near' surfaces pages which contain the word save, like this one: . However, according to Chrome, the page itself has no instance of the string 'save' on it. Looking at the source for the page shows that there is actually a save button and other mechanisms in place. Not sure what is amiss here. Testing on IE also fails to surface any visible save functionality.

FitBit: A great product with an even better website

Sun, 12/29/2013 - 18:36



Briefly - Wakako gave me (actually us) a FitBit for Christmas. This is a great product if you are (like me) motivated by data to take action. While I appreciate the device design (small but functional), I really like the thought that has gone into the data presentation in the dashboard. The displays of the key variables are clean and yet subtle enough to reward interaction by revealing additional dimensions.

Review: Information is Beautiful by David McCandless

Sun, 12/29/2013 - 01:23



Information is Beautiful is a thought-provoking labour of love by one of the first true data journalists, David McCandless. It is a simply structured collection of graphical interpretations of a variety of interesting statistics, factoids and opinions. It is compelling in its ability to provoke exclamations of surprise at the relationships between facts (e.g. the financial crisis costing us almost four times more than the expected total cost of the west's adventurism in Iraq and Afghanistan) as well as generating respect for the creativity and design that has gone into presenting the information.

That being said, the book also illustrates the very tricky position of a data journalist (or whatever we eventually call those individuals who render 'information' visually). Visualization of data in the form of graphics and the expression of facts, opinions, processes, etc. in the form of information visualizations is, essentially, a new language. As consumers of this new language, we have to place a large amount of trust in the translator.

As is appropriate for a book aimed more at the coffee table than the academic library, Information is Beautiful comes with no explanation of the graphical idioms used. Nor does it come with any summary of conclusions or discussion of the implications drawn from the data or the visualization. It is more like the glossy book of fabulous beaches from around the world which contains little or no indication of where these places are or what is just out of sight, or lurking behind the scene. This is, in my opinion, a grave oversight.

For example, the first piece presents a number of types of spending (e.g. defence budgets, foreign aid payments, etc.) and compares them - via the cleverly engineered positioning of a page turn - with the cost of 'the financial crisis' (which I assume is the most recent such event). Here the intended implication is clear - the financial crisis cost a lot more than all that other stuff that you think is costing us a lot. But what is the scope of all the other stuff? The defence budgets for the US, China, the UK, Saudi Arabia and India are presented - are these the largest budgets? If so, what percentage of all defence budgets do they represent? Are there other events which provide more context (e.g. other financial meltdowns, the 'cost' of a world war, the cost of other wars)? It is clear that the author has selected the variables being compared with the recession, but without knowledge of the selection criteria it is hard to know either the intended spin, or how meaningful the conclusion that the reader is led to might be.

The second graphic - an exploration of the values and opinions of left and right leaning political positions - suffers in a similar way from a lack of context. The graphic, for example, appears to make the statement 'right leaning governments don't interfere with [the] social lives [of their citizens]'. What are we to make of this? Is this the opinion of the author? Or is it somehow a statement derived from one of the sources quoted at the bottom of the image (Wikipedia, etc.)? As there are a number of sources, is this a consensus or is it an amalgam of the information in these sources?

McCandless presents some statistics on the structure of rape reports, prosecutions and convictions in England and Wales. I've approximated the visualization below:

Overlapping circles evoke the common concept of the Venn diagram. However, here the semantics would appear to indicate that Prosecutions include rapes that are not reported, and that Convictions are exclusively obtained for non-reported rapes. I can't make sense of that.

McCandless often uses Google's Insights tool to make observations about the relative importance of various concepts. This type of analysis requires some amount of preparation for the reader. These graphs have no vertical axis label or units. Google labels the y-axis as 'interest over time' and provides a reasonable amount of explanation about the graphs and how to interpret them, including:

A downward trending line means that a search term's popularity is decreasing. It doesn't mean that the absolute, or total, number of searches for that term is decreasing.

In other words, a peak doesn't necessarily mean that there is more absolute interest - it could just as easily indicate that there is a reduced amount of interest in some other topic which therefore takes away mass from the denominator. Quite possibly the conclusions that can be drawn from these comparative time series are reasonable where there are differences between the trends (this may not be the case for compared series that show correlated peaks).
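To make the denominator effect concrete, here is a minimal sketch with made-up numbers. It assumes a simplified version of the normalization (each period's share of all searches, rescaled so the highest share reads 100); the function name and the figures are hypothetical and are not taken from Google's tool or from the book.

```python
def relative_interest(term_counts, total_counts):
    """Per-period share of all searches, rescaled so the peak is 100.

    This is an assumed, simplified stand-in for an 'interest over time'
    index, not Google's actual implementation.
    """
    shares = [term / total for term, total in zip(term_counts, total_counts)]
    peak = max(shares)
    return [round(100 * share / peak) for share in shares]


# Hypothetical data: absolute searches for the term fall every period...
term = [1000, 950, 900]
# ...but the total number of searches falls faster.
total = [100_000, 80_000, 60_000]

print(relative_interest(term, total))  # [67, 79, 100]
```

The index climbs from 67 to 100 even though the absolute number of searches for the term drops in every period, which is exactly why an unlabelled y-axis on these graphs is a problem.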

In terms of colour palette, McCandless is clearly from the Wired circa 2000 school - a school which embraces challenging colour schemes (such as white characters on a yellow background - see 'Lack of Conviction') and where the semantics of colours often trumps the contrast (for example, using a range of similar colours in a legend to a graphic leaving the reader to guess which blob goes with which meaning - see 'Most Successful Rock Bands').

Overall, this is a fascinating book. It has received popular coverage in the media and I'm sure it continues to sell well all over the globe. As a community of readers, we have to become data literate so that we can consume this type of content with the same critical eye as if we were reading the statements in dry text. It is not the information that is beautiful per se, it is the presentation of the information. I feel that this book would be far more useful if it contained a preface of some sort helping the reader to understand this new language, to educate them in the skills required to draw insights, but also to question the translation.