Data Mining: Text Mining, Visualization and Social Media

Metric driven Agile for Big Data

Sat, 04/20/2013 - 21:43

Categories:

Search

Working in Bing Local Search brings together a number of interesting challenges.

Firstly, we are in a moderately sized organization, which means that our org chart has some rough similarities to our high level system architecture. This means that we have back-end teams who worry mostly about data - getting it, improving it and shipping it. These teams are not sitting in the end users' laps and our customers, to some extent, are internal.

Secondly, we are dealing with 'big data'. I don't consider local search, as it is traditionally implemented, to be a big data problem per se. However, when one starts to consider processing user behaviour and web scale data to improve the product, it does turn into a big data problem.

Agile (or eXtreme programming) brings certain key concepts. These include a limited time horizon for planning (allowing issues to be addressed in a short time frame and limiting the impact of taking a wrong turn) and the 'on-site customer'.

The product of a data team in the context of a product like local search is somewhat specialized within the broader scope of 'big data'. Data is our product (we create a model of a specific part of the real world - those places where you can perform on-site transactions), and we leverage large scale data assets to make that data product better.

The agile framework uses the limited time horizon (the 'sprint' or 'iteration') to ensure that unknowns are reduced appropriately and that real work is done in a manner aligned with what the customer wants. The unknowns are usually related to the customer (who generally doesn't really know what they want), the technologies (candidate solutions need to be tested for feasibility) or the team (how much work can they actually get done in a sprint). Having attended a variety of scrum / agile / eXtreme training events, I am now of the opinion that the key unknown of big data - what is actually in the data itself - is generally not considered in the framework (quite possibly because this approach to engineering took off long before large scale data was a thing).

In a number of projects where we are being agile, we have modified the framework with a couple of new elements.

Metrics not Customers: we develop key metrics that guide our decision making process, rather than rely on a customer. Developing metrics is actually challenging. Firstly, they need to be a proxy for some customer. As our downstream customers are also challenged by the big data fog (they aren't quite sure what they will find in the data they want us to produce for them), we have to work with them to come up with proxy metrics which will guide our work without incurring the cost of doing end-to-end experimentation at every step. In addition, metrics are expensive - rigorously executing and delivering measurements is a skill required of second generation big data scientists.
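As a rough illustration of what a cheap proxy metric might look like, here is a minimal sketch that estimates the precision of a data product from a hand-judged sample and attaches a normal-approximation confidence interval. The choice of precision as the metric, the function name `precision_with_interval` and the sample data are all assumptions made for the example, not a description of what the team actually measures.

```python
import math


def precision_with_interval(labels, z=1.96):
    """Estimate precision from a judged sample.

    labels: list of booleans, True when a sampled record was judged correct.
    z:      z-score for the interval (1.96 is roughly a 95% interval).
    Returns (estimate, lower bound, upper bound), clipped to [0, 1].
    """
    n = len(labels)
    if n == 0:
        raise ValueError("need at least one judged record")
    p = sum(1 for ok in labels if ok) / n
    # Normal approximation to the binomial; crude for very small samples.
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Hypothetical sample: 100 judged records, 90 of them correct.
sample = [True] * 90 + [False] * 10
estimate, low, high = precision_with_interval(sample)
print(f"precision {estimate:.2f} (interval {low:.3f} to {high:.3f})")
```

Reporting the interval alongside the point estimate makes it easier to tell whether a change between two sprints is a real movement or just sampling noise.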

The Data Wallow: While I'm not yet happy with this name, the basic concept is that in addition to the standard meetings and behaviours of agile engineering, we have the teams spend scheduled time together wallowing in the data. The purpose of this is twofold: firstly, it is vital that a data team be intimate with the data they are working with and the data products they are producing - the wallow provides shared data accountability. Secondly, you simply don't know what you will find in the data and how it will impact your design and planning decisions. The wallow provides a team experience which will directly impact sprint / iteration planning.

O Knowledge Graph, Where Art Thou?

Mon, 02/11/2013 - 04:50

Categories:

Search

The web search community, in recent months and years, has heard quite a bit about the 'knowledge graph'. The basic concept is reasonably straightforward - instead of a graph of pages, we propose a graph of knowledge where the nodes are atoms of information of some form and the links are relationships between those statements. The knowledge graph concept has become established enough for it to be used as a point of comparison between Bing and Google.
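To make the idea concrete, here is a minimal sketch of a graph of knowledge held as labeled edges between entities. The class name `KnowledgeGraph`, the relation names and the three example facts are assumptions made for illustration; they say nothing about how Bing or Google actually store their graphs.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Tiny in-memory graph: nodes are entities, edges are labeled relationships."""

    def __init__(self):
        # subject -> set of (relation, object) pairs
        self._edges = defaultdict(set)

    def add(self, subject, relation, obj):
        self._edges[subject].add((relation, obj))

    def objects(self, subject, relation):
        """Return, in sorted order, everything linked from subject by relation."""
        pairs = self._edges.get(subject, ())
        return sorted(o for r, o in pairs if r == relation)


kg = KnowledgeGraph()
kg.add("Kodo", "is_a", "taiko group")
kg.add("Kodo", "performs_at", "Meany Hall")
kg.add("Meany Hall", "located_on", "University of Washington campus")

print(kg.objects("Kodo", "performs_at"))
```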

Last night, I went to see a performance of Kodo - regarded internationally as the premier taiko group. A search on Bing for 'kodo' produced the following result:

 

Bing showed good results for the web and images as well as a knowledge driven portion of the answer from Wikipedia with links to play some of their songs. Not bad - but no mention of the performance.

As Kodo were performing at Meany Hall on the University of Washington campus, I did another search on Bing for the venue:

Here we see something better - the venue is recognized as a venue and consequently joined with the events that are known to Bing, including the concert I was attending. As the event information included a link to the performer (the blue Kodo link in the screen shot) I followed through and found Bing gave me event information.


In these interactions, we can see part of the promise of the knowledge graph, but many areas for improvement. The event node relates the performer to the venue to the event. However, the venue information in this part of the graph is isolated from that used to deliver the result for the query purely about the venue (note that the addresses are different - a common problem with campus and mall-like areas). The above experience, I think, shows the true challenge of the knowledge graph proposition - bringing all the isolated data graphs together correctly when the nodes in the graphs are actually representations of the same real world entities.
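The merging problem can be sketched in a few lines. The example below groups records from two hypothetical sources whenever their normalized names match, and keeps every distinct address rather than choosing one. Matching on name alone is a deliberately naive stand-in; the source names, field names and placeholder addresses are all assumptions, and a real system would likely need fuzzier matching and location signals.

```python
import re


def normalize_name(name):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    return " ".join(cleaned.split())


def merge_records(records):
    """Group records whose normalized names match.

    Conflicting addresses are all retained so the disagreement stays visible.
    """
    merged = {}
    for rec in records:
        key = normalize_name(rec["name"])
        entity = merged.setdefault(
            key, {"names": set(), "addresses": set(), "sources": set()}
        )
        entity["names"].add(rec["name"])
        entity["addresses"].add(rec["address"])
        entity["sources"].add(rec["source"])
    return merged


# Hypothetical records for the same venue from two isolated graphs.
records = [
    {"source": "venue_listing", "name": "Meany Hall", "address": "placeholder address A"},
    {"source": "event_feed", "name": "meany hall.", "address": "placeholder address B"},
]
entities = merge_records(records)
print(entities["meany hall"]["addresses"])
```

Keeping both addresses on the merged node is one possible design choice: it defers the question of which address is right, which is exactly the campus and mall problem noted above.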

Note that in exploring this particular scenario, Bing appeared to be doing a little better than Google, though Google had partial event information associated with the Kodo entity.


As these names are possibly taken from the listings information from different sources, the name of the performer is confusingly presented in different forms.

Much of what we see out there in the form of knowledge returned for searches is really isolated pockets of related information (the date and place of birth of a person, for example). The really interesting things start happening when the graphs of information become unified across type, allowing - as suggested by this example - the user to traverse from a performer to a venue to all the performers at that venue, etc. Perhaps 'knowledge engineer' will become a popular resume buzzword in the near future, as 'data scientist' has become recently.
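The performer to venue to performers traversal is simple once the data sits in one graph. A minimal sketch follows, assuming events are available as (performer, venue) pairs; every performer and venue other than Kodo and Meany Hall is made up for the example.

```python
from collections import defaultdict

# Hypothetical event listings as (performer, venue) pairs.
events = [
    ("Kodo", "Meany Hall"),
    ("Hypothetical Quartet", "Meany Hall"),
    ("Hypothetical Band", "Some Other Venue"),
]

# Index the edges in both directions so the graph can be walked either way.
venues_by_performer = defaultdict(set)
performers_by_venue = defaultdict(set)
for performer, venue in events:
    venues_by_performer[performer].add(venue)
    performers_by_venue[venue].add(performer)


def co_performers(performer):
    """Walk performer -> venues -> every other performer at those venues."""
    found = set()
    for venue in venues_by_performer.get(performer, ()):
        found |= performers_by_venue[venue]
    found.discard(performer)
    return sorted(found)


print(co_performers("Kodo"))
```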

 

Participation and Observation in Search

Sat, 02/09/2013 - 01:47

Categories:

Search

The early days of web search were essentially about observation. The web search engine observed the web (documents, links and user behaviours) and then delivered results based on those observations.

In recent years we have started to see more of a position of participation in web search engines. Examples of participation include:

  • Hosting web sites for businesses - by getting their data on the web, more useful targets are provided for users and a short loop is developed with the source of accurate data, i.e. the business.
  • Providing feed proxy services (like FeedBurner) - by providing a service to bloggers, the search engine gets access to valuable user information.
  • Hosting content - by hosting news articles and blogs directly, the search engine gets real time updates to content first as well as direct access to user behaviour.
  • Exposing data editing tools like map editors - by offering crowdsourcing tools, the search engine benefits the community by improving data and is the first to know about and leverage that fresh information.

Participation looks like a core strategy for search.