Data Mining: Text Mining, Visualization and Social Media

Bing hearts World Cup 2014, Google - not so much

Sat, 07/12/2014 - 19:19

Categories:

Search

While Google has been doing a great job with its front page animations (today's is very nice, illustrating how Brazil and The Netherlands are on their way to Russia for 2018), Bing appears to be far more attentive to actually answering questions about the competition. For example:

Compared to Bing's

Google's answer brings up some interesting news articles, but Bing brings up stats on the teams and even a prediction of who will win (Cortana - which is driving these predictions - has been doing a perfect job of predicting game outcomes).

GrubHub's Phasmid Websites

Sun, 05/04/2014 - 04:49

Categories:

Search

The rationale behind mining business data directly from the business's own website is that the business has a clear economic motivation to ensure that the data is up to date. If you own a restaurant that changes location, and your website still publishes the former address, those potential customers who visit your site will not be enjoying your delicious offerings.

For the web mining proposition to work, it is important firstly to know that you have in your hand a genuine business website and secondly to have excellent extraction and inference technology to pull the required data from the HTML.

The first requirement can get pretty murky. There are sites that could easily be mistaken for a business site but which are, in fact, other types of legitimate sites (such as a blog with the contact information of the blogger). Unfortunately, there are also sites which are essentially fake storefronts for the actual business in question. The most obvious are those sites which are simply parked domains with some spam links on them. A domain parker might snap up the domain yourrestaurantseattle.com hoping that web surfers looking for yourrestaurant will land there and give them some clicks. An emerging trend is a far more sophisticated site which (through some amount of templating but also specific editorial attention) aims to look like the actual site of the business. The motivation for these sites is the burgeoning third-party restaurant delivery service industry, for which GrubHub might be the poster child.

Take, for example, the 1947 Tavern in Pittsburgh. This establishment's site is located on the web at http://www.1947tavern.com/. GrubHub, however, has set up a phasmid site at http://1947tavernpittsburgh.com/ which looks like a legitimate home page for the tavern.
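One cheap signal a web mining system could use here is the shape of the domain name itself: the phasmid domain is the official name with a location tacked on. What follows is only a rough sketch under that assumption. The function names are my own, and taking the first label of the host as the "name" is a simplification that works for plain .com domains like these two but not in general.

```python
from urllib.parse import urlparse


def first_label(url):
    """Return the first label of the URL's host, ignoring a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]
    return host.split(".")[0]


def extra_suffix(official_url, candidate_url):
    """If the candidate's name is the official name plus extra characters,
    return those extra characters; otherwise return None."""
    official = first_label(official_url)
    candidate = first_label(candidate_url)
    if official and candidate != official and candidate.startswith(official):
        return candidate[len(official):]
    return None


suffix = extra_suffix("http://www.1947tavern.com/",
                      "http://1947tavernpittsburgh.com/")
print(suffix)  # the extra text appended to the official name
```

A non-empty suffix proves nothing on its own (a business could legitimately own both domains), so this is best treated as a reason to look closer rather than a verdict.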

There is, however, a modest amount of GrubHub visibility on this site, including the link to the menu as well as references to the tavern's status in the GrubHub universe.
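That visibility can be checked mechanically. The sketch below, using only Python's standard library, collects the hosts that a page links to and reports any that belong to a watched delivery service. The watch list and the sample HTML are made up for illustration; I have not copied the markup of the actual phasmid page.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkHostCollector(HTMLParser):
    """Collects the host of every absolute <a href="..."> link in a page."""

    def __init__(self):
        super().__init__()
        self.hosts = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                host = urlparse(value).hostname  # None for relative links
                if host:
                    self.hosts.append(host.lower())


def watched_hosts(html, watched=("grubhub.com",)):
    """Return the sorted set of linked hosts at or under a watched domain."""
    collector = LinkHostCollector()
    collector.feed(html)
    return sorted({
        h for h in collector.hosts
        if any(h == w or h.endswith("." + w) for w in watched)
    })


# Hypothetical page fragment, not the real site's HTML.
sample = ('<a href="https://www.grubhub.com/some-menu">Menu</a>'
          '<a href="/about">About</a>')
print(watched_hosts(sample))
```

A genuine business site may also link to its delivery partner, so a hit here is one more signal to weigh, not proof that the site is a phasmid.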

Inspecting the domain's registration data in the whois database shows that it was actually registered by GrubHub. This brings up the following entry as the email contact:

Registrant Email: @GRUBHUB.COM (here http://www.whois.com/whois/1947tavernpittsburgh.com)
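Once the whois text is in hand, the check itself is simple to automate. Here is a minimal sketch that pulls the domain out of a "Registrant Email:" line like the one above. It assumes the record uses exactly that field label (whois formats vary by registrar) and it does not perform the whois lookup itself; the function name is my own.

```python
import re

# Matches a line such as "Registrant Email: someone@EXAMPLE.COM".
# The part before the '@' may be empty, as in the entry quoted above.
_REGISTRANT_EMAIL = re.compile(
    r"^\s*Registrant Email:\s*\S*@(\S+)\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def registrant_email_domain(whois_text):
    """Return the lowercased domain of the registrant email, or None."""
    match = _REGISTRANT_EMAIL.search(whois_text)
    return match.group(1).lower() if match else None


record = "Registrant Email: @GRUBHUB.COM"
print(registrant_email_domain(record))
```

Comparing the result against the business's own domain, or against a list of known delivery services, gives a web mining system a fairly strong hint about who really controls the site.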

I can't find the email address ghwebsites at grubhub.com on the GrubHub website, but a site search on Google over the Web Analyzer domain (site:wa-com.com "ghwebsite at grubhub.com") produces 13,200 results. Dipping into these brings up further examples of fake sites made with the same template as that for the 1947 Tavern phasmid.

[In the above, I substitute ' at ' for the '@' to avoid TypePad's automated obfuscation of email addresses in posts.]

Here's another example: http://indiagardenmonroeville.com/.

I don't believe these sites are particularly malicious - most likely, they bring additional customers to the business even if it is through deception. They do, however, pose a problem for web mining systems. There is less pressure on GrubHub to keep the exact details of the business up to date. In addition, when GrubHub goes belly up, these sites will linger.