Topic Maps

Installing Distributed Solr 4 with Fabric

Another word for it | 2 hours 11 min ago

Categories:

Topic Maps

Installing Distributed Solr 4 with Fabric by Martijn Koster

From the post:

Solr 4 has a subset of features that allow it to be run as a distributed fault-tolerant cluster, referred to as “SolrCloud”. Installing and configuring Solr on a multi-node cluster can seem daunting when you’re a developer who just wants to give the latest release a try. The wiki page is long and complex, and configuring nodes manually is laborious and error-prone. And while your OS has ZooKeeper/Solr packages, they are probably outdated. But it doesn’t have to be a lot of work: in this post I will show you how to deploy and test a Solr 4 cluster using just a few commands, using mechanisms you can easily adjust for your own deployments.

I am using a cluster consisting of virtual machines running Ubuntu 12.04 64-bit and I am controlling them from my MacBook Pro. The Solr configuration will mimic the Two shard cluster with shard replicas and zookeeper ensemble example from the wiki.

You can run this on AWS EC2, but some special considerations apply, see the footnote.

We’ll use Fabric, a light-weight deployment tool that is basically a Python library to easily execute commands on remote nodes over ssh. Compared to Chef/Puppet it is simpler to learn and use, and because it’s an imperative approach it makes sequential orchestration of dependencies more explicit. Most importantly, it does not require a separate server or separate node-side software installation.

DISCLAIMER: these instructions and associated scripts are released under the Apache License; use at your own risk.

I strongly recommend you use disposable virtual machines to experiment with.
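To make the "imperative, sequential orchestration" point concrete, here is a minimal sketch in plain Python. It does not use Fabric's API and is not taken from the post's scripts: the host names, the `ubuntu` user and the placeholder steps are all assumptions for illustration. It only builds the ssh command lines, in the order they would run, and prints them.

```python
import shlex

# Hypothetical node names and placeholder steps, for illustration only.
HOSTS = ["node1", "node2", "node3"]
STEPS = [
    "sudo apt-get update",
    "echo 'install java here'",
    "echo 'start zookeeper here'",
    "echo 'start solr here'",
]


def ssh_command(host, remote_command, user="ubuntu"):
    """Build the argv list for running one command on one host over ssh."""
    return ["ssh", f"{user}@{host}", remote_command]


def plan(hosts, steps):
    """Return the ordered plan: each step is issued to every host before
    the next step begins, which keeps the dependencies explicit."""
    return [ssh_command(host, step) for step in steps for host in hosts]


if __name__ == "__main__":
    for argv in plan(HOSTS, STEPS):
        # Printed rather than executed; subprocess.run(argv) would run it.
        print(" ".join(shlex.quote(part) for part in argv))
```

A tool like Fabric wraps this same idea (a Python function per step, run over ssh on a list of hosts) with connection handling and error reporting.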

Something to get you excited about the upcoming weekend!

Enjoy!

MongoDB: The Definitive Guide 2nd Edition is Out!

Another word for it | 5 hours 46 min ago

Categories:

Topic Maps

MongoDB: The Definitive Guide 2nd Edition is Out! by Kristina Chodorow.

From the webpage:

The second edition of MongoDB: The Definitive Guide is now available from O’Reilly! It covers both developing with and administering MongoDB. The book is language-agnostic: almost all of the examples are in JavaScript.

Looking forward to enjoying the second edition as much as the first!

Although, I am not really sure that always using JavaScript means you are “language-agnostic.”

Probabilistic Programming and Bayesian Methods for Hackers

Another word for it | 9 hours 11 min ago

Categories:

Topic Maps

Probabilistic Programming and Bayesian Methods for Hackers by Cam Davidson-Pilon and others.

From the website:

The Bayesian method is the natural approach to inference, yet it is hidden from readers behind chapters of slow, mathematical analysis. The typical text on Bayesian inference involves two to three chapters on probability theory, then enters what Bayesian inference is. Unfortunately, due to mathematical intractability of most Bayesian models, the reader is only shown simple, artificial examples. This can leave the user with a so-what feeling about Bayesian inference. In fact, this was the author’s own prior opinion.

After some recent success of Bayesian methods in machine-learning competitions, I decided to investigate the subject again. Even with my mathematical background, it took me three straight days of reading examples and trying to put the pieces together to understand the methods. There was simply not enough literature bridging theory to practice. The problem with my misunderstanding was the disconnect between Bayesian mathematics and probabilistic programming. That being said, I suffered then so the reader would not have to now. This book attempts to bridge the gap.

If Bayesian inference is the destination, then mathematical analysis is a particular path towards it. On the other hand, computing power is cheap enough that we can afford to take an alternate route via probabilistic programming. The latter path is much more useful, as it denies the necessity of mathematical intervention at each step, that is, we remove often-intractable mathematical analysis as a prerequisite to Bayesian inference. Simply put, this latter computational path proceeds via small intermediate jumps from beginning to end, whereas the first path proceeds by enormous leaps, often landing far away from our target. Furthermore, without a strong mathematical background, the analysis required by the first path cannot even take place.

Bayesian Methods for Hackers is designed as an introduction to Bayesian inference from a computational/understanding-first, and mathematics-second, point of view. Of course as an introductory book, we can only leave it at that: an introductory book. For the mathematically trained, they may cure the curiosity this text generates with other texts designed with mathematical analysis in mind. For the enthusiast with less mathematical background, or one who is not interested in the mathematics but simply the practice of Bayesian methods, this text should be sufficient and entertaining.

Not yet complete, but you will find what is there very useful.
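As a taste of the "computational path" the authors describe, here is a minimal sketch that estimates the bias of a coin by simulation alone. It is not code from the book and uses only the Python standard library; the uniform prior, the rejection-sampling approach and the function name are my assumptions for illustration.

```python
import random


def posterior_samples(heads, flips, draws=50_000, seed=0):
    """Approximate the posterior over a coin's bias by rejection sampling.

    Draw a bias from a uniform prior, simulate `flips` tosses with it, and
    keep the bias only when the simulation reproduces the observed heads.
    """
    rng = random.Random(seed)
    kept = []
    for _ in range(draws):
        bias = rng.random()
        simulated = sum(rng.random() < bias for _ in range(flips))
        if simulated == heads:
            kept.append(bias)
    return kept


if __name__ == "__main__":
    samples = posterior_samples(heads=7, flips=10)
    print("kept samples:", len(samples))
    print("posterior mean:", sum(samples) / len(samples))
```

For this toy case the analytic answer is known (the posterior mean under a uniform prior is (heads + 1) / (flips + 2)), so the simulated mean can be checked against it. The point of the computational route is that the same loop still works when no closed form exists.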

Data Science eBook by Analyticbridge – 2nd Edition

Another word for it | 10 hours 43 min ago

Categories:

Topic Maps

Data Science eBook by Analyticbridge – 2nd Edition by Vincent Granville.

From the post:

This 2nd edition has more than 200 pages of pure data science, far more than the first edition. This new version of our very popular book will soon be available for download: we will make an announcement when it is officially published.

Sixty-two (62) new contributions split between data science recipes, data science discussions, data science resources.

If you can’t wait for the ebook, links to the contributions are given at Vincent’s post.

One post in particular caught my attention: How to reverse engineer Google?

The project sounds interesting but why not reverse engineer CNN or WSJ or NYT coverage?

Watch the stories that appear most often and the most visibly to determine what you need to do for coverage.

It may not have anything to do with your core competency, but then neither does gaming page rankings by Google.

It is just that gaming rankings becomes your business model, and then you are selling your service to people even less informed than you are.

Do be careful because some events covered by CNN, WSJ and the NYT are considered illegal in some jurisdictions.

Subway Maps and Visualising Social Equality

Another word for it | Wed, 05/22/2013 - 23:45

Categories:

Topic Maps

Subway Maps and Visualising Social Equality by James Cheshire.

From the post:

Most government statistics are mapped according to official geographical units. Whilst such units are essential for data analysis and making decisions about, for example, government spending, they are hard for many people to relate to and they don’t particularly stand out on a map. This is why I tried a new method back in July 2012 to show life expectancy statistics in a fresh light by mapping them on to London Tube stations. The resulting “Lives on the Line” map has been really popular with many people surprised at the extent of the variations in the data across London and also grateful for the way that it makes seemingly abstract statistics more easily accessible. To find out how I did it (and read some of the feedback) you can see here.

James gives a number of examples of the use of transportation lines making “abstract statistics more easily accessible.”

Worth a close look if you are interested in making dry municipal statistics part of the basis for social change.

US rendition map: what it means, and how to use it

Another word for it | Wed, 05/22/2013 - 23:05

Categories:

Topic Maps

US rendition map: what it means, and how to use it by James Ball.

From the post:

The Rendition Project, a collaboration between UK academics and the NGO Reprieve, has produced one of the most detailed and illuminating research projects shedding light on the CIA’s extraordinary rendition project to date. Here’s how to use it.

Truly remarkable project to date, but could be even more successful with your assistance.

Not likely that any of the principals will wind up in the dock at the Hague.

On the other hand, exposing their crimes may deter others from similar adventures.

Integrating the US’ Documents

Another word for it | Wed, 05/22/2013 - 21:42

Categories:

Topic Maps

Integrating the US’ Documents by Eric Mill.

From the post:

A few weeks ago, we integrated the full text of federal bills and regulations into our alert system, Scout. Now, if you visit CISPA or a fascinating cotton rule, you’ll see the original document – nicely formatted, but also well-integrated into Scout’s layout. There are a lot of good reasons to integrate the text this way: we want you to see why we alerted you to a document without having to jump off-site, and without clunky iframes.

As importantly, we wanted to do this in a way that would be easily reusable by other projects and people. So we built a tool called us-documents that makes it possible for anyone to do this with federal bills and regulations. It’s available as a Ruby gem, and comes with a command line tool so that you can use it with Python, Node, or any other language. It lives inside the unitedstates project at unitedstates/documents, and is entirely public domain.

This could prove to be really interesting, both as a matter of content and as a technique to replicate elsewhere.

I first saw this at: Mill: Integrating the US’s Documents.

DCAT Application Profile for Data Portals in Europe – Final Draft

Another word for it | Wed, 05/22/2013 - 21:27

Categories:

Topic Maps

DCAT Application Profile for Data Portals in Europe – Final Draft

From the post:

The DCAT Application profile for data portals in Europe (DCAT-AP) is a specification based on the Data Catalogue vocabulary (DCAT) for describing public sector datasets in Europe. Its basic use case is to enable a cross-data portal search for data sets and make public sector data better searchable across borders and sectors. This can be achieved by the exchange of descriptions of data sets among data portals.

This final draft is open for public review until 10 June 2013. Members of the public are invited to download the specification and post their comments directly on this page. To be able to do so you need to be registered and logged in.

If you are interested in integration of data from European data portals, it is worth the time to register, etc.

It is not all the data you are going to need to integrate a data set, but it is at least a start in the right direction.

Open Access to Weather Data for International Development

Another word for it | Wed, 05/22/2013 - 20:28

Categories:

Topic Maps

Open Access to Weather Data for International Development

From the post:

Farming communities in Africa and South Asia are becoming increasingly vulnerable to shock as the effects of climate change become a reality. This increased vulnerability, however, comes at a time when improved technology makes critical information more accessible than ever before. aWhere Weather, an online platform offering free weather data for locations in Western, Eastern and Southern Africa and South Asia provides instant and interactive access to highly localized weather data, instrumental for improved decision making and providing greater context in shaping policies relating to agricultural development and global health.

Weather Data in 9km Grid Cells

Weather data is collected at meteorological stations around the world and interpolated to create accurate data in detailed 9km grids. Within each cell, users can access historical, daily-observed and 8 days of daily forecasted ‘localized’ weather data for the following variables:

  • Precipitation 
  • Minimum and Maximum Temperature
  • Minimum and Maximum Relative Humidity 
  • Solar Radiation 
  • Maximum and Morning Wind Speed
  • Growing degree days (dynamically calculated for your base and cap temperature) 

These data prove essential for risk adaption efforts, food security interventions, climate-smart decision making, and agricultural or environmental research activities.
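The post does not say how aWhere computes growing degree days. As a rough sketch, one common convention clamps the daily minimum and maximum temperatures to the base and cap before averaging. The function name and the default base and cap values below are assumptions for illustration.

```python
def growing_degree_days(t_min, t_max, base=10.0, cap=30.0):
    """Daily growing degree days under one common convention (assumed here):
    clamp both temperatures to [base, cap], average them, subtract base."""

    def clamp(temperature):
        return max(base, min(cap, temperature))

    return (clamp(t_min) + clamp(t_max)) / 2.0 - base


if __name__ == "__main__":
    # Temperatures in degrees Celsius, made up for the example.
    for t_min, t_max in [(12, 24), (5, 35), (2, 8)]:
        print(t_min, t_max, growing_degree_days(t_min, t_max))
```

A season total is then just the sum of the daily values, which is why the choice of base and cap temperature matters for each crop.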

Sign up Now

Access is free and easy. Register at http://www.awhere.com/en-us/weather-p. Then, you can log back in anytime at me.awhere.com.  

For questions on the platform, please contact weather@awhere.com

At least as a public observer, I could not determine how much “interpolation” is going into the weather data. That would have a major impact on the risk of accepting the data provided at face value.

I suspect it varies from little interpolation at all in heavily instrumented areas to quite a bit in areas with sparser readings. How much is unclear.

It may be that the amount of interpolation in the data is a function of whether you use the free version or some upgraded commercial version.

Still, an interesting data source to combine with others, if you are mindful of the risks.
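To see why the amount of interpolation matters, here is a minimal sketch of one simple scheme, inverse distance weighting. The post does not say which method aWhere uses, so treat the method, the station coordinates and the readings as assumptions for illustration.

```python
import math


def idw(stations, x, y, power=2.0):
    """Inverse-distance-weighted estimate at (x, y).

    `stations` is a list of (station_x, station_y, value) tuples. Nearer
    stations get larger weights; an exact hit returns the station's value.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for station_x, station_y, value in stations:
        distance = math.hypot(x - station_x, y - station_y)
        if distance == 0.0:
            return value
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


if __name__ == "__main__":
    # Two made-up stations 2 units apart, reporting 10 and 20.
    stations = [(0.0, 0.0, 10.0), (2.0, 0.0, 20.0)]
    print(idw(stations, 1.0, 0.0))  # midway between the two stations
    print(idw(stations, 0.0, 0.0))  # exactly at the first station
```

With dense stations the estimate stays close to real observations. With sparse stations, a cell's value is mostly a weighted guess from readings far away, which is the risk to keep in mind when combining this data with other sources.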

Introduction to Artificial Intelligence (Berkeley CS188.1x)

Another word for it | Wed, 05/22/2013 - 19:17

Categories:

Topic Maps

Introduction to Artificial Intelligence (Berkeley CS188.1x)

The schedule for CS188.2x hasn’t been announced yet.

In the meantime, you can register for CS188.1x and peruse the videos, exercises, etc. while you wait for the second part of the course.

From the description:

CS188.1x is a new online adaptation of the first half of UC Berkeley’s CS188: Introduction to Artificial Intelligence. The on-campus version of this upper division computer science course draws about 600 Berkeley students each year.

Artificial intelligence is already all around you, from web search to video games. AI methods plan your driving directions, filter your spam, and focus your cameras on faces. AI lets you guide your phone with your voice and read foreign newspapers in English. Beyond today’s applications, AI is at the core of many new technologies that will shape our future. From self-driving cars to household robots, advancements in AI help transform science fiction into real systems.

CS188.1x focuses on Behavior from Computation. It will introduce the basic ideas and techniques underlying the design of intelligent computer systems. A specific emphasis will be on the statistical and decision-theoretic modeling paradigm. By the end of this course, you will have built autonomous agents that efficiently make decisions in stochastic and in adversarial settings. CS188.2x (to follow CS188.1x, precise date to be determined) will cover Reasoning and Learning. With this additional machinery your agents will be able to draw inferences in uncertain environments and optimize actions for arbitrary reward structures. Your machine learning algorithms will classify handwritten digits and photographs. The techniques you learn in CS188x apply to a wide variety of artificial intelligence problems and will serve as the foundation for further study in any application area you choose to pursue.

Dynamic faceting with Lucene

Another word for it | Wed, 05/22/2013 - 19:08

Categories:

Topic Maps

Dynamic faceting with Lucene by Michael McCandless.

From the post:

Lucene’s facet module has seen some great improvements recently: sizable (nearly 4X) speedups and new features like DrillSideways. The Jira issues search example showcases a number of facet features. Here I’ll describe two recently committed facet features: sorted-set doc-values faceting, already available in 4.3, and dynamic range faceting, coming in the next (4.4) release.

To understand these features, and why they are important, we first need a little background. Lucene’s facet module does most of its work at indexing time: for each indexed document, it examines every facet label, each of which may be hierarchical, and maps each unique label in the hierarchy to an integer id, and then encodes all ids into a binary doc values field. A separate taxonomy index stores this mapping, and ensures that, even across segments, the same label gets the same id.

At search time, faceting cost is minimal: for each matched document, we visit all integer ids and aggregate counts in an array, summarizing the results in the end, for example as top N facet labels by count.

This is in contrast to purely dynamic faceting implementations like ElasticSearch’s and Solr’s, which do all work at search time. Such approaches are more flexible: you need not do anything special during indexing, and for every query you can pick and choose exactly which facets to compute.

However, the price for that flexibility is slower searching, as each search must do more work for every matched document. Furthermore, the impact on near-real-time reopen latency can be horribly costly if top-level data-structures, such as Solr’s UnInvertedField, must be rebuilt on every reopen. The taxonomy index used by the facet module means no extra work needs to be done on each near-real-time reopen.

The dynamic range faceting sounds particularly useful.
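The index-time versus search-time split that Michael describes can be sketched with a toy model. This is plain Python, not Lucene's API: the `Taxonomy` class, the function names and the sample facets are all hypothetical. Labels are mapped to integer ids when a document is indexed, and counting at search time is just array increments.

```python
class Taxonomy:
    """Toy stand-in for a taxonomy index: one stable integer id per label."""

    def __init__(self):
        self._ids = {}
        self.labels = []

    def ordinal(self, label):
        if label not in self._ids:
            self._ids[label] = len(self.labels)
            self.labels.append(label)
        return self._ids[label]


def index_document(taxonomy, facet_paths):
    """Index time: map every level of each hierarchical label to an id."""
    ords = set()
    for path in facet_paths:
        for depth in range(1, len(path) + 1):
            ords.add(taxonomy.ordinal("/".join(path[:depth])))
    return sorted(ords)


def count_facets(taxonomy, matched_docs):
    """Search time: visit the ids of each matched document and count."""
    counts = [0] * len(taxonomy.labels)
    for ords in matched_docs:
        for ordinal in ords:
            counts[ordinal] += 1
    return counts


def top_n(taxonomy, counts, n):
    """Summarize as the top n labels by count (ties broken by label)."""
    ranked = sorted(range(len(counts)),
                    key=lambda o: (-counts[o], taxonomy.labels[o]))
    return [(taxonomy.labels[o], counts[o]) for o in ranked[:n] if counts[o]]


if __name__ == "__main__":
    tax = Taxonomy()
    docs = [
        index_document(tax, [("Author", "Lisa")]),
        index_document(tax, [("Author", "Bob"), ("Year", "2013")]),
        index_document(tax, [("Author", "Lisa"), ("Year", "2013")]),
    ]
    print(top_n(tax, count_facets(tax, docs), 3))
```

The trade-off in the quote shows up directly: counting is cheap because the label-to-id work was already paid for at index time, but you can only count facets you chose to index.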

30 Days to Data Storytelling

Another word for it | Wed, 05/22/2013 - 18:07

Categories:

Topic Maps

30 Days to Data Storytelling by James Lytle.

From the post:

We learn best from a diversity of inputs. That’s partly why our previous 30 days exercise sheet was such a huge hit.

It’s critical for analysts and presenters of data to share information in a way that people just get it. Enter data storytelling – a magical elixir to all your data communication woes! Well, maybe not quite. But you should be aware of recent efforts using this timeless approach to deliver information so naturally – through stories.

That’s why we’ve created 30 Days to Data Storytelling.

This exercise breaks down a structured (yet casual) introduction to data storytelling through a variety of resources. We wanted to provide a diversity of depth and inspiration. Feel free to skip around or follow our 4 week sequence. Print it and post it near the water cooler or slap it to your virtual desktop.

I don’t have a water cooler but I will post “30 Days to Data Storytelling” next to my monitors.

Whatever the subject, knowledge you can’t communicate to others is lost.

Hadoop, Hadoop, Hurrah! HDP for Windows is Now GA!

Another word for it | Tue, 05/21/2013 - 21:54

Categories:

Topic Maps

Hadoop, Hadoop, Hurrah! HDP for Windows is Now GA! by John Kreisa.

From the post:

Today we are very excited to announce that Hortonworks Data Platform for Windows (HDP for Windows) is now generally available and ready to support the most demanding production workloads.

We have been blown away with the number and size of organizations who have downloaded the beta bits of this 100% open source, and native to Windows distribution of Hadoop and engaged Hortonworks and Microsoft around evolving their data architecture to respond to the challenges of enterprise big data.

With this key milestone HDP for Windows offers the millions of customers running their business on Microsoft technologies an ecosystem-friendly Hadoop-based solution that is built for the enterprise and purpose built for Windows. This release cements Apache Hadoop’s role as a key component of the next generation enterprise data architecture, across the broadest set of datacenter configurations as HDP becomes the first production-ready Apache Hadoop distribution to run on both Windows and Linux.

Additionally, customers now also have complete portability of their Hadoop applications between on-premise and cloud deployments via HDP for Windows and Microsoft’s HDInsight Service.

Two lessons here:

First, Hadoop is a very popular way to address enterprise big data.

Second, going where users are, not where they ought to be, is a smart business move.

JSME: a free molecule editor in JavaScript

Another word for it | Tue, 05/21/2013 - 21:48

Categories:

Topic Maps

JSME: a free molecule editor in JavaScript by Bruno Bienfait and Peter Ertl. (Journal of Cheminformatics 2013, 5:24 doi:10.1186/1758-2946-5-24)

Abstract:

Background

A molecule editor, i.e. a program facilitating graphical input and interactive editing of molecules, is an indispensable part of every cheminformatics or molecular processing system. Today, when a web browser has become the universal scientific user interface, a tool to edit molecules directly within the web browser is essential. One of the most popular tools for molecular structure input on the web is the JME applet. Since its release nearly 15 years ago, however, the web environment has changed and Java applets are facing increasing implementation hurdles due to their maintenance and support requirements, as well as security issues. This prompted us to update the JME editor and port it to a modern Internet programming language – JavaScript.

Summary

The actual molecule editing Java code of the JME editor was translated into JavaScript with the help of the Google Web Toolkit compiler and a custom library that emulates a subset of the GUI features of the Java runtime environment. In this process, the editor was enhanced by additional functionalities including a substituent menu, copy/paste, drag and drop and undo/redo capabilities and an integrated help. In addition to desktop computers, the editor supports molecule editing on touch devices, including iPhone, iPad and Android phones and tablets. In analogy to JME the new editor is named JSME. This new molecule editor is compact, easy to use and easy to incorporate into web pages.

Conclusions

A free molecule editor written in JavaScript was developed and is released under the terms of the permissive BSD license. The editor is compatible with JME, has practically the same user interface as well as the web application programming interface. The JSME editor is available for download from the project web page http://peter-ertl.com/jsme/

Just in case you were having any doubts about using JavaScript to power an annotation editor.

Better now?

Consumers of Furry Pornography = Tax Dodgers?

Another word for it | Mon, 05/20/2013 - 22:00

Categories:

Topic Maps

No more heatmaps that are just population maps! by Pete Warden.

From the post:

I'm pleased to announce that there's a brand new 0.50 version of the DSTK out! It has a lot of bug fixes, and a couple of major new features, and you can get it on Amazon's EC2 as ami-7b9df412, download the Vagrant box from http://static.datasciencetoolkit.org/dstk_0.50.box, or grab it as a BitTorrent stream from http://static.datasciencetoolkit.org/dstk_0.50.torrent

What are the new features?

The biggest is the integration of high resolution (sub km-squared) geostatistics for the entire globe. You can get population density, elevation, weather and more using the new coordinates2statistics API call. Why is this important? No more heatmaps that are just population maps, for the love of god! I'm using this extensively to normalize my data analysis so that I can actually tell which places actually have an unusually high occurrence of X, rather than just having more people.

If you use the DSTK (and you should), do send Pete a note of appreciation.

I can’t wait to start mapping tax dodgers!
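The normalization Pete describes is simple to sketch once you have a population figure per place. The snippet below makes no DSTK calls and assumes the populations have already been looked up somewhere. The place names and numbers are made up for the example.

```python
def per_capita_rates(event_counts, populations, per=100_000):
    """Convert raw counts into occurrences per `per` residents.

    Places with a missing or zero population are skipped rather than guessed.
    """
    rates = {}
    for place, count in event_counts.items():
        population = populations.get(place, 0)
        if population > 0:
            rates[place] = count * per / population
    return rates


if __name__ == "__main__":
    counts = {"Big City": 500, "Small Town": 20}
    people = {"Big City": 5_000_000, "Small Town": 10_000}
    for place, rate in sorted(per_capita_rates(counts, people).items()):
        print(place, rate)
```

On raw counts Big City dominates the map. Per 100,000 residents, Small Town is the outlier, which is the difference between a heatmap of X and a heatmap of population.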

U.S. Senate Panel Discovers Nowhere Man [Apple As Tax Dodger]

Another word for it | Mon, 05/20/2013 - 21:47

Categories:

Topic Maps

Forty-seven years after Nowhere Man by the Beatles, a U.S. Senate panel discovers several nowhere men.

A Wall Street Journal Technology Alert:

Apple has set up corporate structures that have allowed it to pay little or no corporate tax–in any country–on much of its overseas income, according to the findings of a U.S. Senate examination.

The unusual result is possible because the iPhone maker’s key foreign subsidiaries argue they are residents of nowhere, according to the investigators’ report, which will be discussed at a hearing Tuesday where Apple CEO Tim Cook will testify. The finding comes from a lengthy investigation into the technology giant’s tax practices by the Senate Permanent Subcommittee on Investigations, led by Sens. Carl Levin (D., Mich.) and John McCain (R., Ariz.).

In additional coverage, Apple says:

Apple’s testimony also includes a call to overhaul: “Apple welcomes an objective examination of the US corporate tax system, which has not kept pace with the advent of the digital age and the rapidly changing global economy.”

Tax reform will be useful only if it is “transparent” tax reform.

Transparent tax reform means that every provision with more than a $100,000 impact on any taxpayer names all the taxpayers impacted, whether they pay more or less in taxes.

We have the data; we need the will to apply the analysis.

A tax-impact topic map anyone?

The Index-Based Subgraph Matching Algorithm (ISMA)…

Another word for it | Mon, 05/20/2013 - 21:23

Categories:

Topic Maps

The Index-Based Subgraph Matching Algorithm (ISMA): Fast Subgraph Enumeration in Large Networks Using Optimized Search Trees by Sofie Demeyer, Tom Michoel, Jan Fostier, Pieter Audenaert, Mario Pickavet, Piet Demeester. (Demeyer S, Michoel T, Fostier J, Audenaert P, Pickavet M, et al. (2013) The Index-Based Subgraph Matching Algorithm (ISMA): Fast Subgraph Enumeration in Large Networks Using Optimized Search Trees. PLoS ONE 8(4): e61183. doi:10.1371/journal.pone.0061183)

Abstract:

Subgraph matching algorithms are designed to find all instances of predefined subgraphs in a large graph or network and play an important role in the discovery and analysis of so-called network motifs, subgraph patterns which occur more often than expected by chance. We present the index-based subgraph matching algorithm (ISMA), a novel tree-based algorithm. ISMA realizes a speedup compared to existing algorithms by carefully selecting the order in which the nodes of a query subgraph are investigated. In order to achieve this, we developed a number of data structures and maximally exploited symmetry characteristics of the subgraph. We compared ISMA to a naive recursive tree-based algorithm and to a number of well-known subgraph matching algorithms. Our algorithm outperforms the other algorithms, especially on large networks and with large query subgraphs. An implementation of ISMA in Java is freely available at http://sourceforge.net/projects/isma/.

From the introduction:

Over the last decade, network theory has come to play a central role in our understanding of complex systems in fields as diverse as molecular biology, sociology, economics, the internet, and others [1]. The central question in all these fields is to understand behavior at the level of the whole system from the topology of interactions between its individual constituents. In this respect, the existence of network motifs, small subgraph patterns which occur more often in a network than expected by chance, has turned out to be one of the defining properties of real-world complex networks, in particular biological networks [2]. Network motifs act as the fundamental information processing units in cellular regulatory networks [3] and they form the building blocks of larger functional modules (also known as network communities) [4]–[6]. The discovery and analysis of network motifs crucially depends on the ability to enumerate all instances of a given query subgraph in a network or graph of interest, a classical problem in pattern recognition [7], that is known to be NP-complete [8].

Heavy sledding but important for exploration of large graphs/networks and the subsequent representation of those findings in a topic map.
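For orientation, here is a sketch of the kind of naive recursive enumeration that the abstract uses as a baseline. It is not ISMA and not the authors' Java code: the function name and the dict-of-sets graph representation are my assumptions. It tries query nodes in a fixed order and backtracks whenever a query edge has no counterpart in the graph.

```python
def enumerate_matches(query, graph):
    """Yield every injective mapping of query nodes to graph nodes such that
    each query edge maps onto a graph edge.

    Both arguments are undirected graphs given as {node: set_of_neighbours}.
    """
    order = list(query)

    def extend(mapping):
        if len(mapping) == len(order):
            yield dict(mapping)
            return
        query_node = order[len(mapping)]
        used = set(mapping.values())
        for candidate in graph:
            if candidate in used:
                continue
            # Every already-mapped neighbour must be adjacent to the candidate.
            if all(mapping[n] in graph[candidate]
                   for n in query[query_node] if n in mapping):
                mapping[query_node] = candidate
                yield from extend(mapping)
                del mapping[query_node]

    yield from extend({})


if __name__ == "__main__":
    triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
    # A square 1-2-3-4 with the diagonal 1-3 contains two triangles.
    graph = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 4}, 4: {1, 3}}
    matches = list(enumerate_matches(triangle, graph))
    print(len(matches), "mappings")
    print({frozenset(m.values()) for m in matches})
```

Note that the naive version reports each triangle once per symmetry of the query (six times here). Choosing a good node order and exploiting those symmetries is where the abstract says ISMA gets its speedup.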

Reloading my Beergraph – using an in-graph-alcohol-percentage-index

Another word for it | Mon, 05/20/2013 - 20:27

Categories:

Topic Maps

Reloading my Beergraph – using an in-graph-alcohol-percentage-index by Rik Van Bruggen.

From the post:

As you may remember, I created a little beer graph some time ago to experiment and have fun with beer, and graphs. And yes, I have been having LOTS of fun with it – using it to explain graph concepts to lots of not-so-technical folks, like myself. Many people liked it, and even more people had some questions about it – started thinking in graphs, basically. Which is way more than what I ever hoped for – so that’s great!

One of the questions that people always asked me was about the model. Why did I model things the way I did? Are there no other ways to model this domain? What would be the *best* way to model it? All of these questions have somewhat vague answers, because as a rule, there is no *one way* to model a graph. The data does not determine the model – it’s the QUERY that will drive the modelling decisions.

One of the things that spurred the discussion was – probably not coincidentally – the AlcoholPercentage. Many people were expecting that to be a *property* of the Beerbrand – but instead in my beergraph, I had “pulled it out”. The main reason at the time was more coincidence than anything else, but when you think of it – it’s actually a fantastic thing to “pull things out” and normalise the data model much further than you probably would in a relational model. By making the alcoholpercentage a node of its own, it allowed me to do more interesting queries and pathfinding operations – which led to interesting beer recommendations. Which is what this is all about, right?

(…)

When I read:

All of these questions have somewhat vague answers, because as a rule, there is no *one way* to model a graph. The data does not determine the model – it’s the QUERY that will drive the modelling decisions.

or

…but instead in my beergraph, I had “pulled it out”. The main reason at the time was more coincidence than anything else, but when you think of it – it’s actually a fantastic thing to “pull things out” and normalise the data model much further than you probably would in a relational model.

I don’t feel like I’ve been vague, ever.

Here is my summary of what Rik may have meant:

  • “no *one way* to model a graph” -> graphs support multiple models of data
  • “The data does not determine the model” -> may mean you can create any arbitrary model based on any data
  • “…the QUERY that will drive the modelling decisions.” -> in topic map terms, what gets represented by a topic (node in a graph) is what you want to talk about (query)
  • “…pulled it out…”/“…pull things out…” -> represent a subject with a node (graph) or topic (topic maps).
  • “…normalise the data model much further…” -> The distinction from database normalization isn’t clear; it may just be filler.
    • Clarity in writing reduces unnecessary vagueness.
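A small sketch may make the “pulled it out” point less vague. This is plain Python, not Rik's Neo4j model or queries, and the beer names and percentages are made up. Once the alcohol percentage is a node of its own, "beers like this one" becomes a two-hop walk through that shared node instead of a scan over a property.

```python
from collections import defaultdict


def build_graph(beers):
    """Build an undirected graph from (beer_name, alcohol_percentage) pairs,
    with the percentage modelled as its own node rather than a property."""
    graph = defaultdict(set)
    for name, percentage in beers:
        beer_node = ("Beer", name)
        percentage_node = ("AlcoholPercentage", percentage)
        graph[beer_node].add(percentage_node)
        graph[percentage_node].add(beer_node)
    return graph


def similar_beers(graph, name):
    """Beers reachable in two hops via a shared AlcoholPercentage node."""
    start = ("Beer", name)
    found = set()
    for percentage_node in graph.get(start, ()):
        for other in graph[percentage_node]:
            if other != start:
                found.add(other[1])
    return sorted(found)


if __name__ == "__main__":
    graph = build_graph([("Beer A", 6.5), ("Beer B", 6.5), ("Beer C", 8.0)])
    print(similar_beers(graph, "Beer A"))
    print(similar_beers(graph, "Beer C"))
```

In topic map terms, the percentage has become a subject you can talk about, with its own identity and connections, which is what makes the recommendation query possible.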

FuzzyLaw [FuzzyDBA, FuzzyRDF, FuzzySW?]

Another word for it | Mon, 05/20/2013 - 19:03

Categories:

Topic Maps

FuzzyLaw

From the webpage:

(…)

FuzzyLaw has gathered explanations of legal terms from members of the public in order to get a sense of what the ‘person on the street’ has in mind when they think of a legal term. By making lay-people’s explanations of legal terms available to interpreters, police and other legal professionals, we hope to stimulate debate and learning about word meaning, public understanding of law and the nature of explanation.

The explanations gathered in FuzzyLaw are unusual in that they are provided by members of the public. These people, all aged over 18, regard themselves as ‘native speakers’, ‘first language speakers’ and ‘mother tongue’ speakers of English and have lived in England and/or Wales for 10 years or more. We might therefore expect that they will understand English legal terminology as well as any member of the public might. No one who has contributed has ever worked in the criminal law system or as an interpreter or translator. They therefore bring no special expertise to the task of explanation, beyond whatever their daily life has provided.

We have gathered explanations for 37 words in total. You can see a sample of these explanations on FuzzyLaw. The sample of explanations is regularly updated. You can also read responses to the terms and the explanations from mainly interpreters, police officers and academics. You are warmly invited to add your own responses and join in the discussion of each and every word. Check back regularly to see how discussions develop and consider bookmarking the site for future visits. The site also contains commentaries on interesting phenomena which have emerged through the site. You can respond to the commentaries too on that page, contributing to the developing research project.

(…)

Have you ever wondered what the ‘person on the street’ thinks about relational databases, RDF or the Semantic Web?

Those are the folks who are being pushed content based on interpretations not of their own making.

Here’s a work experiment for you:

  1. Take ten search terms from your local query log.
  2. At each department staff meeting, distribute sheets with the words, requesting everyone to define the terms in their own words. No wrong answers.
  3. Tally up the definitions per department and across the company.
  4. Comments anyone?
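If you run the experiment, the tally in step 3 is a few lines of Python. The record layout (department, term, definition) and the sample answers below are assumptions for illustration. Definitions are only normalized for case and spacing, so differently worded answers still count as different.

```python
from collections import Counter, defaultdict


def tally(responses):
    """Count definitions per (department, term) and per term company-wide.

    `responses` is an iterable of (department, term, definition) tuples.
    """
    by_department = defaultdict(Counter)
    overall = defaultdict(Counter)
    for department, term, definition in responses:
        key = " ".join(definition.lower().split())
        by_department[(department, term)][key] += 1
        overall[term][key] += 1
    return by_department, overall


if __name__ == "__main__":
    responses = [
        ("Sales", "account", "A customer we bill"),
        ("Sales", "account", "a customer  we bill"),
        ("IT", "account", "A login on a system"),
    ]
    by_department, overall = tally(responses)
    print(overall["account"].most_common())
    print(dict(by_department[("IT", "account")]))
```

The interesting number is how many distinct definitions each term collects, within a department and across the company.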

I first saw this at: FuzzyLaw: Collection of lay citizens’ understandings of legal terminology.

GraphX: A Resilient Distributed Graph System on Spark

Another word for it | Mon, 05/20/2013 - 15:23

Categories:

Topic Maps

GraphX: A Resilient Distributed Graph System on Spark by Reynold Xin, Joseph Gonzalez, Michael Franklin, Ion Stoica.

Abstract:

From social networks to targeted advertising, big graphs capture the structure in data and are central to recent advances in machine learning and data mining. Unfortunately, directly applying existing data-parallel tools to graph computation tasks can be cumbersome and inefficient. The need for intuitive, scalable tools for graph computation has led to the development of new graph-parallel systems (e.g. Pregel, PowerGraph) which are designed to efficiently execute graph algorithms. Unfortunately, these new graph-parallel systems do not address the challenges of graph construction and transformation which are often just as problematic as the subsequent computation. Furthermore, existing graph-parallel systems provide limited fault-tolerance and support for interactive data mining.

We introduce GraphX, which combines the advantages of both data-parallel and graph-parallel systems by efficiently expressing graph computation within the Spark data-parallel framework. We leverage new ideas in distributed graph representation to efficiently distribute graphs as tabular data-structures. Similarly, we leverage advances in data-flow systems to exploit in-memory computation and fault-tolerance. We provide powerful new operations to simplify graph construction and transformation. Using these primitives we implement the PowerGraph and Pregel abstractions in less than 20 lines of code. Finally, by exploiting the Scala foundation of Spark, we enable users to interactively load, transform, and compute on massive graphs.

Of particular note is the use of an immutable graph as the core data structure for GraphX.

The authors report that GraphX performs less well than PowerGraph (GraphLab 2.1) but promise performance gains and offsetting gains in productivity.
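To give a feel for the two ideas noted above (a graph held as plain tables, and a Pregel-style computation over an immutable graph), here is a small sketch. It is written in Python rather than Scala, it is not the GraphX API, and the function name is my own. Each superstep reads the current labels and builds a new label table instead of updating the old one in place.

```python
def pregel_min_label(vertices, edges, max_supersteps=50):
    """Label each vertex with the smallest vertex id in its connected
    component, Pregel-style.

    `vertices` is a tuple of ids and `edges` a tuple of (src, dst) pairs,
    treated as undirected. Neither table is ever modified.
    """
    labels = {v: v for v in vertices}
    for _ in range(max_supersteps):
        # Message phase: each vertex offers its label to its neighbours.
        inbox = {}
        for src, dst in edges:
            for sender, receiver in ((src, dst), (dst, src)):
                message = labels[sender]
                if message < labels[receiver]:
                    inbox[receiver] = min(message, inbox.get(receiver, message))
        if not inbox:
            break  # No vertex changed, so the computation has converged.
        # Update phase: build a fresh label table from the old one.
        labels = {v: min(label, inbox.get(v, label))
                  for v, label in labels.items()}
    return labels


if __name__ == "__main__":
    vertices = (1, 2, 3, 4, 5)
    edges = ((1, 2), (2, 3), (4, 5))
    print(pregel_min_label(vertices, edges))
```

Because every superstep produces a new table from the old ones, a lost partition could in principle be recomputed from its inputs, which is the kind of fault-tolerance the abstract attributes to building on Spark.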

I didn’t find any additional resources at AMPLab on GraphX but did find:

Spark project homepage, and,

Screencasts on Spark

Both will benefit you when more information emerges on GraphX.
