RDF

Special Issue on Web Data Quality in IJSWIS

Planet RDF, Fri, 06/14/2013 - 13:55

Categories:

RDF

Call for papers
Special Issue on Web Data Quality
International Journal on Semantic Web and Information Systems

Scope:

The standardization and adoption of Semantic Web technologies have resulted in an unprecedented volume of data being published as Linked Open Data (LOD). The integration across this Web of Data, however, is hampered by the ‘publish first, refine later’ philosophy. This leads to various quality problems in the underlying data, such as incompleteness, inconsistency and incomprehensibility. These problems affect every application domain, be it scientific (e.g., life science, environment), governmental or industrial.

This Special Issue is addressed to those members of the community interested in providing novel methodologies or frameworks for assessing, monitoring, maintaining and improving the quality of the Web of Data, and in introducing tools and user interfaces that can effectively assist in the assessment. Such methodologies will not only help detect the inherent data quality problems currently plaguing the Web of Data, but also provide the means to fix these problems and maintain quality in the long run. Additionally, we seek articles that help identify the current impediments to building real-world LOD applications.

Topics:

  • Web data and LOD quality concepts
  • Data quality dimensions and metrics for Web data and LOD quality
  • Web data and LOD quality methodologies
  • Data quality assessment frameworks
  • Evaluation of quality and trustworthiness in the web of data
  • (Semi-)automatic assessment in the web of data
  • Large-scale quality assessment of structured datasets
  • Validation of currently existing data quality assessment methodologies
  • Use-case driven quality assessment
  • Quality assessment leveraging background knowledge
  • Co-reference detection and dataset reconciliation
  • Data quality methodologies for linked open data
  • Evaluating quality of ontologies
  • Web data and LOD quality tools
  • Design and implementation of data quality monitoring, assessment and improvement tools
  • Quality exploration and analysis interfaces
  • Scalability and performance of tools
  • Monitoring tools
  • Case studies on Web data and LOD quality assessment and improvement
  • Web data and LOD quality benchmarks
  • Issues in LOD
  • Methods to acquire most relevant LOD datasets
  • Generating meaningful associations across LOD datasets

structFieldStorage: A New Field Storage System for Drupal 7

Planet RDF, Tue, 06/11/2013 - 03:41

Categories:

RDF

Structured Dynamics has been working with Drupal for quite some time. This week marks our third anniversary of posting code to the contributed conStruct modules in Drupal. But, what I’m able to share today is our most exciting interaction with Drupal to date. In essence, we now can run Drupal directly from an RDF triplestore and take full advantage of our broader Open Semantic Framework (OSF) stack. Massively cool!

On a vanilla Drupal 7 instance, everything ends up being saved into Drupal’s default storage system. This blog post introduces a new way to save (local) Content Type entities: the structfieldstorage field storage system. This new field storage system lets Drupal administrators choose to save specific (or all) fields and their values into a remote structWSF instance. This option replaces Drupal’s default storage system (often MySQL) for the chosen content types and their fields.

By using this new field storage system, all of the local Drupal 7 content can be queried via any of structWSF’s web service endpoints (which include a SPARQL endpoint). In other words, all Drupal 7 content that uses this new storage system gets converted to RDF and indexed in a semantic web service framework.

Fields and Bundles

There are multiple core concepts in Drupal, two of which are Bundles and Fields. A Field is basically an attribute/value tuple that describes an entity. A Bundle is a set (an aggregation) of fields. The main topic of this blog post is a special feature of fields: their storage system.

In Drupal, each field instance has its own field storage system associated with it. A field storage system manages the field/value tuples of each entity defined in a Drupal instance. The default storage system of any field is field_sql_storage, which is normally backed by a MySQL server or database.

The field storage system allows a bundle to have multiple field instances, each of which may have a different field storage target. This means that the data that describes an entity can be saved in multiple different data stores. Though it may seem odd at first that such flexibility has merit, we will see that this design is quite clever, and probably essential.

A few other field storage systems have already been developed for Drupal 7. The most cited one is probably the MongoDB module, and there is also Riak. What I am discussing in this blog post is a new field storage system for Drupal 7 which uses structWSF as the data store. This new module is called the structFieldStorage module and it is part of conStruct.

Flexibility of the Field Storage API design

The design of having one field storage system per field is really flexible and probably essential. By default, all of the field widgets and all the modules have been created using the field_sql_storage system. This means that a few things here and there have been coded around the specifics of that field storage system. The result is that even though the Field Storage API has been designed and developed so that we can create new field storage systems, the reality is that once you do, multiple existing field widgets and modules can break under the new field storage system.

What the field storage system developer has to do is to test all the existing (core) field widgets and modules and make sure to handle all the specifics of these widgets and modules within the field storage system. If it cannot handle a specific widget or module, it should restrict their usage and warn the user.

However, there are situations where someone may require the use of a specific field widget that won’t work with that new field storage system. Because of the flexibility of the design, we can always substitute the field_sql_storage system for the given field dependent on that special widget. Under this circumstance, the values of that field widget would be saved in the field_sql_storage system (MySQL) while the other fields would save their value in a structWSF instance. Other circumstances may also warrant this flexibility.
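The per-field routing described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Drupal's Field Storage API: the `StorageBackend` and `Bundle` classes and their methods are hypothetical names invented for this example, and the two backends are plain in-memory dictionaries standing in for MySQL and a structWSF instance.

```python
from dataclasses import dataclass, field


@dataclass
class StorageBackend:
    """Hypothetical stand-in for a field storage system (e.g. SQL or structWSF)."""
    name: str
    data: dict = field(default_factory=dict)

    def write(self, entity_id, field_name, value):
        self.data[(entity_id, field_name)] = value

    def read(self, entity_id, field_name):
        return self.data.get((entity_id, field_name))


class Bundle:
    """A set of fields, each routed to its own storage backend."""

    def __init__(self, name, default_backend):
        self.name = name
        self.default_backend = default_backend
        self.field_backends = {}

    def add_field(self, field_name, backend=None):
        # A field falls back to the bundle default unless one is given explicitly.
        if backend is None:
            backend = self.default_backend
        self.field_backends[field_name] = backend

    def save(self, entity_id, values):
        # Each value goes to the backend configured for its field.
        for field_name, value in values.items():
            self.field_backends[field_name].write(entity_id, field_name, value)

    def load(self, entity_id):
        # The entity is reassembled from every backend involved.
        return {name: backend.read(entity_id, name)
                for name, backend in self.field_backends.items()}


if __name__ == "__main__":
    sql = StorageBackend("field_sql_storage")
    wsf = StorageBackend("structfieldstorage")
    article = Bundle("article", default_backend=wsf)
    article.add_field("title")
    article.add_field("tags", backend=sql)  # a widget that only works with SQL
    article.save(1, {"title": "Hello", "tags": "rdf"})
    print(article.load(1))
```

The point of the sketch is that the caller saves and loads a whole entity without knowing that one of its fields lives in a different store.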

structFieldStorage Architecture

Here is the general architecture of the structFieldStorage module. The following schema shows how the Drupal Field Storage API works, the flexibility that resides in the fields, and how multiple fields, all part of the same bundle, can use different storage systems:

By default, on a vanilla Drupal instance, all the fields use the field_sql_storage field storage system:

Here is what that same bundle looks like when all fields use the structfieldstorage field storage system:

Finally here is another schema that shows the interaction between Drupal’s core API, structFieldStorage and the structWSF web service endpoints:

Synchronization

Similar to the default MySQL field_sql_storage system, we have to take into account a few synchronization use cases when dealing with the structfieldstorage storage system for the Drupal content types.

Synchronization with structFieldStorage occurs when fields and field instances that use the structfieldstorage storage system get deleted from a bundle or when an RDF mapping changes. These situations don’t appear often once a portal is developed and properly configured. However, since things evolve all the time, the synchronization mechanism is always available to handle deleted content or changed schema.

The synchronization workflow answers the following questions:

  • What happens when a field gets deleted from a content type?
  • What happens when a field’s RDF mapping changes to a new property?
  • What happens when a bundle’s type RDF mapping changes to a new one?

Additionally, if new field instances are being created in a bundle, no synchronization of any kind is required. Since this is a new field, there is necessarily no data for this field in the OSF, so we just wait until people start using this new field to commit new data in the OSF.

The current synchronization heuristic proceeds in the following steps:

  1. Read the structfieldstorage_pending_opts_fields table and get all the un-executed synchronization change operations
    1. For each un-executed change:
      1. Get 20 records within the local content dataset from the Search endpoint. Filter the results to get all the entities that would be affected by the current change
        1. Do until the Search query returns 0 results
          1. For each record within that list
            1. Apply the current change to the entities
            2. Save the modified entities into the OSF using the CRUD: Update web service endpoint
      2. When the Search query returns 0 results, the change has been fully applied to the OSF. The state of this change record is then marked as executed.
  2. Read the structfieldstorage_pending_opts_bundles table and get all the un-executed synchronization change operations
    1. For each un-executed change:
      1. Get 20 records within the local content dataset from the Search endpoint. Filter the results to get only the ones that would be affected by the current change
        1. Do until the Search query returns 0 results
          1. For each record within that list
            1. Apply the current change to the entities
            2. Save the changed record into the OSF using the CRUD: Update web service endpoint
      2. When the Search query returns 0 results, the change has been fully applied to the OSF. The state of this change record is then marked as executed.
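The steps above can be sketched as a small Python loop. This is a minimal sketch under stated assumptions, not the module's actual code: `InMemoryOSF`, `rename_property` and `synchronize` are hypothetical names, the Search and CRUD: Update endpoints are replaced by an in-memory dictionary, and a pending change is modelled as a filter plus a function that applies it. The same loop would run once over the pending field changes and once over the pending bundle changes.

```python
BATCH_SIZE = 20  # the post fetches 20 records per Search query


class InMemoryOSF:
    """Hypothetical stand-in for the Search and CRUD: Update endpoints."""

    def __init__(self, records):
        self.records = {r["uri"]: dict(r) for r in records}

    def search(self, is_affected, limit=BATCH_SIZE):
        # Return copies of at most `limit` records still affected by the change.
        hits = [dict(r) for r in self.records.values() if is_affected(r)]
        return hits[:limit]

    def update(self, record):
        self.records[record["uri"]] = dict(record)


def rename_property(old, new):
    """Build a pending change that moves values from property `old` to `new`."""
    def is_affected(record):
        return old in record

    def apply(record):
        record[new] = record.pop(old)
        return record

    return {"is_affected": is_affected, "apply": apply, "executed": False}


def synchronize(pending_changes, osf):
    for change in pending_changes:
        if change["executed"]:
            continue
        while True:
            batch = osf.search(change["is_affected"])
            if not batch:
                break  # 0 results: the change is fully applied
            for record in batch:
                osf.update(change["apply"](record))
        change["executed"] = True


if __name__ == "__main__":
    osf = InMemoryOSF([{"uri": f"urn:rec:{i}", "dc:title": f"t{i}"} for i in range(45)])
    changes = [rename_property("dc:title", "rdfs:label")]
    synchronize(changes, osf)
    print(changes[0]["executed"], len(osf.records))
```

Note that the loop only terminates because applying a change makes the record stop matching the filter; a change that leaves records matching would need an explicit cursor or offset instead.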

The synchronization process is triggered by a Drupal cron job. Eventually this may become a setting that lets you choose between cron synchronization and triggering it by hand with some kind of button.

Compatibility

The structFieldStorage module is already compatible with multiple field widgets and external contributed Drupal 7 modules. However, because of Drupal’s nature, other field widgets and contributed modules that are not listed in this section may also work with this new field storage system, but they will need to be tested by the Drupal system administrator.

Field Widgets

Here is a list of all the core Field Widgets that are normally used by Drupal users. This list tells you which field widget is fully operational or disabled with the structfieldstorage field storage system.

Note that if a field is marked as disabled, it only means that it is not currently implemented for working with this new field storage system. It may be re-enabled in the future if it becomes required.

Field Type | Field Widget | Operational?
Text | Text Field | Fully operational
Text | Autocomplete for predefined suggestions | Fully operational
Text | Struct Lookup | Fully operational
Text | Struct Lookup with suggestion | Fully operational
Text | Autocomplete for existing field data | Disabled
Text | Autocomplete for existing field data and some node titles | Disabled
Term Reference | Autocomplete term widget (tagging) | Disabled
Term Reference | Select list | Disabled
Term Reference | Check boxes/radio buttons | Disabled
Long text and summary | Text area with a summary | Fully operational
Long text | Text area (multiple rows) | Fully operational
List (text) | Select list | Fully operational
List (text) | Check boxes/radio buttons | Fully operational
List (text) | Autocomplete for allowed values list | Disabled
List (integer) | Select list | Fully operational
List (integer) | Check boxes/radio buttons | Fully operational
List (integer) | Autocomplete for allowed values list | Disabled
List (float) | Select list | Fully operational
List (float) | Check boxes/radio buttons | Fully operational
List (float) | Autocomplete for allowed values list | Disabled
Link | Link | Fully operational
Integer | Text field | Fully operational
Float | Text field | Fully operational
Image | Image | Fully operational
File | File | Fully operational
Entity Reference | Select list | Fully operational
Entity Reference | Check boxes/radio buttons | Fully operational
Entity Reference | Autocomplete | Fully operational
Entity Reference | Autocomplete (Tags style) | Fully operational
Decimal | Text field | Fully operational
Date (Unix timestamp) | Text field | Fully operational
Date (Unix timestamp) | Select list | Fully operational
Date (Unix timestamp) | Pop-up calendar | Fully operational
Date (ISO format) | Text field | Fully operational
Date (ISO format) | Select list | Fully operational
Date (ISO format) | Pop-up calendar | Fully operational
Date | Text field | Fully operational
Date | Select list | Fully operational
Date | Pop-up calendar | Fully operational
Boolean | Check boxes/radio buttons | Fully operational
Boolean | Single on/off checkbox | Fully operational

Core & Popular Modules

Revisioning

The Revisioning module is fully operational with the structfieldstorage field storage system. All the operations exposed in the UI have been handled and implemented in the hook_revisionapi() hook.

Diff

The Diff module is fully operational. Since it compares entity class instances, there is no additional Diff API implementation to do. Each time revisions get compared, structfieldstorage_field_storage_load() gets called to load the specific entity instances. The comparison is then done on these entity descriptions.

Taxonomy

The Taxonomy module is not currently supported by the structfieldstorage field storage system. The reason is that the Taxonomy module relies on the design of the field_sql_storage field storage system and has been tailored to that specific field storage system. In some places it can be used, such as with the entity reference field widget, but its core functionality, the term reference field widget, is currently disabled.

Views

structViews is a Views query plugin for querying an OSF backend. It interfaces with the Views 3 UI and generates OSF Search queries for searching and filtering all the content it contains. Views 3 is intimately tied to the field_sql_storage field storage system, which means that Views 3 itself cannot use the structfieldstorage storage system off the shelf. However, Views 3 has been designed so that a new Views querying engine can be implemented and used with the Views 3 user interface, much as the Field Storage API allows new storage systems. This is exactly what structViews is, and this is how we can use Views on all the fields that use the structfieldstorage field storage system.

This is not different than what is required for the mongodb Drupal module. The mongodb Field Storage API implementation is not working with the default Views 3 functionality either, as shown by this old, and very minimal, mongodb Views 3 integration module.

structViews is already working because all of the information defined in fields that use the structfieldstorage storage system is indexed into the OSF. What structViews does is simply expose this OSF information via the Views 3 user interface. All the fields that define the local content can be added to a structViews view, all the fields can participate in filter criteria, etc.

What our design means is that the structFieldStorage module doesn’t break the Views 3 module, because structViews takes care of exposing that entity storage system to Views 3 via the re-implemented API.

efq_views

efq_views is another contributed module that exposes the EntityFieldQuery API to Views 3. What that means is that all of the Field Storage Systems that implement the EntityFieldQuery API should be able to interface with Views 3 via this efq_views Views 3 querying engine.

Right now, the structFieldStorage module does not implement the EntityFieldQuery API. However, it could do so by implementing the hook_field_storage_query() hook. (This was not required by our current client.)

A Better Revisioning System

There is a problem with the core functionality of Drupal’s current revisioning system. The problem is that if a field or a field instance gets deleted from a bundle, then all of the values of those fields, within all of the revisions of the entities that use this bundle, get deleted at the same time.

This means that there is no way to delete a field without deleting the values of that field in existing entity revisions. This is a big issue since there is no way to keep that information, at least for archiving purposes. It probably works that way because core Drupal developers didn’t want to break the feature that enables people to revert an entity to one of its past revisions. Reverting would have meant that data for fields that no longer existed would have to be re-created (creating its own set of issues).

However, for all the fields that use the structfieldstorage field storage system, this issue does not exist. Even if fields or field instances are deleted, all the past information about these fields remains in the revisions of the entities.

Conclusion

This blog post exposes the internal mechanism of this new structfieldstorage backend to Drupal. The next blog post will focus on the user interface of this new module. It will explain how it can be configured and used. And it will explain the different Drupal backend user interface changes that are needed to expose the new functionality related to this new module.

Leaving Zite

Planet RDF, Fri, 06/07/2013 - 15:54

Categories:

RDF

Today is my last day working at Zite in San Francisco. The team of engineers is dedicated and I’m sure they’ll continue to innovate and improve – I wish them well.

For myself, although it has been good working in the online news world at Yahoo! – aggregated and created; Digg – social (not the current non-social flavour) and Zite – personalized, it’s time for a new direction for me.

I am taking a couple of weeks off before I head to my next role which is quite Open and Cloudy. That’s a hint.

The LOD cloud is dead, long live the trusted LOD cloud

Planet RDF, Fri, 06/07/2013 - 12:46

Categories:

RDF

The ongoing debate around the question whether ‘there is money in linked data or not’ has recently been formulated more pointedly by Prateek Jain (one of the authors of the original article). He is asking, ‘why linked open data hasn’t been used that much so far besides for research projects?’

I believe there are two reasons (amongst others) for the low uptake of LOD in non-academic settings which haven’t been discussed in detail until today:

1. The LOD cloud covers mainly ‘general knowledge’ in contrast to ‘domain knowledge’

Since most organizations live on their internal knowledge which they combine intelligently with very specific (and most often publicly available) knowledge (and data), they would benefit from LOD only if certain domains were covered. A frequently quoted ‘best practice’ for LOD is that portion of data sets which is available at Bio2RDF. This part of the LOD cloud has been used again and again by the life sciences industry due to its specific information and its highly active maintainers.

We need more ‘micro LOD clouds’ like this.

Other examples are the German Library Linked Open Data Cloud (thanks to Adrian Pohl for this pointer!) and the Clean Energy Linked Open Data Cloud.

I believe that the first generation of the LOD cloud has done a great job. It has visualised the general principles of linked data and was able to communicate the idea behind it. It even helped, at least in its very first versions, to identify possibly interesting data sets. And most of all: it showed how fast the cloud was growing and attracted a lot of attention.

But now it’s time to clean up:

A first step should be to make a clear distinction between the section of the LOD cloud which is open and which is not. Datasets without licenses should be marked explicitly, because those are the ones which are most problematic for commercial use, not the ones which are not open.

A second improvement could be made by making some quality criteria clearly visible. I believe that the most important one is about maintenance and authorship: Who takes responsibility for the quality and trustworthiness of the data? Who exactly is the maintainer?

This brings me to the second and most important reason for the low uptake of LOD in commercial applications:

2. Most datasets of the LOD cloud are maintained by a single person or by nobody at all (at least as stated on datahub.io)

Would you integrate a web service which is provided by a single, maybe private person into a (core-)application of your company? Wouldn’t you prefer to work with data and services provided by a legal entity which has a high reputation, at least in its own knowledge domain? We all know: data has very little value if it’s not maintained in a professional manner. An example of a ‘good practice’ is the integrated authority file provided by the German National Library. I think this is a trustworthy source, isn’t it? And we can expect that it will be maintained in the future.

It is not only the data which is linked in a LOD cloud; most of all it is the people and organizations ‘behind the datasets’ that will be linked and will co-operate and communicate based on their datasets. On top of their joint data infrastructure they will create efficient collaboration platforms, like the one in the area of clean energy, the ‘Trusted Clean Energy LOD Cloud’.

REEEP and its reegle-LD platform has become a central hub in the clean energy community. Not only data-wise but also as an important cooperation partner in a network of NGOs and other types of stakeholders which promote clean energy globally.

Linked Data has become the basis for more effective communication in that sector.

To sum up: to publish LOD that is of interest beyond research projects, datasets should be specific and trustworthy (another example is the German labor law thesaurus by Wolters Kluwer). I am not saying that datasets like DBpedia are dispensable. They serve as important hubs in the LOD cloud, but for non-academic projects based on LOD we need an additional layer of linked open datasets, the Trusted LOD cloud.

 

conStruct for Drupal 7

Planet RDF, Thu, 06/06/2013 - 19:43

Categories:

RDF

For more than a year we have been developing a completely new version of conStruct for Drupal 7 for one of our clients.

conStruct for Drupal 6 was really decoupled from Drupal and all the other contributed modules; in a word, it was not playing nice with Drupal. The goal of this new version has been to change that situation. The focus of this completely new conStruct module has been to create a series of connector modules that bridge most of Drupal’s core functionalities with remote structWSF instances.

We wanted to make sure that Drupal developers could manipulate content, within Drupal, that is hosted in structWSF instance(s). The best way to start aiming for that goal was to make sure that all of the core Drupal APIs commonly used by Drupal developers could be used to manipulate structWSF data as if it were native to Drupal. This is what these connectors are about.

The development of conStruct for Drupal 7 is not finished, but it is available in the Git repository. Refactoring and improvements are still required, mainly to make it easier to use and understand, but all of the code is working properly and is already used on production sites.

conStruct As a Large Scale Drupal Implementation

Those who follow the evolution of conStruct know that conStruct’s main goal is to use Drupal as a user interface for structWSF for administrative purposes, or for creating complete portals like the NOW portal. However, in our initial versions, Structured Dynamics’ purpose was to not tightly integrate with Drupal. Over time, though, we have seen broad acceptance for the Drupal front end and Drupal itself is evolving in ways compatible with semantic technologies.

What is changing with conStruct for Drupal 7, with all these connectors, is that we are now using conStruct to bridge Drupal with structWSF server instances. We supercharge Drupal 7’s capabilities with structWSF. Our evolution to a tighter Drupal coupling means the ability to manage, query, search and data mine millions of entities; to have vocabularies of tens of thousands of concepts; and to enable the querying of all of these entities and their content from any kind of device or system via a family of web service endpoints.

This is the initial version of what is (or should be) Drupal LSD for Structured Dynamics: A semantic web service framework backend system for Drupal.

conStruct’s Drupal Connectors

Here is the initial list of the connectors that exist:

  • structFieldStorage: this module creates a new structfieldstorage field storage system that can be used by Drupal fields to save the fields’ data into a remote structWSF instance. This is used to enable the Content Type entities to be saved into a structWSF instance. It is an extension of the Drupal field storage system
  • structEntities: this module creates a new Entity Type called the Resource Type that is used to see all the structWSF indexed records as native Entities in Drupal. This means that the Entity API can be used to manipulate any content in structWSF
  • structViews: this module creates a new data source for Views 3. This means that the Views 3 user interface is used to generate structWSF Search endpoint queries instead of SQL queries
  • structSearchAPI: this module exposes new search indexes to the Search API. This means that the Search API can be used to query a structWSF instance.

I will write about all these connectors individually in upcoming blog posts. I will cover their design, architecture and usage.

 

ESWC 2013 Panel - Semantic Technologies for Big Data Analytics: Opportunities and Challenges

Planet RDF, Tue, 06/04/2013 - 14:04

Categories:

RDF

I was invited to the ESWC 2013 "Semantic Technologies for Big Data Analytics: Opportunities and Challenges" panel on 29th May 2013 in Montpellier, France. The panel was moderated by Marko Grobelnik (JSI), with panelists Enrico Motta (KMi), Manfred Hauswirth (NUIG), David Karger (MIT), John Davies (British Telecom), José Manuel Gómez Pérez (ISOCO) and Orri Erling (myself).

Marko opened the panel by looking at the Google Trends search statistics for big data, semantics, business intelligence, data mining, and other such terms. Big data keeps climbing its hype-cycle hill, now above semantics and most of the other terms. But what do these in fact mean? In the leading books about big data, the word semantics does not occur.

I will first recap my 5 minute intro, and then summarize some questions and answers. This is from memory and is in no sense a full transcript.

Presentation

Over the years we have maintained that what the RDF community most needs is a good database. Indeed, RDF is relational in essence and, while it requires some new datatypes and other adaptations, there is nothing in it that is fundamentally foreign to RDBMS technology.

This spring, we came through on the promise, delivering Virtuoso 7, packed full of all the state-of-the-art tricks in analytics-oriented databasing, column-wise compressed storage, vectored execution, great parallelism, and flexible scale-out.

At this same ESWC, Benedikt Kaempgen and Andreas Harth presented a paper (No Size Fits All -- Running the Star Schema Benchmark with SPARQL and RDF Aggregate Views) comparing Virtuoso and MySQL on the star schema benchmark at 1G scale. We redid their experiments with Virtuoso 7 at 30x and at 300x the scale.

At present, when running the star schema benchmark in SQL, we outperform column-store pioneer MonetDB by a factor of 2. When running the same star schema benchmark in SPARQL against triples as opposed to tables, we see a slowdown of 5x. When scaling from 30 to 300G and from one to two machines, we get a linear increase in throughput: 5x longer for 10x more data.

Coming back to MySQL, the run with 1G takes about 60 seconds. Virtuoso SPARQL does the same on 30x the data in 45 seconds. Well, you could say that we should pick on somebody in our own league and not MySQL, which is not really relevant here. Comparing with MonetDB and other analytics column stores is of course more relevant.
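As a quick check of the arithmetic behind these figures, here is a small Python sketch. It takes the quoted numbers at face value and assumes nothing about hardware or configuration; the function names are made up for this illustration.

```python
def relative_throughput(data_a, secs_a, data_b, secs_b):
    """How many times more data per second run B processes than run A."""
    return (data_b * secs_a) / (data_a * secs_b)


def per_machine_scaling(data_factor, time_factor, machine_factor):
    """Throughput gain per machine; 1.0 means perfectly linear scale-out."""
    return (data_factor / time_factor) / machine_factor


if __name__ == "__main__":
    # MySQL: 1G in about 60 s. Virtuoso SPARQL: 30x the data in 45 s.
    print(relative_throughput(1, 60, 30, 45))
    # 10x the data on 2x the machines, taking 5x longer.
    print(per_machine_scaling(10, 5, 2))
```

With these inputs the first ratio comes to 40 and the second to 1.0, which is what "linear increase in throughput" amounts to: doubling the machines doubles the data processed per second.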

For cluster scaling, one could say that star schema benchmark is easy, and so it is, but even with harder ones, which do joins across partitions all the time, like the BSBM BI workload, we get scaling that is close to linear.

So, for analytics, you can use SPARQL in Virtuoso, and run circles around some common SQL databases.

The difference between SQL and SPARQL comes from having no schema. Instead of scanning aligned columns in a table, you do an index lookup for each column. This is not too slow if there is locality, as there is, but it is still a lot more work than scanning a multicolumn column-compressed table. With more execution tricks, we can maybe cut this to 3x.
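To make the contrast concrete, here is a toy Python sketch. It is not how Virtuoso or any real engine is implemented; the data, the `price` and `qty` names and both functions are hypothetical. It only shows that the table version walks two aligned columns in lockstep, while the triple version needs one extra lookup per row for each additional column.

```python
def revenue_from_table(columns):
    """Aligned columns: row i of each column belongs to the same record."""
    return sum(p * q for p, q in zip(columns["price"], columns["qty"]))


def revenue_from_triples(triples):
    """Schema-less triples: build one index per predicate, then look up by subject."""
    index = {}
    for subject, predicate, obj in triples:
        index.setdefault(predicate, {})[subject] = obj
    total = 0
    for subject, price in index["price"].items():
        qty = index["qty"].get(subject)  # one lookup per row for this column
        if qty is not None:
            total += price * qty
    return total


if __name__ == "__main__":
    table = {"price": [10, 20, 30], "qty": [1, 2, 3]}
    triples = [
        ("r1", "price", 10), ("r1", "qty", 1),
        ("r2", "price", 20), ("r2", "qty", 2),
        ("r3", "price", 30), ("r3", "qty", 3),
    ]
    print(revenue_from_table(table), revenue_from_triples(triples))
```

Both functions compute the same total; the difference is in the access pattern, which is where the slowdown quoted above comes from.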

The beach-head of workable RDF-based analytics on schema-less data has been attained. Medium-scale data, to the single-digit terabytes, is OK on small clusters.

What about the future?

First, Big Data means more than querying. Before meaningful analytics can be done, the data must generally be prepared and massaged. This means fast bulk load and fast database-resident transformation. We have that via flexible, expressive, parallelizable stored procedures and run time hosting. One can do everything one does in MapReduce right inside the database.

Some analytics cannot be expressed in a query language. For example, graph algorithms like clustering generate large intermediate states and run in many passes. For this, bulk synchronous processing frameworks like Giraph are becoming popular. We can again do this right inside the DBMS, on RDF or SQL tables. There is great platform utilization and more flexibility than in strict BSP, while being able to do any BSP algorithm.

The history of technology is one of divergence followed by reintegration. New trends, like Column stores, RDF databases, key value stores, or MapReduce, start as one-off special-purpose products, and the technologies then find their way back into platforms addressing a broader functionality.

The whole semantic experiment might be seen as a break-away from the web, if also a little from database, for the specific purpose of exploring schemaless-ness, universal referenceability of data, self-describing data, and some inference.

With RDF, we see lasting value in globally consistent identifiers. The URI "superkey" is the ultimate silo-breaker. The future is in integrating more and more varied data and a schema-first approach is cost-prohibitive. If data is to be preserved over extended lengths of time, self-description is essential; the applications and people that produced the data might not be around. Same for publishing data for outside reuse.

In fact, many of these things are right now being pursued in mainstream IT. Everybody is reinventing the triple, whether by using non-first normal form key-value pairs in an RDB, tagging each row of a table with the name of the table, using XML documents, etc. The RDF model provides all these desirable features, but most applications that need these things do not run on RDF infrastructure.
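As an illustration of what "reinventing the triple" looks like, the following Python sketch turns a table row into subject/predicate/object tuples, tagging the row with the name of its table. The `row_to_triples` function, the `http://example.org/` base and the way identifiers are built are assumptions made for this example, not a standard mapping.

```python
def row_to_triples(table, key_column, row, base="http://example.org/"):
    """Turn one table row into (subject, predicate, object) tuples."""
    subject = f"{base}{table}/{row[key_column]}"
    # Tagging the row with its table name plays the role of a type statement.
    triples = [(subject, "rdf:type", f"{base}{table}")]
    for column, value in row.items():
        if column == key_column or value is None:
            continue  # the key is already in the subject; NULLs yield no triple
        triples.append((subject, f"{base}{table}#{column}", value))
    return triples


if __name__ == "__main__":
    for triple in row_to_triples("customer", "id", {"id": 7, "name": "Ann", "city": None}):
        print(triple)
```

Each non-null cell becomes one self-describing statement, which is what key-value pairs or table-tagged rows in an RDB approximate by other means.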

Anyway, by revolutionizing RDF store performance, we make this technology a cost-effective alternative in places where it was not such before.

To get much further in performance, physical storage needs to adapt to the data. Thus, in the long term, we see RDF as a lingua franca of data interchange and publishing, supported by highly scalable and adaptive databases that exploit the structure implicit in the data to deliver performance equal to the best in SQL data warehousing. When we get the schema from the data, we have schema-last flexibility and schema-first performance. The genie is back in the bottle, and data models are unified.

Questions and Answers

Q: Is the web big data?

David Karger: No, the shallow web (i.e., static web pages for purposes of search) is not big data. One can put it in a box and search. But for purposes of more complex processing, like analytics on the structure of the whole web, this is still big data.

Q: I bet you still can't do analytics on a fast stream.

Orri Erling: I am not sure about that, because when you have a stream -- whether this is network management and denial of service detection, or managing traffic in a city -- you know ahead of time what peak volume you are looking at, so you can size the system accordingly. And streams have a schema. So you can play all the database tricks. Vectored execution will work there just as it does for query processing, for example.

Q: I did not mean storage, I meant analysis.

Orri Erling: Here we mean sliding windows and constant queries. The triple vs. row issue also seems the same. There will be some overhead from schema-lastness, but for streams, I would say each has a regular structure.
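A sliding window with a constant query can be sketched in a few lines. The example below keeps a running average over the last N seconds of a stream; the class name and the choice of an average are assumptions made purely for illustration.

```python
from collections import deque

class SlidingWindowAverage:
    """Running average over readings from the last `width_seconds` seconds."""

    def __init__(self, width_seconds):
        self.width = width_seconds
        self.window = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def add(self, timestamp, value):
        """Add one reading and return the current window average."""
        self.window.append((timestamp, value))
        self.total += value
        # Evict readings that have fallen out of the window.
        while self.window[0][0] <= timestamp - self.width:
            _, old_value = self.window.popleft()
            self.total -= old_value
        return self.total / len(self.window)

avg = SlidingWindowAverage(width_seconds=10)
results = [avg.add(0, 10), avg.add(5, 20), avg.add(12, 30)]
```

Because the query is fixed and only the window contents change, the per-reading cost stays small, which is what makes the usual database tricks applicable to streams.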

John Davies: For example, we gather gigabytes a minute of traffic data from sensors in the road network and all this data is very regular, with a fixed schema.

Manfred Hauswirth: But is this always so? The internet of things has potentially huge diversity in schema, with everything producing a stream. The user of the stream has no control whatsoever over the schema.

Marko Grobelnik: Yes, we have had streams for a long time -- on Wall Street, for example, where these make a lot of money. But high frequency trading is a very specific application. There is a stream, some analytics, not very complicated, just fast. This is one specific solution, with fixed schema and very specific scope, no explicit semantics.

Q: What is big data, in fact?

David Karger: Computer science has always been about big data; it is just the definition of big that changes. Big data is something one cannot conveniently process on a computer system. Not without unusual tricks, where something trivial, like shortest path, becomes difficult just because of volume. So it is that big data is very much about performance, and performance is usually obtained by sacrificing the general for the specific. The semantic world on the other hand is after something very general and about complex and expressive schema. When data gets big, the schema is vanishingly small in comparison with the data, and the schema work gets done by hand; the schema is not the problem there. Big data is not very internetty either, because the 40 TB produced by the telescope are centrally stored and you do not download them or otherwise transport them very much.

Q: Now, what does each of you understand by semantics?

Manfred Hauswirth: The essential aspect is that data is machine interpretable, with sufficient machine readable context.

David Karger: Semantics has to do with complexity or heterogeneity in the schema. Big data has to do with large volume. Maybe semantic big data would be all the databases in the world with a million different schemas. But today we do not see such applications. If the volume is high, the schema is usually not very large.

Manfred Hauswirth: That is not so far off. For example, a telco has over a hundred distinct network management systems and each has a different schema.

Orri Erling: From the data angle, we have come to associate semantic with

  • schema-lastness
  • globally-resolvable identifiers
  • self-description

When people use RDF as a storage model, they mostly do so because of schema flexibility, not because of expressive schemas or inference. Some use a little inference, but inference or logics or considerations of knowledge representation do not in our experience drive the choice.

Conclusion

In conclusion, the event was rather peaceful, with a good deal of agreement between the panelists and audience and no heated controversy. I hoped to get some reaction when I said that semantics was schema flexibility, but apparently this has become a politically acceptable stance. In the golden days of AI this would not have been so. But then Marko Grobelnik did point out that the whole landscape has become data driven. Even in fields like natural language, one looks more at statistics than deep structure: For example, if a phrase is often found on Google, it is proper usage.

There’s Money in Linked Data

Planet RDFTue, 06/04/2013 - 09:58

Categories:

RDF

I believe that the ongoing debate whether there ‘is money in linked (open) data or not’ is a bit misleading. ‘Linked (open) data’ is not only the data itself. It’s much more, even more than yet another technology stack. Linked data is most of all a set of principles for organizing information in agile organizations that are embedded in fast moving and dynamic environments. And from this perspective there is a huge amount of money in it – but let me refine that a bit later.

Crying out loud in 2013 that ‘there is no money in linked data’ is an important step in the right direction because it points out that data publishers should be more precise with data licensing. Quite flexible licensing models already exist, but people (and probably other legal entities) forget to publish their data together with statements about its ‘openness’. As a result, the data remains closed to commercial users. This was not properly noticed in the early days of the linked open data cloud, since commercial users were not around at all (in contrast to academic institutions, which considered the LOD cloud to be a wonderful playground). It’s the same thing with linked data as a technology and linked data as a set of standards: the standards and the technology stack are mature now (just think about Virtuoso’s brilliant SPARQL performance, for example), but most people in IT still don’t have things like URIs, RDF and SPARQL at the top of their minds when they look for powerful data integration methodologies.

Why is that?

I believe that so far ‘linked data’ has been perceived by people outside the linked data core community only as a new way to organize data on the web, and thus as a technology that is still not mature enough for enterprises.

But the truth is that linked data has at least a threefold nature. Linked data is

  1. a method to organize information in general, not only on the web but also in enterprises
  2. a set of standards which is flexible and expressive enough to link data across boundaries (organizational, political, philosophical), cultures and languages
  3. a way of using IT and information in a quite intuitive way, very close to the patterns by which human beings tend to create realities, and thus comprehensible also for non-techies.

I think that technologists have done a brilliant job so far in creating the linked data technology stack, its underlying standards, triple-stores and quad-stores, reasoners etc., and for specialists it’s absolutely clear why these kinds of technologies will outperform traditional databases, BI-tools, search engines etc. by far.

But: the crucial point now is that enterprises have to adopt linked data technologies inside their corporate boundaries (and not only for SEO purposes or the like). The key question is not whether there is enough LOD out there for app-makers or not. High-quality LOD will be produced very quickly as soon as there are commercial consumers like large enterprises. I am not talking about use cases for linked data in the fields of data publishing or SEO.

The main driver for the further Linked Data development will be enterprises which embrace LD technologies for their internal information management.

It’s true that there are already some large companies (like Daimler - meet them at this year’s I-SEMANTICS in Graz!) dealing with that question but to be honest: there is not the same hype around ‘linked data’ as we can see with ‘big data’. IBM, Microsoft & Co. are not that interested in linked data of course because it is a platform by itself and doesn’t foresee any kind of lucrative lock-in effects. Internet companies like Google and Facebook make use of linked data quite hesitantly. Although Facebook’s Graph Search or Google’s Knowledge Graph contain large portions of this kind of technology, Google would never say ‘oh, we are a semantic web company now, we make heavy use of linked data, and of course we will also contribute to the LOD cloud.’

Why is that? Simply put, because through the lens of Google, Facebook & Co. the internet is a huge machine which produces data for them. Not the other way around.

But shouldn’t the enterprise customers themselves be interested in a cost-effective way of information management? They are, but as stated before, they haven’t perceived linked data as such, although it clearly is.

To develop technologies, we need critical questions, and of course the most critical ones always come from the inside of a community or movement. But the time has come to spread the good news to the ‘outside’.

  • Yes, databases which rely on linked data standards have become mature and perform well enough on many query types that they outperform even ‘traditional’ relational databases
  • Yes, also issues which are critical for enterprise usage like privacy and security have been solved by most linked data technology vendors
  • Yes, there is a critical mass of available LOD sources (for example UK Ordnance Survey) and also of high-quality thesauri and ontologies (for example Wolters Kluwer’s working law thesaurus) to be reused in corporate settings
  • Yes, there is a volume of developers and consultants on the labor market (in the U.S. as well as in the E.U.) which is big enough to execute large linked data projects
  • Yes, there are tons of business cases that can benefit from linked data. Linked data and semantic web technologies should be considered as core technologies for any information architecture, at least in larger corporations
  • Yes, SPARQL Query Language is not only a second SQL but comes with some brilliant features like transitive queries which help to save a lot of time when developing applications like business intelligence reporting and analysis
  • Yes, Linked Data has the potential to become the basis for a large variety of tools which help decision-makers (not only in enterprises but also in politics) to become true ‘digerati’ instead of being degraded to masters of the ‘bullshit bingo’.

Yes, this list can be further extended and it is a core element for the further expansion of the LOD cloud. It’s the enterprises that will drive the next level of maturity of the linked data landscape. Because at the end of the day it’s only them who will pay or have already paid the bill for open (government) data.
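As a small illustration of the transitive queries mentioned in the list above: in SPARQL 1.1 such a query is written with a property path, for example `?part ex:partOf+ ?whole` (where `ex:partOf` is a made-up property). What that path computes can be sketched in plain Python over a list of triples; the function name and data are hypothetical.

```python
def transitive_objects(triples, start, predicate):
    """Return everything reachable from `start` via one or more `predicate` links."""
    reached = set()
    frontier = [start]
    while frontier:
        node = frontier.pop()
        for s, p, o in triples:
            if s == node and p == predicate and o not in reached:
                reached.add(o)
                frontier.append(o)
    return reached

parts = [
    ("bolt", "partOf", "wheel"),
    ("wheel", "partOf", "car"),
    ("car", "partOf", "fleet"),
    ("seat", "partOf", "car"),
]
wholes = transitive_objects(parts, "bolt", "partOf")
```

In SQL the same question typically needs a recursive query or application-side loops, which is where the time saving in reporting and analysis applications comes from.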

Schema.org and JSON-LD

Planet RDFTue, 06/04/2013 - 05:41

Categories:

RDF
We'd like to take a minute to share our enthusiasm for some recent work at W3C: JSON-LD.

Schema.org is all about shared vocabulary - it helps integrate data across applications, Web sites and data formats. We are adding JSON-LD to the list of formats we recommend for use with schema.org, alongside Microdata and RDFa - each has strengths and weaknesses for different usage scenarios.

In HTML, schema.org descriptions can be written using markup attributes (i.e. RDFa and Microdata). However, there are often cases when data is exchanged in pure JSON or as JSON within HTML. W3C's work on JSON-LD provides mechanisms for interpreting structured data in JSON that promote interoperability with other data formats. We believe it provides value for developers and publishers, and improves the flow of information between JSON and other environments.
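For illustration, here is a minimal sketch of what a schema.org description in JSON-LD might look like when produced from Python and embedded as JSON within HTML. The `@context` and `@type` keywords come from JSON-LD and `Person`, `name`, `jobTitle` and `url` are schema.org terms; the helper names and the sample values are made up, and the exact conventions for schema.org terms in JSON-LD may differ from this sketch.

```python
import json

def person_jsonld(name, job_title, url):
    """Build a schema.org Person description as a JSON-LD document."""
    return {
        "@context": "https://schema.org",
        "@type": "Person",
        "name": name,
        "jobTitle": job_title,
        "url": url,
    }

def as_script_tag(doc):
    """Serialize the document for embedding as JSON within an HTML page."""
    # Escape "</" so the payload cannot close the script element early.
    payload = json.dumps(doc).replace("</", "<\\/")
    return '<script type="application/ld+json">' + payload + "</script>"

doc = person_jsonld("Jane Doe", "Librarian", "http://example.org/jane")
tag = as_script_tag(doc)
```

The same dictionary can also be sent as pure JSON from an API, which is the case that markup attributes in HTML do not cover.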

There are some technical details to work through on how exactly schema.org terms are defined for JSON-LD usage, but it is already clear that JSON-LD is a useful contribution to structured data sharing in the Web. Many thanks to the hardworking W3C community for creating the specification.

AKSW wins best paper award at ESWC 2013

Planet RDFMon, 06/03/2013 - 10:57

Categories:

RDF

Our paper “When to Reach for the Cloud” was awarded the best paper award at ESWC. The idea behind the paper was to provide implementations of the HR3 algorithm (the first reduction-ratio-optimal algorithm for link discovery) for parallel hardware and to devise suggestions for when to use which hardware when computing links between very large data sets. With this work, we aim to make link discovery amenable to Big Data. Check out the paper here

Link on,

Axel on behalf of AKSW

Read Write Web — Monthly Open Thread — (May 2013)

Planet RDFFri, 05/31/2013 - 14:47

Categories:

RDF
Summary

WWW 2013 took place this year in Rio de Janeiro, Brazil. There was a packed program, including an interesting workshop entitled “Linked Data on the Web”, four papers of which were dedicated to the Read Write Web.

The big news in linked data is that Gmail has started to add JSON-LD to its popular email service. This allows developers to embed structured data into an email, in the form of Reviews, RSVPs, Interactive actions and Flight cards. Response to this move has been generally positive, with perhaps room for a couple of minor tweaks to the markup.

The following papers were presented at the Read Write Web session in Rio: R&Wbase: git for triples, OSLC Resource Shape: A language for defining constraints on Linked Data, Hydra: A Vocabulary for Hypermedia-Driven Web APIs, Reasoning over SPARQL. The website w3id.org was also released, which promises to be a permanent home for COOL URIs.

Communications and Outreach

The RWW group welcomes new members. In particular, we had a great introduction from read write web veteran Henri Bergius. Henri has been working on read write topics for a number of years, notably Midgard in the 1990s and, more recently, the impressive create.js. If you’re unfamiliar with Henri’s work you may enjoy this video that goes through many core concepts.

Community Group

There has been some discussion on the mailing list, and also with the semantic web group and some IETF folks, as to the best way to use HTTP to identify a user to a server. This would enable a user to identify itself to a server without having to rely on the subjectAltName field in a client-side TLS certificate, or other methods. The initial thought was to reuse the “From” header, but this seems tightly bound to email. Current thinking is that we draft text for a new header, then find a name for it.

Applications

Our co-chair, Andrei Sambra, met the developers of the Cozy Cloud project in Paris. There’s hope that this system can be combined with the my-profile project to become a kind of read write web example of a social dashboard. Cozy Cloud comes with a dozen or so cloud-enabled apps, and has also been shortlisted for the LeWeb London best startup competition, so we wish them the best of luck!

Last but not least…

Activity Streams, the popular social network data exchange format, has been dipping its toes into Linked Data with Activity Streams 2.0, a JSON-LD-powered activity stream. This currently does not have official standing, but the reception has been good, and there is talk of pushing it through the IETF. Hopefully this can finally lead to a united and interoperable social web for all!

The joy of timezones

Planet RDFFri, 05/31/2013 - 03:44

Categories:

RDF
Timezones are annoying and inconvenient. And that's before legislatures get involved and start mucking about with them. Nevertheless, in the real world, sometimes you just gotta deal.

Google’s spiritus rector Eric Schmidt visited AKSW

Planet RDFThu, 05/30/2013 - 20:10

Categories:

RDF

Today Google’s spiritus rector Eric Schmidt visited AKSW to learn about the newest Linked Data technology and to figure out how to replace Google’s proprietary knowledge graph with open DBpedia.

Joke aside: Together with his co-author Jared Cohen, he visited the University of Leipzig to discuss their new book “The New Digital Age: Reshaping the Future of People, Nations and Business” with students and researchers.

Eric and Jared spent more than an hour answering questions, talking and joking. The major topics were the Internet, freedom of expression, privacy, copyright, and driving on the German Autobahn. One of their key ideas seems to be that technology and the Internet can help to make the world better by spreading values such as freedom of speech and ultimately democracy. Generally an agreeable opinion, but as we now have a virtual reality on the Internet, we also sometimes seem to have a virtual democracy. How else could George W. and friends succeed in taking over and raiding their country, lying to the world public to start a useless war (in Iraq) costing tens of thousands of lives on all sides? Especially in light of the latter, the internet censorship of the Chinese government (which was also a topic) appears to be a rather minor shortcoming.

UW iSchool joins DCMI as inaugural Institutional Member

Planet RDFWed, 05/29/2013 - 23:59

Categories:

RDF

2013-05-29, DCMI is pleased to announce that the Information School of the University of Washington in Seattle, USA, has joined DCMI as the inaugural Institutional Member in the Initiative's revised membership programs. As a leading member of the iSchool movement, the University of Washington Information School is a model for other information schools around the globe. Assistant Professor Joseph T. Tennis will represent the Information School on the DCMI Oversight Committee. Regional, Institutional and Supporting members of DCMI are pivotal to guaranteeing the continuing contributions of DCMI to the metadata community. Information about the revised membership programs is available at http://dublincore.org/about/membershipPrograms/.

DCMI-AsiaPac regional workshop in Singapore: "RDA, DC and Linked Data"

Planet RDFWed, 05/29/2013 - 23:59

Categories:

RDF

2013-05-29, DCMI-AsiaPac will hold a regional workshop in Singapore on 15 August 2013 as part of the DCMI Regional Meetings Series. The theme for the one-day workshop will be "RDA, DC and LOD" and it will consist of two half-day seminars. The Workshop will be held the day before the IFLA IT Section's conference on "User interaction based on library linked data" on 16 August. IFLA WLIC itself will run from 17-23 Aug 2013. Through the Workshop, the organizers intend to raise awareness among librarians in the Asian region of the implementation of RDA and of how library metadata (specifically DC) can be exposed as linked data to improve visibility and enhance collection usage. The objective is also to build confidence among Asian librarians to work well in the digital arena and be comfortable enough to adopt new technologies that will help improve their libraries' services. A secondary objective is to build a community for the DCMI Asia Task Group where regular discussion on metadata matters can be established. More information about the workshop is available at http://dcevents.dublincore.org/BibData/ap2013.

Interview: Oracle on Data on the Web - Part 1 with Reza B'Far

Planet RDFWed, 05/29/2013 - 15:11

Categories:

RDF

This is part 1 of a 2-part interview with Oracle about data on the Web. In part 1, the focus is on the consumption of data by applications, such as those that enterprises provide to their employees. In part 2, the focus is on back-end data management.

For this part of the interview I spoke with Reza B'Far, Vice President of Development.

IJ: How does the Oracle apps team use Web standards for data?

Reza: Oracle uses a number of W3C standards, but one of my focus areas is the application of Semantic Web technologies. OWL and PROV are the two standards we've used in our Fusion applications. Fusion applications bring together and integrate Oracle acquisitions from the past decade related to enterprise resource planning (ERP), human resources, supply chain management, financials, customer relations, and so on.

IJ: What are some examples of Fusion applications?

Reza: For example, enterprise customers use Fusion GRC to ensure they comply with various government rules and regulations. They also use the tools to detect over-payment or fraud. In the team that I run, the problems of discovering things like overpayment, SOD violations, fraud, and others are best solved by using an artificial intelligence (AI) approach. We have found that OWL provides an optimal way to capture the knowledge required by the AI engine, for example, for intelligent searches.

IJ: What are intelligent searches?

Reza: These are heuristic-based searches. Take the example of trying to detect fraud in an enterprise environment where a lot of systems interact. Suppose Jack reports to Joe and they collude in some way on one transaction out of 100,000. How do we detect this? One might try to look at all possible permutations of the transaction in the system, but there's no known solution if you take this sort of brute force approach where you simply look at every single possible permutation.

Reza: On the other hand, if you use heuristics based on domain expertise, you can make your search engine smarter and reduce the problem space. The challenge is how to capture the domain knowledge. There are a variety of ways to do this, even several approaches using Semantic Web technology. However, we found OWL worked best for us. OWL lets us represent all the entities in the system as well as statements like "the probability of fraud due to duplicate payment or overpayment is high." OWL is very versatile because it does not require you to use a single grand schema to represent your world. And, beyond heuristic reasoning, OWL gives us the secondary benefit of data aggregation.

IJ: So you have OWL statements and RDF data. Then what happens?

Reza: We have a reasoning engine --the Planner and Reasoning Engine-- which uses the heuristics and walks through the data to verify compliance, detect fraud, etc.

IJ: What were you using before OWL?

Reza: Though we did capture some data using a variety of formats, there really was nothing before OWL. We started using OWL to scale this product line by allowing our partners to add their own rules starting roughly 5 years ago. As an example, a company like Deloitte might use their own rules expressed in OWL, customer data, and Oracle's tools.

IJ: What is the reception to using OWL?

Reza: Fairly positive. The biggest barrier to OWL adoption has been that people are unfamiliar with it. So we have invested in educating our partners and customers, and this investment has paid off. Within Oracle, we've gone from "OWL is weird" to "OWL is a possibility." But we need more champions with specific applications that generate revenue.

IJ: How are you using PROV?

Reza: PROV is at least as important to us as OWL. Until PROV, one of the biggest problems we faced was maintaining transaction audit trails in a heterogeneous environment in a standard and compatible way. Audit trails are described with literally millions of different formats in different organizations. This used to mean it was impossible to create a single audit time line. PROV solves this problem. We now provide (and consume) a PROV feed that unifies the audit trails generated by transactions across heterogeneous systems.

IJ: What's an example?

Reza: Suppose I own a retail store and I contract with someone to help out during the holiday season. Months later that person becomes an employee. PROV lets me track changes over time for metadata from heterogeneous systems. It provides a standardized temporal structure for metadata, allowing me to aggregate temporal data from different systems. This lets me do things like look at payment data and changes to employee status and detect fraud.
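The retail-store example can be sketched roughly as follows. This only illustrates merging time-stamped records from two systems into one audit time line; the record fields, feed contents and function name are all invented for the example and are not the PROV vocabulary or Oracle's feed format.

```python
from datetime import datetime

def unified_timeline(*feeds):
    """Merge event feeds from several systems into one time-ordered list."""
    events = [event for feed in feeds for event in feed]
    return sorted(events, key=lambda e: datetime.fromisoformat(e["time"]))

# Hypothetical feeds from two separate systems.
hr_feed = [
    {"system": "hr", "agent": "pat", "activity": "hired as contractor",
     "time": "2012-11-20T09:00:00"},
    {"system": "hr", "agent": "pat", "activity": "became employee",
     "time": "2013-03-01T09:00:00"},
]
payroll_feed = [
    {"system": "payroll", "agent": "pat", "activity": "contractor payment",
     "time": "2012-12-31T17:00:00"},
    {"system": "payroll", "agent": "pat", "activity": "contractor payment",
     "time": "2013-03-15T17:00:00"},
]

timeline = unified_timeline(payroll_feed, hr_feed)
activities = [event["activity"] for event in timeline]
```

Once the events share one time line, a contractor payment dated after the change of employee status stands out and can be flagged for review.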

IJ: Are there other Semantic Web technologies you are thinking of adopting?

Reza: We are actively looking at the opportunity of using Linked Data Platform (LDP) specifications.

IJ: Any comments about vocabulary management?

Reza: I think there's a dissonance in vocabulary creation, particularly related to Dublin Core. There's no standard mechanism to rationalize OWL implementations with Dublin Core. Dublin Core defines a bunch of canonical domain objects. Dublin Core should be mashed into OWL. Or there could be guidelines on using OWL for consistency with Dublin Core. There is a risk of stumbling when using both unless you use them with consistency.

IJ: Thanks for your time!
