Planet RDF


W3C’s RDF Validation Workshop – Practical Assurances for Quality RDF Data

Wed, 05/22/2013 - 15:34

Categories:

RDF

W3C announced today an RDF Validation Workshop – Practical Assurances for Quality RDF Data, 10-11 September 2013, in Cambridge, USA. The Semantic Web has demonstrated considerable value for collaborative contributions to data. Adoption in many mission-critical environments requires data to conform to specified patterns. Validation in a banking context shares many requirements with quality assurance of linked clinical data. Systems like Linked Open Data, which don’t have formal interface specifications, share these validation needs. Most data representation languages used in conventional settings offer some sort of input validation, ranging from parsing grammars for domain-specific languages to XML Schema or RelaxNG for XML structures. While the distributed nature of RDF affects the notions of “validity”, tool chains need to be established to ensure data integrity. The goal of this workshop is to discuss use cases for data validation on the Semantic Web, along with the development of technologies to enable those use cases. W3C membership is not required to participate. The event is open to all. All participants are required to submit a position paper by 30 June 2013.
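To make "conforming to specified patterns" a little more concrete, here is a minimal sketch of one possible check: every instance of a class must carry a given set of properties. The `validate` function, the `ex:` names and the sample data are all hypothetical; this is not a technology proposed by the workshop.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"

def validate(triples, required):
    """Return a sorted list of (subject, missing_property) violations.

    triples  -- iterable of (subject, predicate, object) strings
    required -- dict mapping a class name to the set of properties
                that every instance of that class must carry
    """
    types = defaultdict(set)   # subject -> classes it is typed with
    props = defaultdict(set)   # subject -> predicates it uses
    for s, p, o in triples:
        if p == RDF_TYPE:
            types[s].add(o)
        else:
            props[s].add(p)

    violations = []
    for s, classes in types.items():
        for cls in classes:
            for prop in required.get(cls, ()):
                if prop not in props[s]:
                    violations.append((s, prop))
    return sorted(violations)

# Hypothetical banking-style data: the second account has no balance.
data = [
    ("ex:acct1", RDF_TYPE, "ex:Account"),
    ("ex:acct1", "ex:holder", "ex:alice"),
    ("ex:acct1", "ex:balance", "100"),
    ("ex:acct2", RDF_TYPE, "ex:Account"),
    ("ex:acct2", "ex:holder", "ex:bob"),
]
shape = {"ex:Account": {"ex:holder", "ex:balance"}}

print(validate(data, shape))
```

A check like this only covers the data at hand; because RDF is distributed, a missing property may simply be stated elsewhere, which is part of why the notion of "validity" needs discussion.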

Neighbourhoods of Winnipeg: A Community Semantic Portal

Tue, 05/21/2013 - 19:45

Categories:

RDF

NOW Portal Introduction from City of Winnipeg

Introduction

I am proud to announce the new NOW (Neighbourhoods Of Winnipeg) semantic web portal! This new and innovative semantic web portal was publicly announced by the Mayor of Winnipeg last week.

The NOW (Neighbourhoods of Winnipeg) portal is “a new Web portal (the “Portal”) produced by the City of Winnipeg to provide broad, dynamic and interactive access to local and neighbourhood information. Designed for easy access and use by all citizens, businesses, community organizations and Governments, the information on the site includes municipal data, census and demographic information, economic development information, historical data, much spatial and mapping information, and facilities for including and sharing data by external groups and constituencies.”

I suggest reading Mike Bergman’s blog post about this new semantic web portal for the proper background on this initiative by the City of Winnipeg and on how it uses the OSF (Open Semantic Framework) as its foundational technology stack.

This project has been the springboard that led to the Open Semantic Framework version 1.1. Multiple pieces of the framework have been developed in relation to this project, in particular the sWebMap semantic component and several improvements to the structWSF web services endpoints and conStruct modules for Drupal 6.

Development of the Portal

The development plan of this portal is composed of four major areas:

  1. Development of the data structure of the municipal domain by creating a series of ontologies
  2. Conversion of existing data assets using this new data structure
  3. Creation of the web portal by creating its design and by developing all the display templates
  4. Creation of new tools to let users interact with the data available on the portal

Structured Dynamics has been involved in #1, #2 and #4 by providing design and development resources, technology transfer sessions and material and supporting internal teams to create, maintain and deploy their 57 publicly available datasets.

The Data Structure

This technology stack does not have any meaning without the proper data and data structures (ontologies) in place. This gold mine of information is what drives the functionality of the portal.

The portal is driven by 12 ontologies: 2 internal and 10 external. The content of the 57 publicly available datasets is defined by the classes and properties defined in one of these ontologies.

The two internal ontologies have been created jointly by Structured Dynamics and the City of Winnipeg, but they are extended and maintained by the city only.

These ontologies are maintained using two different kinds of tools:

  1. Protege
  2. structOntology

Protege is used for large development tasks, such as creating many classes and properties or reorganizing the class structure.

structOntology is used for quick ontological changes that have an immediate impact on the behavior of the portal, such as label changes or SCO ontology property assignments that change the behavior of some of the portal’s tools.

structOntology can also be used by portal users to understand the underlying data structure used to define the data available on the portal. All users have access to the reading mode of the tool, which lets them browse, search and export the ontologies loaded on the portal.

The Data

With rare exceptions such as the historical photos, no new data has been created by the City of Winnipeg to populate this NOW portal. Most of its content comes from existing internal sources of data such as:

  • Conventional relational databases
  • GIS (Geographic Information System) on top of relational databases
  • Spreadsheets

All of the conventional relational databases and legacy data from the GIS systems have been converted into RDF using the FME Workbench ETL system. All of the FME Workbench templates map the relational data into RDF using the ontologies loaded into the portal. All of the geolocated records that exist in the portal come from this ETL process and have been converted using FME.

Some smaller datasets come from internal spreadsheets that were modified to comply with the commON spreadsheet format, which is used to convert spreadsheet (CSV/TSV) data files into RDF.

All of the dataset creation and maintenance is managed internally by the City of Winnipeg using one of these two data conversion and importation processes.
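The general idea behind both conversion processes, mapping tabular columns to ontology properties, can be sketched in a few lines. This is only an illustration: it uses neither FME nor the commON format, and the base URI, the property URIs and the column names are all made up for the example.

```python
import csv
import io

BASE = "http://example.org/now/"   # hypothetical base URI for records
# Hypothetical mapping from spreadsheet column to ontology property URI.
COLUMN_TO_PROPERTY = {
    "name": "http://example.org/ontology#name",
    "ward": "http://example.org/ontology#ward",
}

def escape(value):
    """Escape a string for use inside an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))

def rows_to_ntriples(csv_text, id_column="id"):
    """Yield one N-Triples line per non-empty mapped cell."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}{row[id_column]}>"
        for column, prop in COLUMN_TO_PROPERTY.items():
            value = (row.get(column) or "").strip()
            if value:
                yield f'{subject} <{prop}> "{escape(value)}" .'

# Two sample records; the second one has no ward.
sample = "id,name,ward\npark-1,Central Park,Ward 3\npark-2,Riverside,\n"
lines = list(rows_to_ntriples(sample))
print("\n".join(lines))
```

Keeping the column-to-property mapping in one table is what lets the same ontologies drive every dataset: changing the mapping changes the RDF without touching the source data.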

Here are some internal statistics of the content that is currently accessible on the NOW portal.

General Portal

These are statistics related to different functionalities of the portal.

  • Number of neighbourhoods: 236
  • Number of community areas: 14
  • Number of wards: 15
  • Number of neighbourhood clusters: 23
  • Number of major site sections: 7
  • Total number of site pages: 428,019
    • Static pages: 2,245
    • Record-oriented pages: 425,874
    • Dynamic (search-based) pages: infinite
  • Number of documents: 1,017
  • Number of images: 2,683
  • Number of search facets: 1,392
  • Number of display templates: 54
  • Number of links: 1,067
    • External links: 784
    • Internal links: 283

Site Data

These statistics show what is available via the portal: the types of things, their properties, and the quantity of data that can be searched, manipulated and exported from the portal.

  • Number of datasets: 57
  • Number of records: 425,874
    • Number of geolocational records: 418,869
      • Point of interest (POI) records: 193,272
      • Polygon records: 218,602
      • Path (route) records: 6,995
  • Number of classes (types): 84
  • Number of properties: 1,308
  • Number of triple assertions: 8,683,103

Sharing Content

An important aspect of this portal is that all of the content is contextually available, in different formats, to all of the users of the portal. Whether you are browsing content within datasets, searching for specific pieces of content, or looking at a specific record page, you always have the possibility to get your hands on the content that is being displayed to you, the user, with a choice of five different data formats:

Export Page Content

All content pages can be exported in one of the formats outlined above. In the bottom right corner of these pages you will see an Export button that you can click to get the content of that page in one of these formats.

Export Search Content

Every time you do a search on the portal, you can export the results of that search in one of the formats outlined above. You can do that by selecting the Export tab, and by selecting one of the formats you want to use for exporting the data.

Export Datasets

You can export any publicly available dataset from the portal. These datasets have to be exported in slices if they are too big to fit in a single slice. The datasets can be exported in one of the formats mentioned above.
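Exporting in slices amounts to cutting a stream of records into fixed-size batches. A minimal sketch follows; the `slices` helper and the slice size are illustrative, and nothing here reflects how the portal actually implements its export.

```python
def slices(records, slice_size):
    """Yield successive lists of at most slice_size records."""
    if slice_size <= 0:
        raise ValueError("slice_size must be positive")
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == slice_size:
            yield batch
            batch = []
    if batch:                      # last, possibly shorter, slice
        yield batch

# Ten records exported four at a time give three slices.
parts = list(slices(range(10), 4))
print([len(p) for p in parts])
```

Working on an iterator rather than a list means a large dataset never has to be held in memory as a whole.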

Export Census

Users can also export census data, from the census section of the portal, as spreadsheets. They only have to select the Tables tab and then click the Export Spreadsheet button.

Export Ontologies

The export functionality would not be complete without the ability to consult and export the ontologies that are used to describe the content exposed by the portal. These ontologies can be read from the ontologies reader user interface, or can be exported from the portal to be read by external ontologies management tools such as Protege.

Portal Design

The portal uses Drupal 6 as its CMS (Content Management System). The Drupal 6 instance communicates with structWSF using the conStruct module, which acts as a bridge between a Drupal portal and a structWSF web service network.

Here are the main design phases that have been required to create the portal:

  1. Creation of the portal’s design, and the Drupal 6 theme that implements it
  2. Creation of the Search and Browse results templates
  3. Creation of the individual records’ page design and templates based on their type
  4. Creation of the sWebMap search results templates.

The portal’s design has been created internally by the City of Winnipeg and by Tactica based on the Citizen DAN demo. Tactica also worked on another Citizen DAN-like portal called MyPeg.ca.

Semantic Components

The NOW Web portal uses a series of tools called the Semantic Components. These are a set of Flash and JavaScript tools that can be embedded within any web page and that can easily communicate with structWSF instance(s). They display information in all kinds of charts, they can display document reading widgets, they can create dashboards of structured data, etc. The initial set of Semantic Components was developed for the MyPeg.ca project back in November 2010. This was before Steve Jobs announced that Apple would not support Adobe Flash, and long before Google announced that it would drop support for it as well.

Since the NOW portal team wanted to re-use as much as possible to lower the development cost of the portal, they chose to use the complete OSF stack, which includes these Semantic Components.

However, when we participated in developing this new NOW portal, we extended the set of Semantic Components by creating the most complex one: the sWebMap. Because of the two announcements mentioned above, we chose to create the sWebMap Semantic Component using JavaScript instead of Flash. The other Semantic Component tools that have been developed in Flash have not yet been ported to JavaScript.

Conclusion

The new NOW semantic web portal’s main asset is its data: how it can be searched (with traditional search engines or using a semantic component to search, browse, filter and localize results), displayed and exported. This portal has been developed using a completely free and open source semantic platform that has been developed from previous projects that open sourced their code.

I consider this portal a pioneer in the way municipal organizations will provide new online services to their citizens and to commercial enterprises, based on the quality of the data that will be exposed via such Web portals.

seevl Attends: MusicTechFest, Google I/O and The Music Technology Showcase

Mon, 05/20/2013 - 14:05

Categories:

RDF

These are busy times for us! In addition to the launch of the mobile version of seevl for Deezer, we’re also involved in three exciting events: MusicTechFest, Google I/O and The Music Technology Showcase! This weekend, Pete traveled to MusicTechFest in London to present seevl to a warm [...]

The post seevl Attends: MusicTechFest, Google I/O and The Music Technology Showcase appeared first on seevl.net.

Gmail adds support for embedding semantic data

Fri, 05/17/2013 - 14:18

Categories:

RDF

Email is not dead yet.

Google announced support for embedding and using semantic information in Gmail messages in either Microdata or JSON-LD. They currently handle several use cases (actions) that use the semantic markup, and offer a way to define new actions.

Providing a way to embed data in JSON-LD should open up the system for experimentation with other vocabularies beyond schema.org. Since the approach just leverages the general ability to embed semantic data in HTML, it is not restricted to Gmail and can be used by any email environment that supports messages whose content is encoded as HTML.
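Since the mechanism is just semantic data embedded in HTML, a message carrying JSON-LD can be assembled with ordinary tools. The sketch below puts a JSON-LD object in a `script` element of type `application/ld+json` inside the HTML part of an email. The event data, the addresses and the `html_with_jsonld` helper are made up for illustration, and the example does not use any of the specific actions Gmail handles.

```python
import json
from email.message import EmailMessage

# Hypothetical data described with schema.org terms.
data = {
    "@context": "http://schema.org",
    "@type": "Event",
    "name": "RDF Validation Workshop",
    "startDate": "2013-09-10",
}

def html_with_jsonld(body_html, jsonld):
    """Return an HTML page that carries jsonld in a script element."""
    # Escape "</" so the JSON text cannot close the script element early.
    payload = json.dumps(jsonld, indent=2).replace("</", "<\\/")
    return (
        "<html><head>\n"
        f'<script type="application/ld+json">\n{payload}\n</script>\n'
        f"</head><body>{body_html}</body></html>"
    )

msg = EmailMessage()
msg["Subject"] = "Workshop announcement"
msg["From"] = "sender@example.org"
msg["To"] = "recipient@example.org"
msg.set_content("Plain-text fallback.")
msg.add_alternative(html_with_jsonld("<p>See you there.</p>", data),
                    subtype="html")

print(msg.get_content_type())
```

Mail clients that do not understand the markup simply ignore the script element and show the HTML body, or the plain-text fallback, as usual.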

We hope that this will lead to many exciting ideas, as developers experiment with applications that use the mechanism to embed and understand concepts, entities and facts in email messages.

JSON-LD Algorithms and API 1.0: 2nd Last Call

Thu, 05/16/2013 - 15:38

Categories:

RDF

The JSON-LD Community Group and the RDF Working Group have announced the 2nd Last Call publication of the JSON-LD 1.0: Algorithms and API specification.

JSON-LD harmonizes the representation of Linked Data in JSON by describing a common JSON representation format for expressing directed graphs; mixing both Linked Data and non-Linked Data in a single document. The format has already been adopted by large companies such as Google in their Gmail product and is available to over 425 million customers around the world.

The syntax is designed to not disturb already deployed systems running on JSON, but provide a smooth upgrade path from JSON to JSON-LD. It is primarily intended to be a way to use Linked Data in Web-based programming environments, to build interoperable Linked Data Web services, and to store Linked Data in JSON-based storage engines. The JSON-LD 1.0 Algorithms and API specification describes useful Algorithms for working with JSON-LD data. It also specifies an Application Programming Interface that can be used to transform JSON-LD documents in order to make them easier to work with in programming environments like JavaScript, Python, and Ruby.
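To give a feel for the "upgrade path" from JSON, the sketch below takes a plain JSON object with an added `@context` and replaces its short keys with the full IRIs the context assigns to them. This is a deliberately naive illustration, not the expansion algorithm defined in the specification: it handles only flat documents with a simple term-to-IRI context, and the `expand_terms` helper and sample person are hypothetical.

```python
def expand_terms(doc):
    """Replace context-defined keys with their full IRIs (flat docs only)."""
    context = doc.get("@context", {})
    expanded = {}
    for key, value in doc.items():
        if key == "@context":
            continue
        # Keys without a mapping, such as the @id keyword, are kept as-is.
        expanded[context.get(key, key)] = value
    return expanded

person = {
    "@context": {
        "name": "http://xmlns.com/foaf/0.1/name",
        "homepage": "http://xmlns.com/foaf/0.1/homepage",
    },
    "@id": "http://example.org/people/alice",
    "name": "Alice",
    "homepage": "http://example.org/alice/",
}

result = expand_terms(person)
print(result)
```

An application that ignores `@context` still sees ordinary JSON with the keys `name` and `homepage`, which is why existing JSON consumers are not disturbed.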

This is a 2nd Last Call publication for the JSON-LD 1.0 Algorithms and API specification. Changes since the previous publication include a shift to a Futures-based API design approach, better base URL processing, and better translation of data from RDF.
All substantive technical work on the specification is complete. Feedback on both specifications is encouraged and should be sent to public-rdf-comments@w3.org. The 2nd Last Call period will end in 3 weeks, on June 6th, 2013.

If you would like to learn more about JSON-LD, there is a helpful introductory video on the topic as well as the json-ld.org website.

PhD proposal: A Semantic Resolution Framework for Manufacturing Capability Data Integration

Tue, 05/14/2013 - 01:56

Categories:

RDF

Ph.D. Dissertation Proposal

A Semantic Resolution Framework for
Manufacturing Capability Data Integration

10:30am Tuesday, May 14, 2013, ITE 346, UMBC

Yan Kang

Building flexible manufacturing supply chains requires interoperable and accurate manufacturing service capability (MSC) information of all supply chain participants. Today, MSC information, which is typically published either on the supplier’s web site or registered at an e-marketplace portal, has been shown to fall short of the interoperability and accuracy requirements. This issue can be addressed by annotating the MSC information using shared ontologies. However, ontology-based approaches face two main challenges: 1) lack of an effective way to transform a large amount of complex MSC information hidden in the web sites of manufacturers into a representation of shared semantics and 2) difficulties in the adoption of ontology-based approaches by the supply chain managers and users because of their unfamiliarity with the syntax and semantics of formal ontology languages such as OWL and RDF and the lack of tools friendly to inexperienced users.

The objective of our research is to address the main challenges of ontology-based approaches by developing an innovative approach that can effectively extract a large volume of manufacturing capability instance data, accurately annotate these instance data with semantics and integrate these data under a formal manufacturing domain ontology. To achieve the objective, a Semantic Resolution Framework is proposed to guide every step of the manufacturing capability data integration process and to resolve semantic heterogeneity with minimal human supervision. The key innovations of this framework include 1) three assisting systems, including a Triple Store Extractor, a Triple Store to Ontology Mapper and an Ontology-based Extensible Dynamic Form, that can efficiently and effectively perform the automatic processes of extracting, annotating and integrating manufacturing capability data; 2) a Semantic Resolution Knowledge Base (SR-KB) that is incrementally filled with, among other things, rules/patterns learned from errors. This SR-KB together with an Upper Manufacturing Domain Ontology (UMO) provides knowledge for resolving semantic differences in the integration process; 3) an evolution mechanism that enables SR-KB to continuously improve itself and gradually reduce the human involvement by learning from mistakes.

Committee: Yun Peng (chair), Charles Nicholas, Tim Finin, Yaacov Yesha, Boonserm Kulvatunyou (NIST)

Virtuoso 7 Release

Mon, 05/13/2013 - 16:05

Categories:

RDF

The quest of OpenLink Software is to bring flexibility, efficiency, and expressive power to people working with data. For the past several years, this has been focused on making graph data models viable for the enterprise. Flexibility in schema evolution is a central aspect of this, as is the ability to share identifiers across different information systems, i.e., giving things URIs instead of synthetic keys that are not interpretable outside of a particular application.

With Virtuoso 7, we dramatically improve the efficiency of all this. With databases in the billions of relations (also known as triples, or 3-tuples), we can fit about 3x as many relations in the same space (disk and RAM) as with Virtuoso 6. Single-threaded query speed is up to 3x better, plus there is intra-query parallelization even in single-server configurations. Graph data workloads are all about random lookups. With these, having data in RAM is all-important. With 3x space efficiency, you can run with 3x more data in the same space before starting to go to disk. In some benchmarks, this can make a 20x gain.

The Virtuoso scale-out support has also been fundamentally reworked, with much more parallelism and better deployment flexibility.

So, for graph data, Virtuoso 7 is a major step in the coming of age of the technology. Data keeps growing and time is getting scarcer, so we need more flexibility and more performance at the same time.

So, let’s talk about how we accomplish this. Column stores have been the trend in relational data warehousing for over a decade. With column stores comes vectored execution, i.e., running any operation on a large number of values at one time. Instead of running one operation on one value, then the next operation on the result, and so forth, you run the first operation on thousands or hundreds-of-thousands of values, then the next one on the results of this, and so on.
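The difference between the two execution styles can be shown with a toy pipeline of two operators, a doubling step followed by a filter. This is only a sketch of the idea; the operator names and the vector size are invented for the example and say nothing about Virtuoso's internals.

```python
def tuple_at_a_time(values):
    """Run both operators on one value before touching the next value."""
    out = []
    for v in values:
        doubled = v * 2            # operator 1
        if doubled > 10:           # operator 2
            out.append(doubled)
    return out

def double_op(vector):
    """Operator 1 applied to a whole vector of values."""
    return [v * 2 for v in vector]

def filter_op(vector):
    """Operator 2 applied to a whole vector of values."""
    return [v for v in vector if v > 10]

def vectored(values, vector_size=4):
    """Run each operator over a whole vector before the next operator."""
    out = []
    for start in range(0, len(values), vector_size):
        vector = values[start:start + vector_size]
        out.extend(filter_op(double_op(vector)))
    return out

data = list(range(10))
print(tuple_at_a_time(data))
print(vectored(data))
```

Both functions return the same result; what changes is the order of the work, with each operator staying "hot" for a whole batch of values instead of being re-entered for every single one.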

Column-wise storage brings space efficiency, since values in one column of a table tend to be alike -- whether repeating, sorted, within a specific range, or picked from a particular set of possible values. With graph data, where there are no columns as such, the situation is exactly the same -- just substitute the word predicate for column. Space efficiency brings speed -- first by keeping more of the data in memory; secondly by having less data travel between CPU and memory. Vectoring makes sure that data that are closely located get accessed in close temporal proximity, hence improving cache utilization. When there is no locality, there are a lot of operations pending at the same time, as things always get done on a set of values instead of on a single value. This is the crux of the science of columns and vectoring.
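Why repeating values in a column compress well can be seen with run-length encoding, one of the simplest column compression techniques. The sketch below applies it to a predicate "column" of graph data sorted by predicate. It is an illustration of the principle only and is not a description of the compression Virtuoso actually uses; the helper names and sample predicates are made up.

```python
from itertools import groupby

def rle_encode(column):
    """Collapse runs of equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]

def rle_decode(pairs):
    """Rebuild the original column from (value, run_length) pairs."""
    return [value for value, count in pairs for _ in range(count)]

# Predicate column of statements sorted by predicate: long runs of
# identical values, exactly the case described in the text.
predicates = ["rdf:type"] * 4 + ["foaf:name"] * 3 + ["foaf:knows"] * 2

encoded = rle_encode(predicates)
print(encoded)
```

Nine stored values shrink to three pairs here, and the effect grows with the length of the runs, which is why sorting by predicate matters.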

Of the prior work in column stores, Virtuoso may most resemble Vertica, well described in Daniel Abadi’s famous PhD thesis. Virtuoso itself is described in IEEE Data Engineering Bulletin, March 2012 (PDF). The first experiments in column store technology with Virtuoso were in 2009, published at the SemData workshop at VLDB 2010 in Singapore. We tried storing TPC-H as graph data and in relational tables, each with both rows and columns, and found that we could get 6 bytes per quad space utilization with the RDF-ization of TPC-H, as opposed to 27 bytes with the row-wise compressed RDF storage model. The row-wise compression itself is 3x more compact than a row-wise representation with no compression.

Memory is the key to speed, and space efficiency is the key to memory. Performance comes from two factors: locality and parallelism. Both are addressed by column store technology. This made me a convert.

At this time, we also started the EU FP7 project, LOD2, most specifically working with Peter Boncz of CWI, the king of the column store, famous for MonetDB and VectorWise. This cooperation goes on within LOD2 and has extended to LDBC, an FP7 project for designing benchmarks for graph and RDF databases. Peter has given us a world of valuable insight and experience in all aspects of avant-garde database technology, from adaptive techniques to query optimization and beyond. One thing that was recently published is the set of results for a Virtuoso cluster at CWI, running analytics on 150 billion relations on CWI’s SciLens cluster.

The SQL relational table-oriented databases and property graph-oriented databases (Graph for short) are both rooted in relational database science. Graph management simply introduces extra challenges with regard to scalability. Hence, at OpenLink Software, having a good grounding in the best practices of relational columnar (or column-wise) database management technology is vital.

Virtuoso is more prominently known for high-performance RDF-based graph database technology, but the entirety of its SQL relational data management functionality (which is the foundation for graph store) is vectored, and even allows users to choose between row-wise and column-wise physical layouts, index by index.

It has been asked: is this a new NoSQL engine? Well, there isn’t really such a thing. There are of course database engines that do not have SQL support and it has become trendy to call them "NoSQL." So, in this space, Virtuoso is an engine that does support SQL, plus SPARQL, and is designed to do big joins and aggregation (i.e., analytics) and fast bulk load, as well as ACID transactions on small updates, all with column store space efficiency. It is not only for big scans, as people tend to think about column stores, since it can also be used in compact embedded form.

Virtuoso also delivers great parallelism and throughput in a scale-out setting, with no restrictions on transactions and no limits on joining. The base is in relational database science, but all the adaptations that RDF and graph workloads need are built-in, with core level support for run-time data-typing, URIs as native Reference types, user-defined custom data types, etc.

Now that the major milestone of releasing Virtuoso 7 (open source and commercial editions) has been reached, the next steps include enabling our current and future customers to attain increased agility from big (linked) open data exploits. Technically, it will also include continued participation in DBMS industry benchmarks, such as those from the TPC, and others under development via the Linked Data Benchmark Council (LDBC), plus other social-media-oriented challenges that arise in this exciting data access, integration, and management innovation continuum. Thus, continue to expect new optimization tricks to be introduced at frequent intervals through the open source development branch at GitHub, between major commercial releases.

US Government announces new open data policy

Sat, 05/11/2013 - 13:16

Categories:

RDF

The White House blog announced a new open data policy for government-held data.

“President Obama signed an Executive Order directing historic steps to make government-held data more accessible to the public and to entrepreneurs and others as fuel for innovation and economic growth. Under the terms of the Executive Order and a new Open Data Policy released today by the Office of Science and Technology Policy and the Office of Management and Budget, all newly generated government data will be required to be made available in open, machine-readable formats, greatly enhancing their accessibility and usefulness, while ensuring privacy and security.”

While the policy doesn’t mention adding semantic markup to enhance machine understanding, calling for machine-readable datasets with persistent identifiers is a big step forward.

SlideWiki at CSEDU2013 Conference

Fri, 05/10/2013 - 10:21

Categories:

RDF

CSEDU 2013, the International Conference on Computer Supported Education, took place in Aachen, Germany this year. The conference addressed different e-learning themes such as Information Technologies Supporting Learning, Learning/Teaching Methodologies and Assessment, Ubiquitous Learning, Social Context and Learning Environments as well as Cloud Education Environments.
On behalf of AKSW, Ali Khalili and Darya Tarasowa presented two papers, namely “CrowdLearn: Crowd-sourcing the Creation of Highly-structured E-Learning Content” and “Balanced Scoring Method for Multiple-mark Questions”, at the conference (with an acceptance rate of 13% for full papers). The corresponding slides are available on SlideWiki (CrowdLearn, Balanced Scoring). The CrowdLearn paper, which discusses the underlying philosophy behind SlideWiki, received much attention from the audience and was also nominated for the best-paper award at the conference. Many people were interested in using SlideWiki to publish their teaching material and to share their educational content with other people in an OpenCourseWare environment.

Most of the keynotes addressed MOOCs (Massive Open Online Courses) and the new paradigms emerging on the Web for social learning. SlideWiki, as an example of a collaboration platform, was also mentioned by Professor Michael E. Auer in his keynote about the new engineering challenges in e-learning.

New Paper: "The ChEMBL database as linked open data"

Thu, 05/09/2013 - 12:45

Categories:

RDF

Yesterday, the paper "The ChEMBL database as linked open data" (doi:10.1186/1758-2946-5-23) by Andra Waagmeester (@andrawaag), Ola Spjuth (@ola_spjuth), Peter Ansell (@p_ansell), Antony Williams (@chemconnector), Valery Tkachenko, Janna Hastings, Bin Chen (@binchenindiana), David J Wild (@davidjohnwild), and me appeared in the OA JChemInf journal.

I am also indebted to the ChEMBL team (@chembl) for both providing such valuable data under a liberal Open Access license and their critical reading of the manuscript! Additionally, I would like to stress that the ChEMBL team will create their own RDF version of ChEMBL and that this paper is not describing the version they will release.

BTW, the source of the paper is available from GitHub. And the (original) scripts to create RDF from the MySQL dump of ChEMBL are also on GitHub.


This paper outlines the RDF as it has evolved from various earlier projects. The above diagram visualizes the basic structure (red) and the various Linked Data resources linked to (blue), and illustrates how various ontologies are used, such as the CHEMINF, BIBO, and CiTO ontologies.
Additionally, the paper describes various applications and links developed by the co-authors. For example, Peter worked on the use in Bio2RDF and Bin and David on Chem2Bio2RDF. Andra developed an extension for his (#altmetric) CitedIn resource, giving credit to a paper when data in it is extracted into ChEMBL. Ola, Valery, and Antony developed a Bioclipse Decision Support extension, which supports a nearest neighbor search in ChEMBL using ChemSpider. Of course, Ola also hosts the SPARQL end point, whose uptime you can monitor at the also cool mondeca.com service.


(Yes, I think I have all the cool buzzwords covered in this paper. Sadly, marketing is needed nowadays as a scientist. Where is the time that you could rant on page after page in all your domain specific jargon, not having to worry if your reader would understand it immediately, or without a university degree...)
What this paper does not describe is all the things I did with ChEMBL-RDF in the Open PHACTS project (@Open_PHACTS), which includes the use of QUDT and the jQUDT library for unit normalization outlined in this document and the use of VoID for link sets as described in this document.

Willighagen, E.; Waagmeester, A.; Spjuth, O.; Ansell, P.; Williams, A.; Tkachenko, V.; Hastings, J.; Chen, B.; Wild, D. Journal of Cheminformatics 2013, 5, 23+.

NISO/DCMI Webinar: Semantic Mashups Across Large, Heterogeneous Institutions: Experiences from the VIVO Service

Wed, 05/08/2013 - 23:59

Categories:

RDF

2013-05-08, A NISO/DCMI Webinar with John Fereira will be held online at 1:00PM Eastern Time on 22 May 2013 (17:00 UTC - see World Clock: http://bit.ly/157qF2S). This webinar will present the perspectives of a software developer on the practicalities of building a high-quality Semantic-Web search service on existing data maintained in dozens of formats and software platforms at large, diverse institutions using VIVO, a semantic web application focused on discovering researchers and research publications in the life sciences. Using a built-in, editable ontology for describing things such as People, Courses, and Publications, data is transformed into a Semantic-Web-compliant form. VIVO provides automated and self-updating processes for improving data quality and authenticity. The webinar will highlight services that leverage the Semantic Web platform in innovative ways, e.g., for finding researchers based on the text content of a particular Web page and for visualizing networks of collaboration across institutions. Additional information can be found at http://www.niso.org/news/events/2013/dcmi/vivo/. Registration for the webinar closes 22 May 2013 at 12:00PM Eastern (16:00 UTC).

New vCard in RDF Ontology draft

Thu, 05/02/2013 - 14:53

Categories:

RDF

The Semantic Web Interest Group has published a new draft of the vCard-in-RDF Ontology, edited by Renato Iannella and James McKinney. The new draft updates the previous version by aligning it with the latest IETF vCard specification, i.e., RFC 6350.

This is a draft; if you wish to make comments regarding this document, please send them to semantic-web@w3.org (subscribe, archives). The goal is to publish an Interest Group Note once there is a consensus in the community.

UMBC WebBase corpus of 3B English words

Thu, 05/02/2013 - 03:55

Categories:

RDF

The UMBC WebBase corpus is a dataset of high quality English paragraphs containing over three billion words derived from the Stanford WebBase project’s February 2007 Web crawl. Compressed, its size is about 13GB. We have found it useful for building statistical language models that characterize English text found on the Web.

The February 2007 Stanford WebBase crawl is one of their largest collections and contains 100 million web pages from more than 50,000 websites. The Stanford WebBase project did an excellent job in extracting textual content from HTML tags but there are still many instances of text duplications, truncated texts, non-English texts and strange characters.

We processed the collection to remove undesired sections and produce high quality English paragraphs. We detected paragraphs using heuristic rules and only retained those whose length was at least two hundred characters. We eliminated non-English text by checking the first twenty words of a paragraph to see if they were valid English words. We used the percentage of punctuation characters in a paragraph as a simple check for typical text. We removed duplicate paragraphs using a hash table. The result is a corpus with approximately three billion words of good quality English.
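The filtering steps described above can be illustrated with a minimal Python sketch. This is not the code used to build the corpus: the function names, the 80% English-word threshold, the 20% punctuation ceiling and the caller-supplied vocabulary are all assumptions made for illustration. Only the 200-character minimum, the twenty-word language check and the hash-based duplicate removal come from the description.

```python
import hashlib
import string

MIN_LENGTH = 200            # from the post: keep paragraphs of at least 200 characters
LANGUAGE_CHECK_WORDS = 20   # from the post: inspect the first twenty words
MIN_ENGLISH_RATIO = 0.8     # assumption: share of those words that must be known
MAX_PUNCT_RATIO = 0.2       # assumption: ceiling on punctuation characters


def looks_english(paragraph, vocabulary):
    """Check the first twenty words against a caller-supplied English word set."""
    tokens = paragraph.split()[:LANGUAGE_CHECK_WORDS]
    words = [t.strip(string.punctuation).lower() for t in tokens]
    words = [w for w in words if w]
    if not words:
        return False
    known = sum(1 for w in words if w in vocabulary)
    return known / len(words) >= MIN_ENGLISH_RATIO


def punctuation_ratio(paragraph):
    """Fraction of characters that are ASCII punctuation."""
    if not paragraph:
        return 0.0
    return sum(1 for ch in paragraph if ch in string.punctuation) / len(paragraph)


def clean_paragraphs(paragraphs, vocabulary):
    """Apply the length, language, punctuation and duplicate filters in order."""
    seen = set()   # hash table of paragraph digests already kept
    kept = []
    for raw in paragraphs:
        text = raw.strip()
        if len(text) < MIN_LENGTH:
            continue
        if not looks_english(text, vocabulary):
            continue
        if punctuation_ratio(text) > MAX_PUNCT_RATIO:
            continue
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(text)
    return kept
```

Storing a digest rather than the full paragraph keeps the duplicate table small, which matters at the scale of a 100-million-page crawl; a real run would also need a full English word list in place of a toy vocabulary.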

The dataset has been used in several projects. If you use the dataset, please refer to it by citing the following paper, which describes it and its use in a system that measures the semantic similarity of short text sequences.

Lushan Han, Abhay L. Kashyap, Tim Finin, James Mayfield and Johnathan Weese, UMBC EBIQUITY-CORE: Semantic Textual Similarity Systems, Proc. 2nd Joint Conference on Lexical and Computational Semantics, Association for Computational Linguistics, June 2013. (bibtex)

Open Data Conference in Seoul

Wed, 05/01/2013 - 21:25

Categories:

RDF

On the second day of Sören’s short trip to Korea, we participated in the Open Data Conference of the Information Society Agency (NIA). NIA seems to be implementing a comprehensive Open Data strategy (also involving Linked Data). Looks like South Korea is quite advanced in this regard already. In addition to my talk about Linked Open Government Data, there was also a talk by Haklae Kim about the Korean chapter of the Open Knowledge Foundation. There were also some industry representatives (Samsung, LG) in the audience who were interested in applying Linked Data in enterprise environments.

You can find pictures from the event on Facebook.

Beyond 3D printing

Wed, 05/01/2013 - 19:00

Categories:

RDF

3D printers are very much in vogue and used for everything from spectacle frames to jet engine components. They work by building up a 3D form one thin layer at a time. A variety of materials can be used depending on the desired properties of the resulting component.

I believe we should learn from nature. If you look at natural materials constructed by living organisms, it is really remarkable what has been achieved, for instance, hair, feathers, skin, teeth and bones. Insects are amazing to look at under the microscope and come in all sorts of weird forms. The structure of an insect’s antenna or a butterfly’s wings is incredible.

The cell is a powerful molecular computer. At its heart, DNA provides the storage for the program. The human genome is said to be about three thousand million base pairs in size. The cell makes use of a complex set of molecules to determine which parts of the genome are being transcribed into proteins at any one time. The architecture is unlike any digital computer we are familiar with. The cell’s state is distributed across many components, and updated in complex chemical pathways. We are gradually improving our understanding of how they work together as a system.

It is now time to study how to create synthetic cells and learn how to utilize these to create complex materials we can use in a future generation of products. For this purpose, we will have to start relatively simply by studying particular subsystems without the need to fabricate the full complexity seen in living cells. This functional approach also has the great advantage of avoiding the risk of creating a new breed of organisms that can escape to the environment and replicate themselves unchecked.

The first step is to study how to create a molecular computer with DNA, RNA, ribosomes, enzymes and so forth. Can we build a system where we can design a program, translate it into DNA, and use it to switch on and off which parts of the DNA are being transcribed, and to update the state of the synthetic cell in predictable and controllable ways? Once that is achieved we could go on to develop the functional components needed to form a 3D assembler. These include counters and timers, as well as how to control the functioning of a synthetic cell according to its neighbours, or to chemical or electromagnetic gradients.

A working system would involve a means to design a program and translate it into DNA, to massively replicate this and assemble the synthetic cells from the raw ingredients, and then trigger them to start the assembly of the desired components in a carefully controlled environment. The synthetic cells would be unable to replicate themselves, and designed with only one purpose in mind.

The benefits of this approach would be the ability to create a very wide range of complex materials and forms from readily available raw materials in an energy efficient process. Today’s manufacturing processes aren’t sustainable in the long run as they use large amounts of energy and rely on materials that will increasingly be in short supply, for example, copper for electrical conductors and rare earths for electronic components and touch screens in smart phones. Biological processes by contrast make use of trace amounts of materials and as such are much more sustainable.

The time has come for a sustained programme of investment into research in molecular computing and synthetic cells. This is essential for sustaining a high standard of living as we move into a lasting era of increasingly expensive raw materials.

How Do We Attribute Data?

Tue, 04/30/2013 - 17:42

Categories:

RDF

This post is another in my ongoing series of “basic questions about open data”, which includes “

Linked (Open) Data has reached the European Publishing Industry – but is it the ‘Real Linked Data’ – a short review on the Publishers’ Forum 2013

Tue, 04/30/2013 - 16:39

Categories:

RDF

Invited by Helmut von Berg, Director at Klopotek & Partner (Klopotek is THE European vendor for publishing production software), I had the chance to participate and speak at this year’s Publishers’ Forum at the Concorde Hotel in Berlin on 22 and 23 April 2013.

Coming from the semantic web / linked (open) data community to this publishing industry event with about 320 participants (mainly decision makers) from small to huge publishers all across Europe made me really curious in the run-up to the Forum – what would be the most important issues for innovative publishing processes, what would be the hypes and hopes of a sector that is in the middle of a big change: coming from paper publishing straight into the world of today’s data economy?

And then in Berlin, Monday morning – the big surprise: the opening keynotes by David Worlock, Outsell, UK (Title of Talk: The Atomization of Everything) and Dan Pollock, Nature Publishing Group, UK (Title of Talk: Networked Publishing is Open for Business) already mentioned topics such as the Semantic Web, Linked (Open) Data and even RDF and Triple Stores – last but not least pointing out that the content of publishers needs to be atomized down to the ‘data level’ and can then be used successfully for new and innovative business models to serve existing and future customers…


David Worlock ‘singing my song’ at the Publishers’ Forum 2013

As I participated in the European Data Forum 2013 (EDF2013) just a few days before the Publishers’ Forum, my first thought was: WOW – publishers today have arrived in the modern data economy (already following the data value chain)! I enjoyed talking to David Worlock in the coffee break, telling him my thoughts and that I would be running a workshop about ‘Enterprise Terminology as a basis for powerful semantic services for publishers’ that afternoon (see slides on slideshare). His answer was ‘Yes Martin, it seems that I was singing your song’.

The following 1.5 days of the Publishers’ Forum 2013 were full of presentations, workshops and discussions about innovative publishing processes, new business models for publishers and innovative approaches and services – full of terms that are well known to me, like metadata management, semantics, contextualisation and, very often, Big Data and Linked (Open) Data. I listened very carefully to all of this, and at some point it was clear that this discussion needs to be evaluated more carefully: many of the talks and presentations used the above-mentioned terms, principles and technologies only as marketing buzzwords, and a deeper look showed that there was no semantic web technology in place.

Hey, Linked Data does NOT mean establishing something like a relation / a link between ‘an Author and a publication’ inside of a repository / a database – Linked (Open) Data is a well established and specified methodology using W3C semantic web standards:

Tim Berners-Lee outlined four principles of linked data in his Design Issues: Linked Data as follows:

  • Use URIs to denote things.
  • Use HTTP URIs so that these things can be referred to and looked up (“dereferenced”) by people and user agents.
  • Provide useful information about the thing when its URI is dereferenced, leveraging standards such as RDF*, SPARQL.
  • Include links to other related things (using their URIs) when publishing data on the Web.
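To make the contrast with a plain ‘author to publication’ link inside a database concrete, here is a minimal Python sketch of the four principles. It is an illustration only: the example.org identifiers, the names and the DBpedia link target are hypothetical, and treating every object that is not an HTTP URI as a plain literal is a simplification. The Dublin Core, FOAF and OWL property URIs are the standard ones.

```python
from urllib.parse import urlparse

# Hypothetical identifiers: principles 1 and 2 ask for HTTP URIs that name things.
AUTHOR = "http://example.org/id/author/42"
BOOK = "http://example.org/id/publication/7"

# (subject, predicate, object) statements about those things (principle 3).
TRIPLES = [
    (BOOK, "http://purl.org/dc/terms/title", "An Example Book"),
    (BOOK, "http://purl.org/dc/terms/creator", AUTHOR),
    (AUTHOR, "http://xmlns.com/foaf/0.1/name", "Jane Doe"),
    # Principle 4: link out to a related thing published by someone else.
    (AUTHOR, "http://www.w3.org/2002/07/owl#sameAs",
     "http://dbpedia.org/resource/Example"),
]


def is_http_uri(value):
    """True if the value is an HTTP(S) URI that could be dereferenced."""
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def to_ntriples(triples):
    """Serialize the statements as N-Triples, one statement per line.

    Simplification: any object that is not an HTTP URI becomes a plain literal.
    """
    lines = []
    for subject, predicate, obj in triples:
        if is_http_uri(obj):
            rendered = f"<{obj}>"
        else:
            escaped = obj.replace("\\", "\\\\").replace('"', '\\"')
            rendered = f'"{escaped}"'
        lines.append(f"<{subject}> <{predicate}> {rendered} .")
    return "\n".join(lines)


def external_links(triples, own_host):
    """Objects that point at URIs hosted somewhere other than own_host."""
    return [o for _, _, o in triples
            if is_http_uri(o) and urlparse(o).netloc != own_host]
```

A dataset whose `external_links` list is empty, or whose subjects fail `is_http_uri`, has relations but is not yet Linked Data in the sense of the principles. Actually serving the description when a URI is looked up is a web server concern and is left out of the sketch.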

Please read in more detail here:

Being something of an evangelist for Linked (Open) Data, I think such hype can be very dangerous for the publishing industry. I see a very strong need for these companies to adopt innovative content and data management approaches very quickly to ensure competitiveness today as well as competitive advantage tomorrow – but not using the respective standards (that is, only having the packaging and marketing brochures branded with them) cannot fulfil the hopes in the mid and long term!

I would therefore like to point out that ‘Linked Data’ is not always ‘Linked Data’, and I strongly recommend taking a look at the well-proven standards. When selecting IT consultants and IT vendors (that is, your IT partners – another very interesting message taken home from the Forum was that publishers and IT vendors should co-operate more closely in the future in the form of sustainable partnerships), make sure that these partners have really worked with these standards and mechanisms already and continue to work with them!

Christian Dirschl (Wolters Kluwer) presenting the WKD Use Case on Enterprise Terminologies

Btw. I had a great workshop on Monday afternoon together with Christian Dirschl from Wolters Kluwer Germany (WKD), discussing applications on top of enterprise terminologies (controlled vocabularies using real linked (open) data principles). And: The Semantic Web Company (SWC) is already a partner of the publisher WKD – and this partnership is becoming more fruitful and sustainable every day – using real linked (open) data…