Using RDF as metadata storage
Date | 2012-10-28 |
Contact(s) | Simon Pigot |
Last edited | 2012-10-31 |
Status | draft, being discussed, in progress |
Assigned to release | Not yet assigned to a release |
Resources | Not allocated yet |
Ticket # | #XYZ |
Overview
GeoNetwork stores metadata records from different schemas as rows in a database table. To provide search:
- a metadata record is transformed into a common XML index document via XSLT;
- the common XML document is ingested by Lucene, which creates an index of the fields within the document;
- the Lucene index and query syntax are used for searching.
The essence of this proposal is to replace this pipeline (or, to begin with, add an alternative to it) as follows:
- transform the metadata record into RDF (Resource Description Framework) facts when it is ingested by GeoNetwork
- store the RDF facts in an RDF store
- use the RDF store and the SPARQL/GeoSPARQL query language for searching these facts
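The three steps above can be pictured with a toy in-memory example. This is only a sketch under stated assumptions: the `TripleStore` class, the `record_to_triples` helper and the flat dict standing in for a metadata record are all hypothetical, and matching with `None` as a wildcard merely stands in for a SPARQL query against a real RDF store.

```python
from typing import Iterator, Tuple

Triple = Tuple[str, str, str]

# Dublin Core element namespace, used here only to name the predicates.
DC = "http://purl.org/dc/elements/1.1/"


def record_to_triples(uuid: str, fields: dict) -> list:
    """Step 1: turn a flat (hypothetical) metadata record into RDF-style facts."""
    subject = "urn:metadata:" + uuid
    return [(subject, DC + name, value) for name, value in fields.items()]


class TripleStore:
    """Step 2: toy stand-in for an RDF store, holding a set of triples."""

    def __init__(self) -> None:
        self._triples = set()

    def add_all(self, triples) -> None:
        self._triples.update(triples)

    def match(self, s=None, p=None, o=None) -> Iterator[Triple]:
        """Step 3: yield triples matching a pattern; None acts as a wildcard."""
        for triple in sorted(self._triples):
            if all(want is None or want == got
                   for want, got in zip((s, p, o), triple)):
                yield triple


store = TripleStore()
store.add_all(record_to_triples(
    "2d41a526-a595-4c57-a144-23ce4a922437",
    {"title": "A Vegetation Map Of Tasmania", "subject": "vegetation"},
))

# "Find every title" is a pattern with only the predicate fixed.
titles = [o for _, _, o in store.match(p=DC + "title")]
print(titles)
```

A real store would add persistence, indexing and a full query language; the point here is only that storage and search operate on the same set of facts.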
Why would we do this?
- A metadata record is a collection of facts: RDF is purpose-designed for representing facts and relationships between facts
- RDF stores and the SPARQL/GeoSPARQL query language are designed to query facts and relationships between facts
- Simplify the architecture of GeoNetwork: metadata would be stored and searched in the same persistence solution - the RDF store
Proposal Type
- Type: Core Change
- App: GeoNetwork
- Module: Data Manager, Search
Links
- RDF at Wikipedia: http://en.wikipedia.org/wiki/Resource_Description_Framework
- RDF/XML W3C Document: http://www.w3.org/TR/rdf-syntax-grammar/
- RDF at W3Schools: http://www.w3schools.com/rdf
- SKOS: http://www.w3.org/TR/skos-primer/
- Mappings from ISO metadata standards to RDF: http://def.seegrid.csiro.au/isotc211/iso19115/2003/ (mapping from ISO19115 to RDF), http://def.seegrid.csiro.au/isotc211/iso19119/2005/ (mapping from ISO19119 to RDF)
- Git repository of the GeoNetwork branch developed by UWA developers: https://github.com/cipherj/core-geonetwork.git (rdf-store branch)
- Apache JENA: http://jena.apache.org/ (RDF triple store used in UWA patch)
- Geospatial reasoning for Apache JENA: http://code.google.com/p/geospatialweb/
- Geospatial reasoning for OpenRDF-Sesame: https://dev.opensahara.com/projects/useekm
- More on Geospatial and Temporal Reasoning for Apache JENA: http://parliament.semwebcentral.org
Voting History
- Not proposed for voting yet.
Motivations
RDF, RDFS and OWL
RDF (Resource Description Framework) is a general method to decompose knowledge into simple facts consisting of entity-attribute-value (a triple confusingly rebadged as subject-predicate-object in RDF!). As an example in the metadata context, an RDF triple representing a fact from a metadata record could be:
urn:metadata:2d41a526-a595-4c57-a144-23ce4a922437-title-'A Vegetation Map Of Tasmania'
where the entity/subject is urn:metadata:2d41a526-a595-4c57-a144-23ce4a922437, the attribute/predicate is title and the value/object is 'A Vegetation Map Of Tasmania'. Basically, the entity/subject and the value/object are 'things' and the attribute/predicate is the relationship between them, or in English: "The title of urn:metadata:2d41a526-a595-4c57-a144-23ce4a922437 is 'A Vegetation Map Of Tasmania'".
RDF has an XML implementation (see http://www.w3.org/TR/rdf-syntax-grammar). This triple encoded in RDF/XML might look like:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:my="http://mynamespace.org.au">
  <rdf:Description rdf:about="urn:metadata:2d41a526-a595-4c57-a144-23ce4a922437">
    <my:title>A Vegetation Map Of Tasmania</my:title>
  </rdf:Description>
</rdf:RDF>
We have introduced a namespace (prefix my) to distinguish our title element from those elements provided by the RDF framework (they have namespace prefix rdf). To avoid inventing new elements in what is essentially a metadata language, RDF/XML often incorporates the (well-known) Dublin Core metadata elements (probably because these are flat, simple elements). So we could rewrite our example above to use the dc:title element from Dublin Core:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="urn:metadata:2d41a526-a595-4c57-a144-23ce4a922437">
    <dc:title>A Vegetation Map Of Tasmania</dc:title>
  </rdf:Description>
</rdf:RDF>
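To see that the RDF/XML above really is just an encoding of the triple, here is a minimal sketch that reads it back into (subject, predicate, object) form using only the Python standard library. The `simple_triples` helper is hypothetical and handles only flat rdf:Description blocks with literal values; a real RDF parser (such as the one in Apache JENA or OpenRDF-Sesame) covers far more of the syntax.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

RDF_XML = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="urn:metadata:2d41a526-a595-4c57-a144-23ce4a922437">
    <dc:title>A Vegetation Map Of Tasmania</dc:title>
  </rdf:Description>
</rdf:RDF>"""


def simple_triples(rdf_xml):
    """Extract (subject, predicate, literal) triples from flat rdf:Description blocks."""
    root = ET.fromstring(rdf_xml)
    triples = []
    for desc in root.findall("{%s}Description" % RDF_NS):
        subject = desc.get("{%s}about" % RDF_NS)
        for prop in desc:
            # ElementTree names elements "{namespace}local"; RDF forms the
            # predicate URI by concatenating the namespace and the local name.
            namespace, local = prop.tag[1:].split("}")
            triples.append((subject, namespace + local, prop.text))
    return triples


triples = simple_triples(RDF_XML)
for s, p, o in triples:
    print(s, p, repr(o))
```

Note that the predicate comes out as the full URI http://purl.org/dc/elements/1.1/title, which is what makes it recognisable to any other RDF-aware system.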
RDFS (or RDF Schema) adds the concept of user-defined classes and defines the rules governing classes (eg. specialization), so we can now represent classes of 'things' and the different types of relationships between these 'things'. So in our example above, the title of the metadata record could be represented as part of a citation class, alongside an alternate title:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:my="http://mynamespace.org.au">
  <rdfs:Class rdf:about="urn:metadata:2d41a526-a595-4c57-a144-23ce4a922437">
    <rdfs:subClassOf rdf:resource="#citation"/>
    .... other classes that make up the metadata description ....
  </rdfs:Class>
  <rdfs:Class rdf:ID="citation">
    <rdfs:subClassOf rdf:resource="#title"/>
    .... other classes that make up the metadata citation ....
  </rdfs:Class>
  <rdfs:Class rdf:ID="title">
    <dc:title>A Vegetation Map Of Tasmania</dc:title>
    <my:alternateTitle>Tasmanian Vegetation Map from 1952</my:alternateTitle>
  </rdfs:Class>
</rdf:RDF>
OWL (Web Ontology Language) builds on RDF and RDFS, adding richer ways to describe classes and properties that a reasoner can use to infer new facts. A well-known vocabulary defined this way is SKOS. A simplified version of the class diagram for an OWL encoding of the gmd:citation element from ISO19115/19139 is shown in the attached Citation.png (see the whole lot at http://def.seegrid.csiro.au/isotc211/iso19115/2003/citation/).
GeoNetwork currently:
- manages ISO19115 metadata records, ISO19115 metadata objects and relationships between objects and records
- maps meanings between different standards to a generic set of fields for searching
In many ways, this is replicating what can be done with RDF. But:
- it is somewhat limited to the ISO metadata standard, and no single metadata standard has ever gained ascendancy over all domains: eg. Taxon Concept Schema and Darwin Core are popular for biological data, Dublin Core remains popular for some applications (which?), SDMX is used for statistical metadata, and so on
- RDF is a generic approach to handling knowledge and it is specifically intended to support mappings between knowledge in different domains (eg. mappings between concepts in different metadata standards from different domains)
Proposal
- The UWA patch already converts ISO19115/19139 to RDF and stores it in an RDF store: any issues with the patch need to be addressed (see section below)
- Extend RDF mapping to profiles of ISO and then to those standards that already have an RDF mapping (harmonize with OWL mapping by Simon Cox?)
- Implement and test searching using SPARQL/GeoSPARQL on the RDF store. This could be done by adding an RDFSearcher subclass of MetaSearcher (this would complement LuceneSearcher and Z3950Searcher) and (perhaps) subclassing SearchManager into LuceneSearchManager and RDFSearchManager. (CatalogSearcher, which is used for CSW searches, is a problem as it is not a subclass of MetaSearcher - sigh).
- More...
Brief Review of UWA Developer patch
Changes:
- web/src/main/java/org/fao/geonet/Geonetwork.java
- web/src/main/java/org/fao/geonet/kernel/DataManager.java
Adds:
- web/src/main/java/org/fao/geonet/kernel/DataManagerRDF.java
- web/src/main/webapp/xsl/rdf2xml.xsl
- web/src/main/webapp/xsl/xml2rdf.xsl
XML2RDF and RDF2XML
Object Identifiers: One of the stated key advantages of RDF is that objects are identified once and then reused. In the work done to date, I can see that Responsible Parties are assigned identifiers based on the name of the responsible person. Presumably the RDF store would refuse to ingest the same responsible party if it was repeated in another record? But there are a few problems with this approach:
- Name of responsible party may not be present (it's optional) or may be hidden
- The RDF URI is hardcoded to use a CSIRO URL - probably should be GeoNetwork or some other URL
Perhaps a safer method would be to generate the object identifier from (say) an md5sum on the content of the object? Or if an XLink exists (ie. the content is already shared) then it could use the XLink URL.
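A minimal sketch of that idea follows, assuming Python 3.8+ for `xml.etree.ElementTree.canonicalize`. The `object_uri` function, the `BASE_URI` value and the `<party>` fragment are hypothetical and only illustrate the approach; the patch itself does not do this. The fragment is canonicalized first so that differences in indentation do not produce different identifiers for the same content.

```python
import hashlib
import xml.etree.ElementTree as ET

# Hypothetical base URI; a real deployment would use the GeoNetwork site URL.
BASE_URI = "http://example.org/geonetwork/object/"


def object_uri(fragment_xml, xlink_href=None):
    """Return a stable URI for a reusable metadata object.

    If the object is already shared through an XLink, reuse that URL.
    Otherwise derive the identifier from an md5 digest of the content.
    """
    if xlink_href:
        return xlink_href
    # Canonical XML with surrounding whitespace stripped, so that formatting
    # differences alone do not change the digest.
    canonical = ET.canonicalize(fragment_xml, strip_text=True)
    digest = hashlib.md5(canonical.encode("utf-8")).hexdigest()
    return BASE_URI + digest


compact = "<party><org>Example Org</org><role>owner</role></party>"
indented = """<party>
  <org>Example Org</org>
  <role>owner</role>
</party>"""

print(object_uri(compact))
print(object_uri(compact) == object_uri(indented))
print(object_uri(compact, xlink_href="http://example.org/shared/party/1"))
```

This works whether or not the name of the responsible party is present, but two objects that differ in any element (say, a changed phone number) get different identifiers, so it identifies content rather than the real-world party.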
Doesn't support XLinks?
Doesn't support parentIdentifier/aggregationInfo methods of creating links to other metadata records? How complete is the mapping from ISO19115/19139 to RDF on http://def.seegrid.csiro.au/isotc211/iso19115/2003/?
DataManagerRDF.java
Patch uses Apache JENA RDF store but GeoNetwork already uses the OpenRDF-Sesame RDF store for its SKOS vocabularies - would be better to use one of these RDF stores, not both? (Perhaps OpenRDF-Sesame doesn't support something needed here?)
DataManagerRDF adds extra methods to DataManager that (at the moment) simply store the metadata record (in RDF) in the RDF store. This duplicates the metadata stored in the database (which is fine for test purposes). These methods seem to be coded efficiently.
Unfortunately the only mapping to RDF currently supported is the ISO19115/19139 mapping on http://def.seegrid.csiro.au/isotc211/iso19115/2003/. GeoNetwork also supports other metadata schemas (eg. Dublin Core, ISO profiles etc). Simple checks need to be added to DataManager to avoid calling DataManagerRDF if the schema of the metadata record does not have a mapping to RDF. This could be achieved by moving the xml2rdf.xsl and rdf2xml.xsl XSLTs from web/geonetwork/xsl to the convert dir of the iso19139 plugin schema and then implementing a check in DataManagerRDF to see whether these XSLTs exist in a particular plugin schema (this is a small change that would make the current patch work better with GeoNetwork).
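The check itself is small. The sketch below shows the logic in Python for brevity, although the real check would live in the Java DataManagerRDF class. The `has_rdf_mapping` function and the `<schema plugins dir>/<schema>/convert/` layout are assumptions made for illustration, and the temporary directory exists only to make the example runnable.

```python
import tempfile
from pathlib import Path

# The two stylesheets the patch adds; both are needed for a round trip.
RDF_XSLTS = ("xml2rdf.xsl", "rdf2xml.xsl")


def has_rdf_mapping(schema_plugins_dir, schema):
    """True if the schema's convert dir holds both RDF stylesheets.

    Assumes a layout of <schema_plugins_dir>/<schema>/convert/.
    """
    convert_dir = Path(schema_plugins_dir) / schema / "convert"
    return all((convert_dir / name).is_file() for name in RDF_XSLTS)


# Build a throwaway directory tree: iso19139 has the XSLTs, dublin-core does not.
with tempfile.TemporaryDirectory() as root:
    iso_convert = Path(root) / "iso19139" / "convert"
    iso_convert.mkdir(parents=True)
    for name in RDF_XSLTS:
        (iso_convert / name).write_text("<!-- placeholder stylesheet -->")
    (Path(root) / "dublin-core" / "convert").mkdir(parents=True)

    results = {schema: has_rdf_mapping(root, schema)
               for schema in ("iso19139", "dublin-core")}

print(results)
```

DataManager would then call the RDF methods only when the check returns true for the schema of the record being stored, so adding an RDF mapping to another schema becomes a matter of dropping the two XSLTs into its convert dir.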
Other Java Changes
Minor: to include DataManagerRDF calls in DataManager and provide init info from GeoNetwork.java - fine.
Implications for wider adoption of RDF Store and Encoding in GeoNetwork
Speed of RDF triple stores versus Lucene? Free text search in Apache JENA RDF triple store/SPARQL queries is supported by using Lucene to help - see the LARQ sub-project: http://jena.apache.org/documentation/larq/index.html
Spatial searching: At present we do mixed spatial and textual searches for OGC CSW query support by filtering Lucene searches with query results from the spatial database via GeoTools. How would this work in SPARQL? OGC GeoSPARQL would be part of the approach here I suppose: http://code.google.com/p/geospatialweb/ How mature is the GeoSPARQL implementation for Apache JENA or OpenRDF-Sesame? (See the links section: these projects don't look like they are mainstream parts of RDF frameworks yet?) Also, we use GeoTools to parse the OGC filter language - how would this be converted to a GeoSPARQL query? (hmm).
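As a rough indication of what the simplest case might look like, the sketch below turns a bounding box (the kind of constraint an OGC BBOX filter carries) into a GeoSPARQL query string. The `geo:` and `geof:` terms come from the OGC GeoSPARQL vocabulary, but the `bbox_to_wkt` and `bbox_query` functions and the assumed graph shape (each record having a `geo:hasGeometry` link to a geometry with a WKT literal) are hypothetical. The query is only built here, not run against any store.

```python
def bbox_to_wkt(west, south, east, north):
    """Encode a bounding box as a closed WKT polygon (longitude latitude order)."""
    corners = [(west, south), (east, south), (east, north),
               (west, north), (west, south)]
    return "POLYGON((%s))" % ", ".join("%s %s" % c for c in corners)


def bbox_query(west, south, east, north):
    """Build a GeoSPARQL query for records whose geometry intersects the box.

    Assumes records are linked to geometries via geo:hasGeometry/geo:asWKT.
    """
    wkt = bbox_to_wkt(west, south, east, north)
    return """PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT ?record ?title WHERE {
  ?record dc:title ?title ;
          geo:hasGeometry ?geom .
  ?geom geo:asWKT ?wkt .
  FILTER(geof:sfIntersects(?wkt, "%s"^^geo:wktLiteral))
}""" % wkt


# A box roughly around Tasmania, combined with the textual part of the query.
query = bbox_query(144.0, -44.0, 149.0, -39.0)
print(query)
```

The attraction is that the textual and spatial constraints sit in one query rather than being two result sets that have to be intersected; translating the full OGC filter language this way remains the open question.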
Relationship to DCAT proposal? The DCAT implementation already turns metadata records into RDF so that external semantic web/linked data services can obtain RDF from GeoNetwork. Do we need more RDF? The answer here is that the DCAT proposal deals only with turning GeoNetwork records into RDF for external services, whereas this proposal is about changing GeoNetwork's internal representation to RDF. The advantage of this proposal is that GeoNetwork will be able to integrate RDF facts from external sources into its internal representation, eg. linking taxonomic concepts provided by an external RDF service.
Backwards Compatibility Issues
We have begun to use Lucene as a very fast persistence layer in place of the database (see, for example, search service q). We need to determine whether these queries can also be run quickly against the RDF triple store.
RDF mappings for other standards? Most popular standards (eg. Dublin Core) will probably have projects in place or ongoing to do mappings to RDF. However, some of the more substantial ISO efforts (ISO19110, ISO19135 and others) are less likely to have these, so there could be a fair body of work in doing them. That said, some of the metadata standards have some concepts mapped to ISO19115, so it could be reasonably straightforward to use that mapping to reach the RDF for ISO19115.
New libraries added
Apache JENA - used as RDF triple store in UWA patch (but OpenRDF-Sesame is already being used in GeoNetwork).
Risks
RDF and the semantic web have been the 'coming' technology for some time now. (Skeptically) Could this be another ebRIM? This risk is somewhat mitigated by the maturity of OpenRDF-Sesame and Apache JENA and the fact that GeoNetwork already relies upon OpenRDF-Sesame for vocabulary support.
Participants
- Simon Cox, CSIRO Australia - instigator and RDF mappings
- Wahhaj Ali, Tianyi Chen, Cameron Fitzgerald, Joshua Hollick, Saxon Jensen, Rebecca Papadopoulos - University of Western Australia - developed patch for trunk that uses an RDF metadata store
- Simon Pigot, CSIRO Australia and GeoNetwork PSC member
Attachments (1)
- Citation.png (131.6 KB): ci:Citation in owl