= Using RDF as metadata storage =

|| '''Date''' || 2012-10-28 ||
|| '''Contact(s)''' || Simon Pigot ||
|| '''Last edited''' || 2012-10-29 ||
|| '''Status''' || draft, being discussed, in progress ||
|| '''Assigned to release''' || Not yet assigned to a release ||
|| '''Resources''' || Not allocated yet ||
|| '''Ticket #''' || #XYZ ||

== Overview ==

!GeoNetwork stores metadata records from different schemas as rows in a database table. To provide search:
 * the metadata record is transformed into a common XML index document via XSLT;
 * the common XML document is ingested by Lucene, which creates an index of the fields within the document;
 * the Lucene index and query format are used for searching.

The essence of this proposal is to change this process as follows:
 * transform the metadata record into RDF (Resource Description Framework) facts when it is ingested by !GeoNetwork;
 * store the RDF facts in an RDF store;
 * use the RDF store and the SPARQL/GeoSPARQL query language for searching these facts.

Why would we do this?
 * A metadata record is a collection of facts: RDF is purpose-designed for representing facts and relationships between facts
 * RDF stores and the SPARQL/GeoSPARQL query language are designed to query facts and relationships between facts
 * Simplify the architecture of !GeoNetwork: metadata would be stored and searched in the same persistence solution - the RDF store

=== Proposal Type ===

 * '''Type''': Core Change
 * '''App''': !GeoNetwork
 * '''Module''': Data Manager, Search

=== Links ===

 * '''RDF at Wikipedia''': http://en.wikipedia.org/wiki/Resource_Description_Framework
 * '''RDF/XML W3C Document''': http://www.w3.org/TR/rdf-syntax-grammar/
 * '''Mappings from ISO metadata standards to RDF''': http://def.seegrid.csiro.au/isotc211/iso19115/2003/ (mapping from ISO19115 to RDF), http://def.seegrid.csiro.au/isotc211/iso19119/2005/ (mapping from ISO19119 to RDF)
 * '''GIT Repository of !GeoNetwork Branch developed by UWA Developers''': https://github.com/cipherj/core-geonetwork.git (rdf-store branch)
 * '''Apache JENA''': http://jena.apache.org/ (RDF triple store used in UWA patch)
 * '''Geospatial reasoning for Apache JENA''': http://code.google.com/p/geospatialweb/
 * '''Geospatial reasoning for OpenRDF-sesame''': https://dev.opensahara.com/projects/useekm
 * '''More on Geospatial and Temporal Reasoning for Apache JENA''': http://parliament.semwebcentral.org

=== Voting History ===

 * Not proposed for voting yet.

----

== Motivations ==

=== RDF, RDFS and OWL ===

RDF (Resource Description Framework) is a general method to decompose knowledge into simple facts consisting of entity-attribute-value (a triple known as subject-predicate-object in RDF terms). As an example in the metadata context, an RDF triple representing a fact from a metadata record could be:

{{{
dataset-title-'A Vegetation Map Of Tasmania'
}}}

where the entity/subject is dataset, the attribute/predicate is title and the value/object is the literal 'A Vegetation Map Of Tasmania'.
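The subject-predicate-object decomposition described above can be illustrated with a minimal Python sketch. It is not how Apache JENA or OpenRDF-Sesame store triples; it only shows facts as 3-tuples and a pattern match in which `None` plays the role of a SPARQL variable. Only the first fact comes from the text above; the other facts and the `match` helper are hypothetical.

```python
from typing import Iterable, Iterator, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Only the first fact is taken from the text; the rest are made up for illustration.
FACTS = [
    ("dataset", "title", "A Vegetation Map Of Tasmania"),
    ("dataset", "topicCategory", "biota"),
    ("dataset2", "title", "A Soil Map Of Tasmania"),
]


def match(facts: Iterable[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> Iterator[Triple]:
    """Yield every triple that fits the pattern; None acts as a wildcard."""
    for fs, fp, fo in facts:
        if (s is None or s == fs) and (p is None or p == fp) and (o is None or o == fo):
            yield (fs, fp, fo)


# Roughly "SELECT ?o WHERE { ?s title ?o }": every title, whatever the subject.
titles = [o for _, _, o in match(FACTS, p="title")]
```

A real triple store adds URIs for subjects and predicates, typed literals and indexing, but the query model is the same idea: fix some positions of the triple and let the store fill in the rest.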
RDF has an XML implementation (see Links section above). This triple encoded in RDF/XML would look something like this (the Dublin Core `dc:title` predicate and the subject URI are illustrative):

{{{
<rdf:Description rdf:about="http://example.org/dataset">
  <dc:title>A Vegetation Map Of Tasmania</dc:title>
</rdf:Description>
}}}

RDFS (or RDF Schema) adds a class structure to RDF triples.

A lot of resources in the !GeoNetwork space have been devoted to managing ISO19115 metadata records, ISO19115 metadata objects and relationships between objects and records. In many ways, exploring RDF is an attempt to achieve the same outcome:
 * using RDF means that metadata records would be broken down into small facts that can be used and reused in any metadata record (not just an ISO metadata record)
 * history shows that no single metadata standard has ever gained ascendancy over all domains, e.g. Taxon Concept Schema and Darwin Core are popular for biological data, Dublin Core remains popular for (which?) some applications, SDMX is used for statistical metadata and so on
 * using a generic fact representation seems to be a more worthwhile and sustainable approach

== Proposal ==

TBA

=== Issues with current patch ===

==== XML2RDF ====

Object Identifiers: One of the stated key advantages of RDF is that objects are identified once and then reused. In the work done to date, I can see that Responsible Parties are assigned identifiers based on the name of the responsible person. Presumably the RDF store would refuse to ingest the same responsible party if it was repeated in another record. But there are a few problems with this approach:
 * Name of responsible party may not be present (it's optional) or may be hidden
 * The RDF URI is hardcoded to use a CSIRO URL - probably should be !GeoNetwork or some other URL
 * Not sure that the SKOS elements are encoded correctly?

Perhaps the object identifier could be derived from an md5sum on the content of the object or it may use the XLink URL if the fragment being processed is XLinked.

Doesn't support XLinked records?
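The identifier scheme suggested under Object Identifiers above (the XLink URL when the fragment is XLinked, otherwise an md5 digest of the object's content) could be sketched roughly as follows. This is an illustration only, not code from the UWA patch: the `object_uri` function, the `BASE_URI` value and the whitespace-collapsing step are all assumptions, and a real implementation would probably need proper XML canonicalisation before hashing.

```python
import hashlib
from typing import Optional

# Hypothetical base URI; the text only says it should not be hardcoded to a CSIRO URL.
BASE_URI = "http://example.org/geonetwork/resource/"


def object_uri(fragment_xml: str, xlink_href: Optional[str] = None) -> str:
    """Return an identifier for a metadata object (e.g. a responsible party)."""
    if xlink_href:
        # An XLinked fragment already has a stable identifier: reuse it.
        return xlink_href
    # Assumption: collapse whitespace so that indentation differences between
    # records do not produce different digests. This is not XML canonicalisation.
    normalised = " ".join(fragment_xml.split())
    digest = hashlib.md5(normalised.encode("utf-8")).hexdigest()
    return BASE_URI + digest


party_a = "<party>\n  <org>Example Org</org>\n</party>"
party_b = "<party> <org>Example Org</org> </party>"
same_object = object_uri(party_a) == object_uri(party_b)
```

Because the digest depends only on content, the same responsible party appearing in two records gets the same URI even when the optional name element is absent, which addresses the first problem listed above.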
Patch uses the Apache JENA RDF store but !GeoNetwork already uses the OpenRDF-Sesame RDF store for its SKOS vocabularies - it would be better to use one of these RDF stores, not both?

When storing metadata records in the database table, they are allocated an integer ID in `DataManager`. I need to check that the `DataManagerRDF` code continues this practice, otherwise the permissions system and other parts of !GeoNetwork will not be able to manage the record properly.

==== Implications for wider adoption of RDF Store and Encoding in !GeoNetwork ====

Profile support in ISO19115 mapping: introduce additional RDF namespaces/concepts?

Speed of RDF triple stores versus Lucene? Free text search in SPARQL queries against the Apache JENA triple store is supported with help from Lucene - see the LARQ sub-project: http://jena.apache.org/documentation/larq/index.html

Spatial searching: At present we do mixed spatial and textual searches for OGC CSW query support by filtering Lucene searches with query results from the spatial database via !GeoTools. How would this work in SPARQL? OGC GeoSPARQL would be part of the approach here I suppose: http://code.google.com/p/geospatialweb/ How mature is the GeoSPARQL implementation for Apache JENA or OpenRDF-sesame? (see Links section: these projects don't look like they are mainstream parts of RDF frameworks yet?). Also, we use !GeoTools to parse the OGC filter language - how would this be converted to a GeoSPARQL query? (hmm).

Relationship to DCAT proposal?

=== Backwards Compatibility Issues ===

We have begun to use Lucene as a very fast persistence layer in place of the database (cf. for example, the search service q). Need to determine whether these queries can also be run quickly against the RDF triple store.

RDF mappings for other standards? Most popular standards will probably have projects in place or ongoing to do mappings to RDF (e.g. Dublin Core). However, some of the more substantial ISO efforts (ISO19110, ISO19135 and others) are less likely to have these, so doing them could be a fair body of work. That said, some of the metadata standards have some concepts mapped to ISO19115, so it could be reasonably straightforward to use that mapping to get to the RDF for ISO19115.

=== New libraries added ===

Apache JENA - used as RDF triple store in UWA patch (but OpenRDF-Sesame is already being used in !GeoNetwork).

== Risks ==

RDF and the semantic web have been the 'coming' technology for some time now. Cynically speaking, could this be another ebRIM? This risk is somewhat mitigated by the maturity of OpenRDF-sesame and Apache JENA and the fact that !GeoNetwork already relies upon OpenRDF-sesame for vocabulary support.

== Participants ==

 * Simon Cox, CSIRO Australia - instigator and RDF mappings
 * Wahhaj Ali, Tianyi Chen, Cameron Fitzgerald, Joshua Hollick, Saxon Jensen, Rebecca Papadopoulos - University of Western Australia - developed patch for trunk that uses an RDF metadata store
 * Simon Pigot, CSIRO Australia and !GeoNetwork PSC member