|Version 26 (modified by simonp, 11 months ago)|
Improvements to the OAIPMH Harvester
|Status||draft, being discussed|
|Assigned to release||2.9|
The OAIPMH harvester in GeoNetwork needs to be enhanced to support the following:
Metadata Object harvesting
GeoNetwork has recently moved from supporting simple ISO19115/19139 metadata in the form of a single 'record' to supporting a set of ISO19115/19139 metadata 'objects' with hierarchical relationships between them. The diagram below shows an example of some objects and relationships:
The mechanisms for handling these relationships are part of the ISO standard. They can be explicit in the form of an xlink that refers directly to the related metadata object or implicit by including the UUID of a related metadata object as content in an element. Here is an example of an explicit relationship between a metadata record and a fragment of contact information that it includes:
<mcp:metadataContactInfo>
  <mcp:CI_Responsibility>
    <mcp:role>
      <gmd:CI_RoleCode codeList="..." codeListValue="custodian"/>
    </mcp:role>
    <mcp:party xlink:href="http://mygeonetwork.com/xml.metadata.get?uuid=urn:marine.csiro.au:marlin:person:28_person_organisation"/>
  </mcp:CI_Responsibility>
</mcp:metadataContactInfo>
Here is an example of an implicit relationship where the UUID of the parent record in a parent-child relationship is held in the content of the parent identifier element:
<gmd:parentIdentifier>
  <gco:CharacterString>urn:marine.csiro.au:project:187</gco:CharacterString>
</gmd:parentIdentifier>
And another example of an implicit relationship where the UUID of the sibling record in a sibling relationship between a dataset metadata record and a project metadata record (uuid: urn:marine.csiro.au:marlin:project:187) is held as a code in an identifier element:
<gmd:aggregateInformation>
  <gmd:MD_AggregateInformation>
    <gmd:aggregateDataSetIdentifier>
      ...
      <gmd:MD_Identifier>
        <gmd:code>
          <gco:CharacterString>urn:marine.csiro.au:marlin:project:187</gco:CharacterString>
        </gmd:code>
      </gmd:MD_Identifier>
      ...
    </gmd:aggregateDataSetIdentifier>
    <gmd:associationType>
      <gmd:DS_AssociationTypeCode codeList="..." codeListValue="crossReference">crossReference</gmd:DS_AssociationTypeCode>
    </gmd:associationType>
    <gmd:initiativeType>
      <gmd:DS_InitiativeTypeCode codeList="..." codeListValue="project">project</gmd:DS_InitiativeTypeCode>
    </gmd:initiativeType>
  </gmd:MD_AggregateInformation>
</gmd:aggregateInformation>
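The three kinds of reference shown above could be collected from a record roughly as follows. This is only a sketch in Python using the standard library: the function name `find_references` and the shape of the returned dictionary are assumptions made for illustration, not part of GeoNetwork, and the namespace URIs are the usual ISO 19139 and XLink ones.

```python
import xml.etree.ElementTree as ET

GMD = "http://www.isotc211.org/2005/gmd"
GCO = "http://www.isotc211.org/2005/gco"
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def _strings(element):
    """Return the non-empty gco:CharacterString values found under element."""
    return [s.text.strip()
            for s in element.iter(f"{{{GCO}}}CharacterString")
            if s.text and s.text.strip()]


def find_references(record_xml):
    """Collect explicit (xlink) and implicit (UUID) references from a record.

    Hypothetical helper: the name and result layout are illustrative only.
    """
    root = ET.fromstring(record_xml)
    refs = {"xlinks": [], "parent": None, "siblings": []}

    # Explicit references: any element carrying an xlink:href attribute.
    for element in root.iter():
        href = element.get(XLINK_HREF)
        if href:
            refs["xlinks"].append(href)

    # Implicit parent reference: content of gmd:parentIdentifier (at most one).
    for parent in root.iter(f"{{{GMD}}}parentIdentifier"):
        values = _strings(parent)
        if values:
            refs["parent"] = values[0]
            break

    # Implicit sibling references: codes inside gmd:aggregateDataSetIdentifier.
    for aggregate in root.iter(f"{{{GMD}}}aggregateDataSetIdentifier"):
        for code in aggregate.iter(f"{{{GMD}}}code"):
            refs["siblings"].extend(_strings(code))

    return refs
```

Calling `find_references` on a record that contains the fragments above would list the contact xlink under `xlinks`, the parent UUID under `parent` and the project UUID under `siblings`.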
Some cardinality rules for the ISO relationships of interest:
- A record can have one parent identifier and a parent record can have more than one child.
- A record can have more than one sibling identifier.
- A record can have one or more fragment links and a fragment can be linked into more than one record.
- A service record may operate on the datasets described by more than one record.
OAIPMH and most other harvesters are record based, i.e. a harvest is expected to retrieve one or more complete metadata records. GeoNetwork's OAIPMH server returns records by resolving xlink references to metadata objects. For each xlink reference, the resolve process:
- finds the fragment of metadata referenced by the xlink (which could be local to the catalog or external to the catalog)
- copies the fragment of metadata into the record
Metadata objects that are implicitly referenced in the content (e.g. by UUID) are not resolved.
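The resolve process described above can be sketched as follows. This is an illustration in Python, not GeoNetwork's resolver: `resolve_xlinks` and the `fetch_fragment` callback are hypothetical names, and appending the fragment as a child of the linking element (and then dropping the `xlink:href` attribute) is an assumption about where the copy ends up.

```python
import copy
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def resolve_xlinks(record, fetch_fragment):
    """Return a copy of record with xlink references replaced by fragment copies.

    fetch_fragment(href) is a hypothetical callback that returns the referenced
    fragment as an Element (local or external), or None if it cannot be found.
    """
    resolved = copy.deepcopy(record)
    # Snapshot the elements first so the tree is not modified while iterating.
    for element in list(resolved.iter()):
        href = element.get(XLINK_HREF)
        if href is None:
            continue
        fragment = fetch_fragment(href)
        if fragment is None:
            continue  # leave the reference unresolved
        # Assumption: the fragment is copied in as a child of the linking element.
        element.append(copy.deepcopy(fragment))
        del element.attrib[XLINK_HREF]
    return resolved
```

Note that the sketch only touches xlink references; a `gmd:parentIdentifier` or sibling identifier is left exactly as it is, which matches the behaviour for implicit references described above. Xlinks nested inside a copied fragment are not followed in this sketch.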
One of the goals of this proposal is to provide an alternative OAIPMH harvester service that:
- retrieves metadata objects with unresolved references to other objects
- adds one instance of all referenced metadata objects to the OAIPMH harvest results
- extends the current OAIPMH implementation: the default behaviour will be to return resolved metadata records (i.e. the current implementation), while referencing the alternative OAIPMH service will deliver metadata objects.
The reason for implementing this extension is to enable conversions to other metadata schemas that use metadata objects and relationships, e.g. ANDS RIF-CS. Note that these schemas must support a subset of the objects and relationships available in the ISO19115/19139 model.
Support for deleted records is a feature of the OAIPMH standard that has been implemented in the GeoNetwork OAIPMH client but not in the GeoNetwork OAIPMH server. To implement this capability in the server, records (and metadata objects in general) that have been deleted from the catalog should remain searchable by date range and/or set, but be marked as deleted when they are returned in the results to an OAIPMH client.
The reason for implementing this feature of the OAIPMH standard is that it lets an OAIPMH client determine which records have been deleted when it performs an incremental/update harvest.
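To illustrate why the deleted status matters to a client, the sketch below applies the results of an incremental harvest to a local copy of the catalog. It is a simplified Python model: `apply_harvest`, the dictionary-based store and the layout of each harvested item are assumptions for illustration, not the GeoNetwork client code.

```python
def apply_harvest(local_store, harvested):
    """Update local_store (identifier -> metadata) from incremental harvest results.

    Each harvested item is a dict with an 'identifier', an optional 'status'
    and, for records that still exist, a 'metadata' value (hypothetical layout).
    """
    for item in harvested:
        if item.get("status") == "deleted":
            # The server reported the record as deleted: drop the local copy.
            local_store.pop(item["identifier"], None)
        else:
            # New or updated record: insert or overwrite the local copy.
            local_store[item["identifier"]] = item["metadata"]
    return local_store
```

Without the deleted marker, a record removed on the server would simply be absent from the incremental results, and the client would have no way to tell it apart from a record that had not changed.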
- Type: Core Change
- App: GeoNetwork
- Module: Kernel, Harvester, Data Manager, Lucene Index
The two components from the overview section of this proposal will be implemented as follows.
Metadata Object Harvesting
Currently, GeoNetwork stores the unresolved metadata record (i.e. without linked fragments) in the database. All links to fragments are resolved before the record is indexed in Lucene or returned as a search result. To support retrieving metadata objects, one of two approaches could be used:
- Use a Lucene search to retrieve records that match the OAIPMH query, then process the search results to add any referenced metadata objects of interest:
  - The first search returns the UUIDs of the metadata records that match the OAIPMH query.
  - Since XLinks and parent-child and sibling relationships are indexed in Lucene, the search results can be processed to collect all referenced metadata objects and add their UUIDs to the search results.
  - Each object can then be retrieved from the database by UUID in unresolved form and returned as a result.
- Store each metadata record and the objects referenced from it as a document block in Lucene (using Lucene 3.6). The OAIPMH query would search on the metadata record, but the results returned would include the referenced objects from the document block. A possible advantage over the first approach is that the search results would not have to be processed before they are returned, so this approach is likely to be considerably faster. A disadvantage is that the Lucene indexing process in GeoNetwork would need to be substantially modified.
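The first approach can be sketched as follows. This is a Python model with plain dictionaries standing in for the Lucene index and the database; `collect_metadata_objects` and both data structures are hypothetical. Following references transitively (for example, from a record to its parent and then to the parent's contact fragments) is an assumption of this sketch, not something the proposal specifies.

```python
from collections import deque


def collect_metadata_objects(matching_uuids, index, database):
    """Return (uuid, unresolved_xml) pairs for matching records and their references.

    index:    uuid -> list of UUIDs it references (xlinks, parent, siblings);
              stands in for the relationships indexed in Lucene.
    database: uuid -> unresolved XML; stands in for the metadata table.
    """
    seen = set()
    ordered = []
    queue = deque(matching_uuids)
    while queue:
        uuid = queue.popleft()
        if uuid in seen:
            continue  # each object is added to the results only once
        seen.add(uuid)
        ordered.append(uuid)
        # Assumption: references are followed transitively.
        queue.extend(index.get(uuid, []))
    # Retrieve each object in unresolved form; skip UUIDs that are not held locally.
    return [(uuid, database[uuid]) for uuid in ordered if uuid in database]
```

The `seen` set is what guarantees that a fragment linked from several records, such as a shared contact, appears only once in the harvest results.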
As RIF-CS is a metadata object standard that supports a subset of the ISO19115/19139 objects and object relationships, we can test extraction of the ISO metadata objects using the RIF-CS schema. To implement this we will need to adapt the RIF-CS XSLTs for iso19139.mcp, eml-gbif and iso19139.anzlic to handle:
- metadata objects such as person and organisation contact information as parties
- metadata records with a project hierarchy level as activities
- metadata records with a dataset hierarchy level as collections
- project sibling relationships as related activities
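The mapping listed above would be implemented in the RIF-CS XSLTs; the Python sketch below only restates it as a lookup so that the intended targets are explicit. The function name `rifcs_class` and the `kind` and `hierarchy_level` arguments are hypothetical labels chosen for this example.

```python
def rifcs_class(kind, hierarchy_level=None):
    """Return the RIF-CS class an ISO metadata object would be converted to.

    kind is 'contact' for person/organisation contact fragments or 'record'
    for metadata records (hypothetical labels). Returns None when the
    mapping above does not cover the object.
    """
    if kind == "contact":
        return "party"
    if kind == "record":
        if hierarchy_level == "project":
            return "activity"
        if hierarchy_level == "dataset":
            return "collection"
    return None
```

Under this mapping, a sibling relationship from a dataset record to a project record would link a collection to a related activity.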
Finally, we will implement an alternative OAIPMH service that uses the Lucene search for metadata objects together with the modified RIF-CS XSLTs. This should be a straightforward change to the existing OAIPMH server, as the code can decide whether to use the normal Lucene search for metadata records or the Lucene search for metadata objects based on the name of the OAIPMH service that was called. The alternative OAIPMH service could be called geonetwork/srv/eng/oaipmhMetadataObjects.
The estimated time to implement this feature, including testing, is 10 weeks.
OAIPMH server changes:
- Metadata records that are deleted from the GeoNetwork catalog will continue to be removed from the database table and archived as is currently implemented.
- Before removal, a metadata object will be added to a table of deleted records and indexed with status set to deleted. Normal searches on the catalog will not return these records as the default search terms will exclude records with this status (this will require a change to the default Lucene search terms).
- The OAIPMH server will advertise that it maintains deleted records in a persistent manner when answering the OAIPMH "Identify" request as per the OAI protocol document in the Links section.
- OAIPMH searches will cover both current and deleted records in the Lucene index. Deleted records will be returned with status="deleted" as per the OAIPMH standard.
The estimated time to implement this feature, including testing, is 3 weeks.
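As an illustration of the last two points, the sketch below builds the relevant pieces of an OAIPMH response in Python. The element names (`record`, `header`, `identifier`, `datestamp`, `metadata`, `deletedRecord`) and the `status="deleted"` attribute come from the OAI protocol document; the function names are hypothetical, and the OAI-PMH namespace is omitted to keep the example short.

```python
import xml.etree.ElementTree as ET


def record_element(identifier, datestamp, deleted=False, metadata=None):
    """Build an OAIPMH <record>; a deleted record has a header but no metadata."""
    record = ET.Element("record")
    header = ET.SubElement(record, "header")
    if deleted:
        header.set("status", "deleted")
    ET.SubElement(header, "identifier").text = identifier
    ET.SubElement(header, "datestamp").text = datestamp
    if not deleted and metadata is not None:
        ET.SubElement(record, "metadata").append(metadata)
    return record


def deleted_record_policy():
    """Build the <deletedRecord> element for the Identify response."""
    policy = ET.Element("deletedRecord")
    policy.text = "persistent"  # deleted records are kept with no time limit
    return policy
```

A deleted record serialised with `ET.tostring(record_element(uuid, date, deleted=True), encoding="unicode")` contains only the header, which is what tells the harvesting client to remove its own copy.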
- Documents: http://www.openarchives.org/OAI/openarchivesprotocol.html#DeletedRecords
- Email discussions: ANDS agenda for meeting about OAIPMH improvements.
- Other wiki discussions:
- Not proposed for voting yet.