Metadata change logging/revision log

Date: 2012-01-08
Contact(s): Simon Pigot
Last edited: 2012-01-17T00:30:00
Status: Complete - without implementation of flagging inconsistent subversion revisions
Assigned to release: 2.7.x
Resources: Available
Ticket: #726

Overview

There are many use cases where the complete history of changes to a metadata record and its properties (eg. privileges, categories and status) needs to be captured. This proposal adds a local filesystem subversion repository to GeoNetwork and code to capture these changes.

Proposal Type

  • Type: Core Change
  • App: GeoNetwork
  • Module: Kernel, Data Manager
  • Documents:
  • Email discussions:
  • Other wiki discussions:

Voting History

  • Proposed on 2012-01-13, Francois +1, Emanuele +1, Patrizia +1, Jeroen +1, Andrea 0

Motivations

In its current form GeoNetwork does not capture any details about changes to metadata records or properties of metadata records (eg. privileges, categories, status). Instead, only the latest version of the metadata record and its current properties are available. However there are many use cases where it is important to be able to track (over time):

  • changes to the metadata record ie. changes to individual elements
  • changes to properties of the metadata record eg. privileges, categories, status

As we use a database to hold both the metadata records and their properties, we could implement history tables to capture these changes and provide a user interface that allows the user to query the information in these tables. Alternatively, we could use a subversion repository to capture these changes and allow the user to examine the changes through the various visual interfaces to subversion repositories that already exist eg. viewvc. Apart from the advantage of ready to use tools for examining the changes, the subversion approach is efficient for XML files and simple to maintain.

Proposal

Using SVNKit, an open source Java API to subversion from tmatesoft, we will implement change tracking for metadata records and their properties in a subversion repository created and maintained by the GeoNetwork code.

Not all records will be tracked, as the compute and system administration cost of tracking every record is too high, particularly in larger catalogs. Instead, only those records that are edited and updated within the local GeoNetwork instance will be tracked in the subversion repository.

The database will remain the point of truth for GeoNetwork. That is, changes will be tracked in subversion, but the database will continue to be the facility used by all services. For example, although it is possible to extract the latest version of a metadata record from the subversion repository, all services will continue to extract the latest version of the metadata record from the database table.

Using a subversion repository in place of database history tables forces us to think about keeping the subversion repository and the database consistent, ie. committing or aborting both together. In developing this proposal we examined three approaches:

  • auto commit the subversion repository after every change to the metadata record or its properties
  • apply the changes to the subversion repository as they are made, then commit/abort the subversion repository and database at the same time
  • set a flag saying that changes have been made and then at database commit, query the database and commit changes to the subversion repository

The first approach is the easiest to code, particularly with regard to maintaining consistency between the subversion repository and the database: if a database operation fails we don't make any changes to the subversion repository, and if any of the subversion repository commits fail we can abort the current database commit. However, except for the simplest operations on a single record, the changes recorded in the subversion repository will bear little or no resemblance to the changes made by GeoNetwork services. For example, if the user changes the privileges on a metadata record, this results in more than one commit to the subversion repository (in fact the number of commits equals the number of group permissions selected in the privilege interface, as they are set one by one in the DataManager).

The second approach is slightly more difficult to code: subversion changes need to be bundled by keeping the subversion commit editor open and using a listener to commit/ignore the changes to the subversion repository when the database is committed/aborted. This scenario is complicated by the design of the tmatesoft api which does not allow reentrant calls on a subversion repository object and does not allow files and directories in the repository to be opened more than once in a transaction eg. as described at  http://osdir.com/ml/version-control.subversion.javasvn.user/2007-10/msg00053.html.

The third approach is the one that has been implemented. The coding is more straightforward than the second approach, only slightly more complex than the first and it captures the state of both the metadata record and its properties at database commit time.

To illustrate the third approach, let's examine a typical scenario where we wish to capture changes to the privileges of a metadata record made by a user in the 'Set Privileges' function:

  • This function ultimately calls the setOperation method in the DataManager to change the privileges for the metadata in the database.
  • In setOperation we add a call to setHistory in the SvnManager which records the id of the metadata record against the database channel.
  • Just prior to the database channel being committed at the end of the 'Set Privileges' function, the listener on the database channel reads the privileges for the metadata record and commits any changes to the subversion repository. If the subversion commit fails, then the subversion repo rolls back and the database commit is aborted and rolls back.
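The steps above can be sketched in a few lines of Java. This is only an illustration of the flag-then-commit idea, not the code in the patch: HistoryStore, DbChannel and DeferredHistoryTracker are hypothetical names standing in for the subversion side, the database channel and the listener, and only setHistory is a name taken from the description above.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical stand-in for the subversion side; this is not the SVNKit API. */
interface HistoryStore {
    /** Writes the current state of the given records in one commit and returns the new revision. */
    long commit(Set<String> metadataIds) throws Exception;
}

/** Hypothetical stand-in for a database channel that can be committed or aborted. */
interface DbChannel {
    void commit();
    void abort();
}

/** Remembers which records changed during a transaction and versions them at commit time. */
class DeferredHistoryTracker {
    private final Set<String> changedIds = new LinkedHashSet<>();
    private final HistoryStore store;

    DeferredHistoryTracker(HistoryStore store) {
        this.store = store;
    }

    /** Called from methods such as setOperation: only flags the record, no repository work yet. */
    void setHistory(String metadataId) {
        changedIds.add(metadataId);
    }

    /** Versions the flagged records, then commits the database; returns false if the history commit failed. */
    boolean commit(DbChannel db) {
        try {
            if (!changedIds.isEmpty()) {
                // One history commit per database transaction, however many changes were flagged.
                store.commit(new LinkedHashSet<>(changedIds));
            }
        } catch (Exception historyFailure) {
            // The history commit failed, so the database commit is aborted as well.
            db.abort();
            return false;
        } finally {
            changedIds.clear();
        }
        db.commit();
        return true;
    }
}
```

Because the ids are held in a set, setting ten group permissions on one record flags that record once and produces a single history commit, which is the behaviour the first approach could not give us.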

Note that the current transaction isolation setting for database connections from the apache database connection pool used in GeoNetwork is read committed. If metadata versioning is configured on, the isolation level for these connections will need to be set to the stricter serializable level, so that changes made by one transaction to the record and its properties will not overlap with changes committed by another transaction. Note: the transaction isolation level in GeoNetwork was serializable before version 2.6. The documentation will need to warn GeoNetwork admins who configure the database pool through JNDI to set the transaction isolation level to serializable if they want to use metadata versioning.

One reviewer asked what happens if the database commit fails. The answer is that the subversion repository will be left with the results of its commit while the database is rolled back, so the database and the subversion repository will be inconsistent. A number of strategies were examined to cope with this.
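In JDBC terms the two isolation levels mentioned here are the constants Connection.TRANSACTION_READ_COMMITTED and Connection.TRANSACTION_SERIALIZABLE. A minimal sketch of choosing between them from a configuration flag follows; the class name VersioningIsolation and the versioningEnabled flag are assumptions made for illustration and are not taken from the patch.

```java
import java.sql.Connection;
import java.sql.SQLException;

/** Hypothetical helper that picks the JDBC isolation level from a versioning flag. */
class VersioningIsolation {

    /** Serializable when metadata versioning is on, read committed otherwise. */
    static int levelFor(boolean versioningEnabled) {
        return versioningEnabled
                ? Connection.TRANSACTION_SERIALIZABLE
                : Connection.TRANSACTION_READ_COMMITTED;
    }

    /** Applies the chosen level to a connection obtained from the pool. */
    static void apply(Connection connection, boolean versioningEnabled) throws SQLException {
        connection.setTransactionIsolation(levelFor(versioningEnabled));
    }
}
```

When the pool is created in code rather than through JNDI, the same constant could instead be set once as the pool-wide default for new connections, so that individual connections do not need to be adjusted.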

The first involves looking at a two phase commit which is the strategy normally used for maintaining two or more databases in a consistent state. The idea behind a two phase database commit is that the transaction is managed by a transaction manager that does the commit to all participating databases in two stages: firstly asking all participating databases to prepare the commit and to respond when they are ready to commit and then when all are ready they are committed. For our purposes, it would be useful if we could determine when the database had finished preparing the commit as from that point onwards the database commit will almost always succeed (which is the purpose of the first stage of the two phase commit). With the database commit almost certain to succeed it would be reasonable to commit the subversion repository changes and then proceed with the second phase of the database commit if the subversion commit was successful. It doesn't seem possible to suspend the two phase commit after preparation in the javax.transaction.TransactionManager however it might be possible to do this by directly calling the two phase commit database driver that implements javax.transaction.xa.XAResource. In any case, two phase commit is not foolproof so the possibility of inconsistency is still real. (Note it is possible to use two phase commit enabled JDBC drivers for most of the databases we support in GeoNetwork, together with the apache database connection pool and an open source Transaction manager such as Atomikos - see for example  http://lafernando.com/2011/01/05/xa-transactions-with-apache-dbcp).
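The ordering discussed in this strategy (prepare the database, commit the history, then complete the database commit) can be sketched as follows. All of the names here are hypothetical: PreparableDatabase stands in for whatever would expose the prepare phase, and as noted above it is not clear that the transaction manager allows a commit to be suspended at that point.

```java
/** Hypothetical view of a database whose commit can be split into prepare and complete phases. */
interface PreparableDatabase {
    boolean prepare();        // phase one: true if the database is ready to commit
    void commitPrepared();    // phase two: complete the prepared commit
    void rollbackPrepared();  // abandon the transaction
}

/** Hypothetical stand-in for the subversion commit. */
interface HistoryCommit {
    void commit() throws Exception;
}

class PrepareThenVersion {
    /** Returns true only if both the history commit and the database commit completed. */
    static boolean run(PreparableDatabase db, HistoryCommit history) {
        if (!db.prepare()) {
            db.rollbackPrepared();
            return false;
        }
        try {
            // The database is now very likely to commit, so record the history.
            history.commit();
        } catch (Exception historyFailure) {
            db.rollbackPrepared();
            return false;
        }
        // A failure here is still possible, which is why this is not foolproof.
        db.commitPrepared();
        return true;
    }
}
```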

The second strategy is based on accepting that some inconsistency is always possible and taking some action in the subversion repository to either correct the inconsistent commit (reverse-merge with the previous version) or flag the inconsistent version accordingly. This relies on the fact that we know the version number in the subversion repository after we have committed the changes and can hold that version number with the database connection listener until after the database commit. Because we will catch any database commit failures at a very late stage (after all the SQL statements have executed successfully), it is very likely that the inconsistent version we end up with in subversion will be of interest to the user anyway, so we will take the approach of flagging the version as inconsistent.
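As a rough sketch of the flagging idea (the status line at the top of this page notes that flagging has not been implemented, so this is purely illustrative), the listener could hold the revision number until the outcome of the database commit is known. RevisionFlagger and CommitOutcomeListener are hypothetical names; how the flag is stored in the repository, for example as a property on the revision, is left open here.

```java
/** Hypothetical hook that marks a revision in the history repository as inconsistent. */
interface RevisionFlagger {
    void flagInconsistent(long revision, String reason);
}

/** Holds the revision produced by the history commit until the database commit outcome is known. */
class CommitOutcomeListener {
    private static final long NONE = -1;

    private final RevisionFlagger flagger;
    private long pendingRevision = NONE;

    CommitOutcomeListener(RevisionFlagger flagger) {
        this.flagger = flagger;
    }

    /** Called after the history commit succeeds, before the database commit is attempted. */
    void historyCommitted(long revision) {
        pendingRevision = revision;
    }

    /** The database agreed with the history, so there is nothing to flag. */
    void databaseCommitSucceeded() {
        pendingRevision = NONE;
    }

    /** The database rolled back, so the held revision no longer matches the database. */
    void databaseCommitFailed(Exception cause) {
        if (pendingRevision != NONE) {
            flagger.flagInconsistent(pendingRevision, cause.getMessage());
            pendingRevision = NONE;
        }
    }
}
```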

The last question about transactions relates to how overlapping transactions are handled in subversion. The answer from  http://wiki.svnkit.com/SVNKit_FAQ#Q:_In_case_of_heavy_editing_through_calls_of_the_editor.27s_methods_it_is_possible.2C_that_another_SVN_client_took_a_commit_meanwhile.3F appears to be that the SVNKit ISVNEditor class isolates different transactions from each other whilst they are in progress but merges the changes from each on commit. If the merge does not succeed then an exception is thrown, the changes to the subversion repository in the transaction will be rolled back and the database commit will be aborted.

The metadata properties are stored in the subversion repository as XML files. Each metadata record has its own directory in the repository, named after the id of the metadata record, which typically contains:

  • metadata.xml - a record of changes to the content of the metadata record itself
  • owner.xml - an XML file describing the owner of the metadata record
  • privileges.xml - an XML file describing the privileges of the metadata record
  • categories.xml - an XML file describing the categories to which the metadata record has been assigned
  • status.xml - an XML file describing the status of the metadata (eg. Approved, Rejected, etc)

A typical example of a privileges.xml file stored in the repository:

<response>
  <record>
    <group_name>intranet</group_name>
    <operation_id>0</operation_id>
    <operation_name>view</operation_name>
  </record>
  <record>
    <group_name>sample</group_name>
    <operation_id>0</operation_id>
    <operation_name>view</operation_name>
  </record>
  <record>
    <group_name>all</group_name>
    <operation_id>0</operation_id>
    <operation_name>view</operation_name>
  </record>
  <record>
    <group_name>intranet</group_name>
    <operation_id>1</operation_id>
    <operation_name>download</operation_name>
  </record>
  <record>
    <group_name>all</group_name>
    <operation_id>1</operation_id>
    <operation_name>download</operation_name>
  </record>
  <record>
    <group_name>sample</group_name>
    <operation_id>3</operation_id>
    <operation_name>notify</operation_name>
  </record>
  <record>
    <group_name>intranet</group_name>
    <operation_id>5</operation_id>
    <operation_name>dynamic</operation_name>
  </record>
  <record>
    <group_name>all</group_name>
    <operation_id>5</operation_id>
    <operation_name>dynamic</operation_name>
  </record>
  <record>
    <group_name>intranet</group_name>
    <operation_id>6</operation_id>
    <operation_name>featured</operation_name>
  </record>
  <record>
    <group_name>all</group_name>
    <operation_id>6</operation_id>
    <operation_name>featured</operation_name>
  </record>
</response>

Difference between revisions 3 and 4 for the privileges.xml file for metadata record 10:

svn diff -r 3:4
Index: 10/privileges.xml
===================================================================
--- 10/privileges.xml   (revision 3)
+++ 10/privileges.xml   (revision 4)
@@ -1,12 +1,52 @@
 <response>
   <record>
+    <group_name>intranet</group_name>
+    <operation_id>0</operation_id>
+    <operation_name>view</operation_name>
+  </record>
+  <record>
     <group_name>sample</group_name>
     <operation_id>0</operation_id>
     <operation_name>view</operation_name>
   </record>
   <record>
+    <group_name>all</group_name>
+    <operation_id>0</operation_id>
+    <operation_name>view</operation_name>
+  </record>
+  <record>
+    <group_name>intranet</group_name>
+    <operation_id>1</operation_id>
+    <operation_name>download</operation_name>
+  </record>
+  <record>
+    <group_name>all</group_name>
+    <operation_id>1</operation_id>
+    <operation_name>download</operation_name>
+  </record>
+  <record>
     <group_name>sample</group_name>
     <operation_id>3</operation_id>
     <operation_name>notify</operation_name>
   </record>
+  <record>
+    <group_name>intranet</group_name>
+    <operation_id>5</operation_id>
+    <operation_name>dynamic</operation_name>
+  </record>
+  <record>
+    <group_name>all</group_name>
+    <operation_id>5</operation_id>
+    <operation_name>dynamic</operation_name>
+  </record>
+  <record>
+    <group_name>intranet</group_name>
+    <operation_id>6</operation_id>
+    <operation_name>featured</operation_name>
+  </record>
+  <record>
+    <group_name>all</group_name>
+    <operation_id>6</operation_id>
+    <operation_name>featured</operation_name>
+  </record>
 </response>

Examination of this diff file shows that privileges for the 'All' and 'Intranet' groups have been added between revision 3 and 4 - in short, the record has been published.

Here is an example of a change that has been made to a metadata record:

svn diff -r 2:3
Index: 10/metadata.xml
===================================================================
--- 10/metadata.xml     (revision 2)
+++ 10/metadata.xml     (revision 3)
@@ -61,7 +61,7 @@
     </gmd:CI_ResponsibleParty>
   </gmd:contact>
   <gmd:dateStamp>
-    <gco:DateTime>2012-01-10T01:47:51</gco:DateTime>
+    <gco:DateTime>2012-01-10T01:48:06</gco:DateTime>
   </gmd:dateStamp>
   <gmd:metadataStandardName>
     <gco:CharacterString>ISO 19115:2003/19139</gco:CharacterString>
@@ -85,7 +85,7 @@
       <gmd:citation>
         <gmd:CI_Citation>
           <gmd:title>
-            <gco:CharacterString>Template for Vector data in ISO19139 (preferred!)</gco:CharacterString>
+            <gco:CharacterString>fobblers foibblers</gco:CharacterString>
           </gmd:title>
           <gmd:date>
             <gmd:CI_Date>

This example shows that the editor has made a change to the title and the dateStamp.

Note these examples were created using command line subversion tools. The viewvc subversion repository tool has a graphical interface that allows side-by-side comparison of changes/differences between files.

All XML files describing the properties of the metadata record are generated by SELECT statements on the relevant tables in the database.
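A minimal sketch of that last step, turning the rows returned by such a query into the privileges.xml layout shown earlier, might look like this. The PrivilegeRow and PrivilegesXml classes are hypothetical and the query itself is omitted; the element names are the ones in the example file.

```java
import java.util.List;

/** Hypothetical holder for one row of a privileges query result. */
class PrivilegeRow {
    final String groupName;
    final int operationId;
    final String operationName;

    PrivilegeRow(String groupName, int operationId, String operationName) {
        this.groupName = groupName;
        this.operationId = operationId;
        this.operationName = operationName;
    }
}

/** Serializes privilege rows using the element names from the privileges.xml example. */
class PrivilegesXml {

    /** Escapes the characters that are not allowed in XML text content. */
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Writes one record element per row, in the order the rows are given. */
    static String toXml(List<PrivilegeRow> rows) {
        StringBuilder xml = new StringBuilder("<response>\n");
        for (PrivilegeRow row : rows) {
            xml.append("  <record>\n");
            xml.append("    <group_name>").append(escape(row.groupName)).append("</group_name>\n");
            xml.append("    <operation_id>").append(row.operationId).append("</operation_id>\n");
            xml.append("    <operation_name>").append(escape(row.operationName)).append("</operation_name>\n");
            xml.append("  </record>\n");
        }
        return xml.append("</response>\n").toString();
    }
}
```

Since subversion diffs these files line by line, a stable row order matters: if the rows came back in a different order each time, the diff would show changes even when the privileges had not changed.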

Finally, as mentioned above, metadata records are not automatically versioned, as this would impose too much overhead and is not always necessary (eg. for harvested record sets). This proposal therefore also adds the capability for the user to select a single record or a set of records to be versioned. These interfaces are available using the usual methods.

Metadata fragments (from directories local to GeoNetwork or from external URLs on the internet) can be linked into metadata records to support reuse.

The current patch will support versioning resolved records only. This means that all metadata fragments (both local and external) will be resolved before the record is versioned.

The user will be able to switch on versioning for fragments held in directories in the local GeoNetwork catalog in much the same way as they can for metadata records (see user interface examples above). At the moment, a change made to a local fragment will not force a new version of any record that links this fragment. Instead these changes will be picked up next time the record or its properties are changed.

Patch

Patch attached to ticket #726

Backwards Compatibility Issues

None - this function can be configured off if not required.

New libraries added

tmatesoft subversion api in Java

<!-- svnkit stuff -->
<dependency>
  <groupId>org.tmatesoft.svnkit</groupId>
  <artifactId>svnkit</artifactId>
  <version>1.3.6-v1</version>
</dependency>

This jar is available from the maven repository at  http://maven.tmatesoft.com/content/repositories/releases.

Risks

Risks?

Participants

  • Simon Pigot
