Refactor harvesting
Date | 2011/03/21 |
Contact(s) | Mathieu |
Last edited | |
Status | draft |
Assigned to release | probably 2.7 |
Resources |
Overview
This is a proposal to refactor the harvesting module. The module is quite old and contains a lot of duplicated code (e.g. the "align" methods). Moreover, the client side is quite complex: many Ajax queries, XSL transformations performed in the browser, outdated JavaScript libraries...
Proposal Type
- Type: GUI Change, Core Change
- App: GeoNetwork
- Module: Harvester, Kernel, Data Manager
Voting History
- Vote proposed by X on Y, result was +/-n (m non-voting members).
Motivations
Remove duplicated code. Use up-to-date libraries. Store harvesting tasks in a dedicated model.
Proposal
Both the server side and the client side will be refactored around an object model and JSON objects. We tried to make this new model as generic as possible so that any new harvester can easily extend it.
This new model uses dedicated database tables to store information about harvesters (HarvestingTask, HarvestingTaskResult...).
A few more details on these tables:
HarvestingTask, HarvestingTaskGroup, HarvestingTaskCategory: The HarvestingTask table stores the relevant information about a harvesting task. The HarvestingTaskGroup and HarvestingTaskCategory tables are used to apply categories and privileges to harvested metadata.
HarvestingTaskResult: Each harvesting result is stored in a dedicated table and can be displayed in a history table. A mechanism could also be implemented to keep only the last N (relevant) results.
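The "last N results" mechanism could look roughly like the sketch below. It is only an illustration under assumptions: the `HarvestingResultHistory` class, the trimmed-down `HarvestingTaskResult` bean and the in-memory deque are hypothetical, and a real implementation would presumably delete old rows from the HarvestingTaskResult table instead.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical, trimmed-down bean: only a few of the HarvestingTaskResult columns.
class HarvestingTaskResult {
    final String dateResult;
    final int total;
    final int added;

    HarvestingTaskResult(String dateResult, int total, int added) {
        this.dateResult = dateResult;
        this.total = total;
        this.added = added;
    }
}

// Hypothetical helper keeping only the N most recent results of one task.
class HarvestingResultHistory {
    private final int maxResults;
    private final Deque<HarvestingTaskResult> results = new ArrayDeque<>();

    HarvestingResultHistory(int maxResults) {
        if (maxResults < 1) {
            throw new IllegalArgumentException("maxResults must be >= 1");
        }
        this.maxResults = maxResults;
    }

    // Record a new result and drop the oldest ones once the limit is exceeded.
    void add(HarvestingTaskResult result) {
        results.addLast(result);
        while (results.size() > maxResults) {
            results.removeFirst();
        }
    }

    // Results ordered from oldest to newest.
    List<HarvestingTaskResult> list() {
        return new ArrayList<>(results);
    }
}
```

With `new HarvestingResultHistory(2)`, adding a third result silently discards the first one, so the history table never grows beyond N entries per task.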
HarvestingTaskConfiguration: Harvester-specific properties and values. Some harvesters have configuration values of their own; these are stored in this table.
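As an illustration of how the attr/val rows could be exposed to a harvester, here is a minimal sketch. The accessor methods and the attribute names used in the usage note ("url", "set") are assumptions made for the example, not part of the proposal.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical view over the HarvestingTaskConfiguration rows (attr, val) of one task.
class HarvestingTaskConfiguration {
    private final Map<String, String> values = new LinkedHashMap<>();

    // One call per (attr, val) row; a later row with the same attr overrides the earlier one.
    void put(String attr, String val) {
        values.put(attr, val);
    }

    // Returns the stored value, or defaultValue when the attribute is absent.
    String get(String attr, String defaultValue) {
        String val = values.get(attr);
        return val != null ? val : defaultValue;
    }

    // Same as get(), but parsed as a boolean ("true" ignoring case means true).
    boolean getBoolean(String attr, boolean defaultValue) {
        String val = values.get(attr);
        return val != null ? Boolean.parseBoolean(val) : defaultValue;
    }
}
```

An OAI-PMH harvester could then read `config.get("url", "")` and `config.get("set", "")` without the generic HarvestingTask bean having to know about these properties.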
For now, we have built a new ExtJS-based interface for the OAI-PMH harvester only, and only this harvester has been moved to the new model. Work remains to move all the other harvesters and to get them working in trunk.
CREATE TABLE HarvestingTask (
  id int,
  uuid varchar(250) not null,
  name varchar(32) not null,
  harvestingType varchar(32) not null,
  validationMode varchar(32) not null,
  isrecurrent char(1) default 'n' not null,
  recurrentPeriod int,
  lastRun varchar(24),
  status varchar(32) not null,
  isIncremental char(1) default 'n' not null,
  categoryid int not null,
  logo varchar(32), -- TBD with Logo manager
  primary key(id),
  foreign key(categoryid) references Categories(id),
  unique(uuid)
);

CREATE TABLE HarvestingTaskGroup (
  harvestingTaskId int,
  groupId int,
  primary key(harvestingTaskId, groupId),
  foreign key(harvestingTaskId) references HarvestingTask(id),
  foreign key(groupId) references Groups(id)
);

CREATE TABLE HarvestingTaskCategory (
  harvestingTaskId int,
  categoryId int,
  primary key(harvestingTaskId, categoryId),
  foreign key(harvestingTaskId) references HarvestingTask(id),
  foreign key(categoryId) references Categories(id)
);

CREATE TABLE HarvestingTaskResult (
  harvestingTaskResultId int,
  dateResult varchar(24) not null,
  total int,
  added int,
  updated int,
  unchanged int,
  locallyRemoved int,
  unknownSchema int,
  unretrievable int,
  badFormat int,
  doesNotValidate int,
  ignored int,
  errors text,
  harvestingTaskId int,
  primary key(harvestingTaskResultId),
  foreign key(harvestingTaskId) references HarvestingTask(id)
);

CREATE TABLE HarvestingTaskConfiguration (
  configurationId int,
  attr varchar(24) not null,
  val varchar(250) not null,
  harvestingTaskId int,
  primary key(configurationId),
  foreign key(harvestingTaskId) references HarvestingTask(id)
);

And also a foreign key in the Metadata table:

  foreign key(harvestingTask) references HarvestingTask(id),
Design Overview
A quick overview of the design:
- an AbstractHarvester class. Each harvester extends this abstract class. There is no need for multiple classes per harvester; a single one is enough (e.g. CSWHarvester extends AbstractHarvester).
- a HarvestingTaskManager responsible for managing harvesting tasks.
- runnable tasks, executor services and thread pooling mechanisms based on standard Java concepts: Runnable (java.lang.Runnable); Executors, ScheduledExecutorService and ThreadPoolExecutor (java.util.concurrent).
- an object model (~Java beans) with the following entities: HarvestingTask, HarvestingTaskConfiguration, HarvestingTaskRunMode, HarvestingTaskStatus, AbstractMetadata (-> Metadata, Template, SubTemplate), MetadataAlignerError, MetadataAlignerResult ...
- a MetadataAligner responsible for every import, update and delete of any AbstractMetadata (e.g. metadata, template).
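The relationship between AbstractHarvester, HarvestingTaskManager and the java.util.concurrent classes could be sketched as follows. This is a minimal illustration, not the actual implementation: the method names (`harvest`, `runOnce`, `runEvery`, `getRunCount`) and the empty CSWHarvester body are assumptions.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical base class: one run of a harvesting task is a plain Runnable.
abstract class AbstractHarvester implements Runnable {
    private final AtomicInteger runs = new AtomicInteger();

    // Harvester-specific work (CSW, OAI-PMH, ...), implemented by subclasses.
    protected abstract void harvest() throws Exception;

    @Override
    public final void run() {
        try {
            harvest();
        } catch (Exception e) {
            // Swallow the exception: an uncaught one would cancel a recurrent schedule.
            // A real implementation would record it in the task result.
            System.err.println("Harvesting failed: " + e.getMessage());
        } finally {
            runs.incrementAndGet();
        }
    }

    int getRunCount() {
        return runs.get();
    }
}

// A single class per harvester type is enough.
class CSWHarvester extends AbstractHarvester {
    @Override
    protected void harvest() {
        // Placeholder: query the remote CSW server and hand records to the aligner.
    }
}

// Hypothetical manager wrapping a scheduled thread pool.
class HarvestingTaskManager {
    private final ScheduledExecutorService scheduler;

    HarvestingTaskManager(int poolSize) {
        this.scheduler = Executors.newScheduledThreadPool(poolSize);
    }

    // One-shot task: run as soon as a thread is available.
    ScheduledFuture<?> runOnce(AbstractHarvester harvester) {
        return scheduler.schedule(harvester, 0, TimeUnit.SECONDS);
    }

    // Recurrent task: run now, then every `period` units.
    ScheduledFuture<?> runEvery(AbstractHarvester harvester, long period, TimeUnit unit) {
        return scheduler.scheduleAtFixedRate(harvester, 0, period, unit);
    }

    void shutdown() {
        scheduler.shutdown();
    }
}
```

Catching exceptions inside `run()` matters here because `scheduleAtFixedRate` suppresses all subsequent executions once a run throws.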
Backwards Compatibility Issues
New libraries added
JSON library (Jackson): http://jackson.codehaus.org/
Risks
- At the moment, only the following harvesters are fully implemented (both client and server side): CSW, File system, GeoNetwork, OAI-PMH and WebDAV.
- As this implementation is based on GeoNetwork 2.6.0, its compatibility with newer features (e.g. harvesting history, schema management) is an open question.
- This approach breaks the results out into separate table fields and tries to create a common result object for all harvesters. This might be fine for most of the simpler harvesters, but it might need a little more thought for some of the others (e.g. the Z39.50, THREDDS and WFS Feature harvesters).
- The Java object model implies many modifications to the sources. For instance, the DataManager no longer has multiple methods to handle metadata insert or import, and the same goes for update and delete. We also need to introduce objects such as Group and Category, and dedicated managers...
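To make the last point more concrete, the sketch below shows how a single entry point working on AbstractMetadata could replace separate insert, import and update methods. Everything in it is an assumption made for illustration: the fields, the `align` and `remove` methods, the in-memory map standing in for the catalogue, and the change detection by XML comparison.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical common base type for metadata records, templates and sub-templates.
abstract class AbstractMetadata {
    final String uuid;
    final String xml;

    AbstractMetadata(String uuid, String xml) {
        this.uuid = uuid;
        this.xml = xml;
    }
}

class Metadata extends AbstractMetadata {
    Metadata(String uuid, String xml) { super(uuid, xml); }
}

class Template extends AbstractMetadata {
    Template(String uuid, String xml) { super(uuid, xml); }
}

// Counters named after some columns of the HarvestingTaskResult table.
class MetadataAlignerResult {
    int added;
    int updated;
    int unchanged;
    int locallyRemoved;
}

class MetadataAligner {
    // In-memory stand-in for the catalogue; a real aligner would use the database.
    private final Map<String, AbstractMetadata> local = new HashMap<>();
    private final MetadataAlignerResult result = new MetadataAlignerResult();

    // Single entry point: decides between insert, update and no-op.
    void align(AbstractMetadata remote) {
        AbstractMetadata existing = local.get(remote.uuid);
        if (existing == null) {
            local.put(remote.uuid, remote);
            result.added++;
        } else if (!existing.xml.equals(remote.xml)) {
            // Assumption: a change is detected by comparing the XML content.
            local.put(remote.uuid, remote);
            result.updated++;
        } else {
            result.unchanged++;
        }
    }

    // Remove a record that no longer exists on the remote node.
    void remove(String uuid) {
        if (local.remove(uuid) != null) {
            result.locallyRemoved++;
        }
    }

    MetadataAlignerResult getResult() {
        return result;
    }
}
```

Because `align` accepts any AbstractMetadata, the same code path serves metadata records and templates, which is where the duplicated "align" methods would disappear.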
Participants
- Mathieu Coudert
- Julien Baillagou