Performance Improvements for catalogs with 10s of thousands of records

Date 2010/03/06
Contact(s) simonp
Status in progress, complete
Assigned to release 2.5
Resources Not applicable

Overview

A number of performance issues have been found when GeoNetwork is applied to catalogs containing 10s of thousands of records:

  • search speed: search time degraded with the size of the result set, i.e. searches with large result sets took a long time to return, whereas a search on the same catalog that returned a small result set was very fast. This was surprising because in both cases the search only needed to return the first page of results.
  • indexing speed: importing 10s of thousands of records via batch import, harvesters, massive operations and CSW transactions took too long, e.g. a local file system harvest of 20k records took 10-12 hours.
  • startup delay: on startup GeoNetwork checks that the Lucene index holds the latest content from the database. For many thousands of records, startup time can exceed 10 minutes.

Proposal Type

  • Type: Core Change
  • App: GeoNetwork
  • Module: Harvester, Kernel, Data Manager, Metadata Import, Lucene Index, Search results, Massive Operations and CSW Transaction support

Voting History

  • Proposed for Voting on March 29, 2010 - Motion passed: Jeroen, Emanuele, Francois, Simon

Motivations

Make GN perform better for catalogs with 10s of thousands of records.

Proposal

Addressing the search speed issue: Investigation found that the Lucene component of searches in GeoNetwork is in fact very fast. The delay in returning search results came from processing all results to gather the most frequently used keywords for the search summary. Since all hits were processed no matter how large the result set, a large result set could cause a long delay before the first page of search results reached the user. A simple fix is to limit the number of hits that are processed to build the keyword frequency info for the search summary. This limit is specified on search services as the parameter maxRecordsInKeywordSummary and has been set to 1000. The LuceneSearcher.java code then examines no more than that number of hits when building the keyword frequency info. 1000 is an arbitrary number that sites can change according to the number of keywords used in their metadata records and the delay they consider acceptable for search results to be returned to the user.
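The capping idea can be sketched in a few lines of Java. This is only an illustration, not the LuceneSearcher.java code: the class and method names are hypothetical, and each hit is reduced to the list of keywords of one record so that the sketch runs without Lucene. Only the parameter name maxRecordsInKeywordSummary comes from the proposal.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the summary step; not the actual LuceneSearcher code.
class KeywordSummarySketch {

    /**
     * Counts keyword frequencies over at most maxRecordsInKeywordSummary hits.
     * Each hit is represented here as the list of keywords of one record.
     */
    static Map<String, Integer> keywordFrequencies(List<List<String>> hits,
                                                   int maxRecordsInKeywordSummary) {
        Map<String, Integer> counts = new HashMap<>();
        // Cap the work: hits beyond the limit are never examined.
        int limit = Math.min(hits.size(), maxRecordsInKeywordSummary);
        for (int i = 0; i < limit; i++) {
            for (String keyword : hits.get(i)) {
                counts.merge(keyword, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Simulate a large result set; only the first 1000 hits are summarised.
        List<List<String>> hits = new ArrayList<>();
        for (int i = 0; i < 50000; i++) {
            hits.add(Arrays.asList("ocean", i % 2 == 0 ? "climate" : "geology"));
        }
        System.out.println(keywordFrequencies(hits, 1000));
    }
}
```

The cost of the summary is then bounded by the limit rather than by the size of the result set, at the price of keyword counts that reflect only the first hits examined.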

Addressing the indexing/loading issue: Investigation of this problem identified two issues:

  • Speeding up Lucene indexing: GeoNetwork was opening and closing the Lucene IndexWriter every time it wrote a document to the search index. This is a very safe way to handle the Lucene IndexWriter, as only one IndexWriter can be open at a time. However, the IndexWriter class in Lucene is now much more sophisticated than it was when GeoNetwork was first written. In particular, it is thread safe and can buffer documents in RAM before writing them out to the index file on disk with its own thread. Using the IndexWriter in this way without forcing major changes in the GeoNetwork code led to an IndexWriter facade that allows:
    • code that intends to write a number of documents to the Lucene index to keep the IndexWriter open, and thus take advantage of the more sophisticated IndexWriter implementation
    • multiple threads to schedule documents that need to be written to the index without blocking
    • optimization of the Lucene index for search speed to be moved to a separate thread, which the ref counting IndexWriter factory makes possible. At present optimization takes place after a time interval (1 second) has passed or 10 updates to the index have been made, and its cost slows down whatever operation happens to trigger it when the index writer is closed. For catalogs with 10s of thousands of records, optimizing the search index for maximum search speed is a very costly operation. In a separate thread the cost of the operation is less visible to search operations, and a config option (see the Administration>System Configuration menu) controls the frequency of optimizations.

  • Speeding up spatial indexing using PostGIS: GeoNetwork uses a shapefile to hold spatial extents for searches that contain spatial queries, e.g. touch, intersect, contains, overlap etc. At present only the CSW service uses the spatial index for these queries; the web search interface uses boxes and ranges in Lucene. The spatial index needs to be maintained when records are added and deleted through import, harvesting, massive delete etc. Unfortunately the shapefile is not efficient for this purpose once the catalog grows beyond roughly 40,000 records. In particular, the mechanism for deleting extents from the shapefile uses an attribute of the extent, and indexed access to these attributes has not yet been integrated with GeoTools. This means that there is a considerable cost for maintenance operations on the shapefile. To support fast maintenance and search of the spatial index for larger catalogs, it was decided to adopt the PostGIS implementation of the spatial index written for the geocat.ch sandbox by Jesse Eichar, and to fall back to a shapefile when the catalog is not using PostGIS for its database. An option has been added to GAST to allow the user to specify PostGIS as the database (the spatial index table will be built by the create-db-postgis.sql script when the Database>Setup option is used). With the adoption of GeoTools 2.6.2, it should also be possible to hold the spatial index in Oracle for those who must use that database (not present as yet).
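The reference counting behind the IndexWriter facade in the first point above can be sketched as follows. This is a minimal illustration under stated assumptions, not the GeoNetwork implementation: all class and method names are hypothetical, and plain in-memory lists stand in for the Lucene IndexWriter and the index on disk so that the sketch runs without Lucene.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a reference-counted writer facade. In-memory lists
// stand in for the Lucene IndexWriter and its on-disk index.
class RefCountedIndexWriterSketch {

    private final List<String> committed = new ArrayList<>(); // stands in for the index on disk
    private final List<String> buffer = new ArrayList<>();    // stands in for the open writer's RAM buffer
    private int refCount = 0; // number of callers currently holding the writer
    private int opens = 0;    // how often the underlying writer was really opened

    /** Opens the underlying writer on first use; later callers share it. */
    synchronized void openWriter() {
        if (refCount == 0) {
            opens++;
        }
        refCount++;
    }

    /** Queues a document on the shared writer; requires a prior openWriter(). */
    synchronized void addDocument(String doc) {
        if (refCount == 0) {
            throw new IllegalStateException("writer is not open");
        }
        buffer.add(doc);
    }

    /** Releases one reference; the buffer is flushed only when the last user is done. */
    synchronized void closeWriter() {
        if (refCount == 0) {
            throw new IllegalStateException("writer is not open");
        }
        refCount--;
        if (refCount == 0) {
            committed.addAll(buffer);
            buffer.clear();
        }
    }

    synchronized int committedCount() { return committed.size(); }

    synchronized int timesOpened() { return opens; }

    public static void main(String[] args) throws InterruptedException {
        final RefCountedIndexWriterSketch facade = new RefCountedIndexWriterSketch();
        Runnable harvest = new Runnable() {
            public void run() {
                facade.openWriter(); // keep the writer open for the whole batch
                for (int i = 0; i < 100; i++) {
                    facade.addDocument(Thread.currentThread().getName() + "-" + i);
                }
                facade.closeWriter();
            }
        };
        Thread a = new Thread(harvest, "harvester-a");
        Thread b = new Thread(harvest, "harvester-b");
        a.start();
        b.start();
        a.join();
        b.join();
        System.out.println("documents committed: " + facade.committedCount());
        System.out.println("underlying writer opened: " + facade.timesOpened() + " time(s)");
    }
}
```

A batch operation calls openWriter() once, adds all its documents and calls closeWriter() once, so the expensive open and close happen per batch rather than per document. In this sketch the flush on the last close is also the natural place to hand optimization off to a separate thread.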

The net result of these two fixes is much faster load, harvest, reindex and massive operations in GeoNetwork. For example, in one case doing a file system harvest of 20,000 records was taking 10-12 hours without these modifications. With the modifications described in this proposal, the same harvest now takes approx 30 minutes.

Addressing the startup delay issue: profiling the DataManager init method showed that a very large amount of time was being spent processing the list of JDOM Elements returned from the database. Changing the code to retrieve a number child Element speeds up this code approx 10x, e.g. the startup delay for GeoNetwork with approximately 160,000 records drops from approx 500-600 seconds to 50 seconds. In addition, the memory requirements of this method were reduced by returning just the contents of the index _changeDate field, and speed was helped a little by using Lucene field selectors to retrieve only the _changeDate field from the Documents in the index.
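The startup check itself can be pictured as a comparison of change dates. The Java sketch below is hypothetical and is not the DataManager code: it assumes that both the database and the index can supply a map from record id to change date (the index side being the _changeDate field mentioned above), and it uses plain maps so that it runs without Lucene or a database.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the startup consistency check; names are illustrative.
class IndexSyncSketch {

    /**
     * Returns the ids of records whose change date in the database is missing
     * from the index or differs from the indexed _changeDate value.
     */
    static List<String> recordsToReindex(Map<String, String> dbChangeDates,
                                         Map<String, String> indexChangeDates) {
        List<String> stale = new ArrayList<>();
        for (Map.Entry<String, String> entry : dbChangeDates.entrySet()) {
            String indexed = indexChangeDates.get(entry.getKey());
            // Only the change date is compared; no other field is needed.
            if (indexed == null || !indexed.equals(entry.getValue())) {
                stale.add(entry.getKey());
            }
        }
        return stale;
    }

    public static void main(String[] args) {
        Map<String, String> db = new LinkedHashMap<>();
        db.put("1", "2010-03-01T10:00:00");
        db.put("2", "2010-03-05T09:30:00");
        db.put("3", "2010-03-06T12:00:00");

        Map<String, String> index = new LinkedHashMap<>();
        index.put("1", "2010-03-01T10:00:00"); // up to date
        index.put("2", "2010-02-01T00:00:00"); // older than the database copy

        System.out.println("records to reindex: " + recordsToReindex(db, index));
    }
}
```

Because only one short string per record is held on each side, memory use stays small even for catalogs with well over 100,000 records.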

Backwards Compatibility Issues

  • None known

Risks

  • Lucene index corruption if the thread-safe IndexWriter implementation has bugs?

Participants

  • Doug Nebert, Archie Warnock and team - testing and reporting search speed
  • Craig Jones and eMII team - testing and reporting search speed and index writing issues
  • Timo Proescholdt - provided some helpful timing analysis for the search problem
  • Jose Garcia - provided some timing of Lucene Index Writer speed ups and provided a design pattern example to improve implementation
  • geocat.ch developers provided changes to spatial index code necessary to support PostGIS
  • Francois-Xavier Prunayre for some additional fixes

Attachments