Changes between Version 2 and Version 3 of PerformanceEnhancements2


Timestamp:
Mar 29, 2010, 2:47:56 AM (14 years ago)
Author:
simonp
Comment:

--

  • PerformanceEnhancements2

    v2 v3  
    41 41      * code that intends to write a number of documents to the Lucene Index to keep the !IndexWriter open, and thus take advantage of the more sophisticated !IndexWriter implementation
    42 42      * multiple threads to schedule documents that need to be written to the index without blocking
    43       * with a ref counting IndexWriter factory, it is possible to move optimization of the Lucene Index for search speed to a separate thread. At present optimization takes place after a time interval (1 second) or 10 updates to the index have been made and the cost of the optimization will slow down whatever operation happens to trigger it when the index writer is closed. For catalogs with 10s of thousands of records, optimizing the search index for maximum search speed is a very costly operation. In a separate thread the cost of the operation is less visible to search operations and with a config option (see the Administration->System Configuration menu) the frequency of optimizations can be controlled.
     43      * with a ref counting !IndexWriter factory, it is possible to move optimization of the Lucene Index for search speed to a separate thread. At present optimization takes place after a time interval (1 second) or 10 updates to the index have been made and the cost of the optimization will slow down whatever operation happens to trigger it when the index writer is closed. For catalogs with 10s of thousands of records, optimizing the search index for maximum search speed is a very costly operation. In a separate thread the cost of the operation is less visible to search operations and with a config option (see the Administration>System Configuration menu) the frequency of optimizations can be controlled.
    44 44
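The ref counting factory described in the list above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not GeoNetwork's implementation: `DocWriter` is a hypothetical stand-in for Lucene's IndexWriter (it is not the Lucene API), and `RefCountingWriterFactory`, `acquire`, `release` and `openCount` are names invented for this example. The idea shown is only that the writer is opened on first use, shared by every caller, and closed when the last caller releases it.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for an index writer; NOT the real Lucene API. */
interface DocWriter {
    void addDocument(String doc);
    void close();
}

/**
 * Reference-counted factory (illustrative names): the underlying writer is
 * opened by the first acquire() and closed by the last release().
 */
final class RefCountingWriterFactory {
    private final Supplier<DocWriter> opener; // how to open a new writer
    private DocWriter writer;                 // guarded by "this"
    private int refCount;                     // guarded by "this"

    RefCountingWriterFactory(Supplier<DocWriter> opener) {
        this.opener = opener;
    }

    /** Returns the shared writer, opening it if nobody holds it yet. */
    synchronized DocWriter acquire() {
        if (refCount == 0) {
            writer = opener.get();
        }
        refCount++;
        return writer;
    }

    /** Gives up one reference; closes the writer when none remain. */
    synchronized void release() {
        if (refCount == 0) {
            throw new IllegalStateException("release() without acquire()");
        }
        refCount--;
        if (refCount == 0) {
            writer.close();
            writer = null;
        }
    }

    /** Number of callers currently holding the writer. */
    synchronized int openCount() {
        return refCount;
    }
}
```

Callers would typically pair `acquire()` with `release()` in a `try`/`finally` block. Because the close happens only when the count reaches zero, any work tied to closing the writer (such as the optimization mentioned above) could be handed to a separate thread instead of being paid for by whichever caller happens to release last.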
    45  * Speeding up spatial indexing using PostGIS: !GeoNetwork uses a shapefile to hold spatial extents for searches that contain spatial queries eg. touch, intersect, contains, overlap etc. At present only the CSW service uses the spatial index for these queries, the web search interface uses boxes and ranges in Lucene. The spatial index needs to be maintained when records are added and deleted through import, harvesting, massive delete etc. Unfortunately the shapefile is not efficient for this purpose as the number of records in the catalog goes over 40,000 odd. In particular as the mechanism for deleting extents from the shapefile uses an attribute of the extent and these cannot be indexed. This means that there is a considerable cost for maintenance operations on the shapefile. To support fast maintenance and search of the spatial index for larger catalogs, it was decided to adopt the PostGIS implementation for the spatial index written for the geocat.ch sandbox by Jessie Eichar and to fall back to using a shapefile when the catalog is not using PostGIS for its database. An option has been added to GAST to allow the user to specify PostGIS as the database (the spatial index table will be built by the create-db-postgis.sql script when the Database->Setup option is used). When !GeoTools 2.6.x is adopted, we will very likely be able to also allow the spatial index in Oracle for those who must use that.
     45 * Speeding up spatial indexing using PostGIS: !GeoNetwork uses a shapefile to hold spatial extents for searches that contain spatial queries eg. touch, intersect, contains, overlap etc. At present only the CSW service uses the spatial index for these queries, the web search interface uses boxes and ranges in Lucene. The spatial index needs to be maintained when records are added and deleted through import, harvesting, massive delete etc. Unfortunately the shapefile is not efficient for this purpose as the number of records in the catalog goes over 40,000 odd. In particular as the mechanism for deleting extents from the shapefile uses an attribute of the extent and these cannot be indexed. This means that there is a considerable cost for maintenance operations on the shapefile. To support fast maintenance and search of the spatial index for larger catalogs, it was decided to adopt the PostGIS implementation for the spatial index written for the geocat.ch sandbox by Jessie Eichar and to fall back to using a shapefile when the catalog is not using PostGIS for its database. An option has been added to GAST to allow the user to specify PostGIS as the database (the spatial index table will be built by the create-db-postgis.sql script when the Database>Setup option is used). With the adoption of !GeoTools 2.6.2, we should be able to also allow the spatial index in Oracle for those who must use that (not present as yet).
    46 46
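The maintenance cost described for the shapefile comes from deleting by an attribute that cannot be indexed. The sketch below is a simplified model of that difference, not GeoNetwork, GeoTools or PostGIS code: `Extent`, `ScanStore` and `KeyedStore` are hypothetical classes invented for this illustration. `ScanStore` has to visit every record to delete one extent, while `KeyedStore` removes it with a single keyed lookup, which is roughly what an indexed database column makes possible.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical extent record: a metadata id plus a bounding box. */
final class Extent {
    final String metadataId;
    final double minX, minY, maxX, maxY;

    Extent(String metadataId, double minX, double minY, double maxX, double maxY) {
        this.metadataId = metadataId;
        this.minX = minX;
        this.minY = minY;
        this.maxX = maxX;
        this.maxY = maxY;
    }
}

/** Models a store with no attribute index: a delete visits every record. */
final class ScanStore {
    private final List<Extent> extents = new ArrayList<>();

    void add(Extent e) {
        extents.add(e);
    }

    /** Deletes matching extents and returns how many records were visited. */
    int delete(String metadataId) {
        int visited = 0;
        for (Iterator<Extent> it = extents.iterator(); it.hasNext();) {
            visited++;
            if (it.next().metadataId.equals(metadataId)) {
                it.remove();
            }
        }
        return visited;
    }

    int size() {
        return extents.size();
    }
}

/** Models a store indexed on the id: a delete is one keyed lookup. */
final class KeyedStore {
    private final Map<String, Extent> byId = new HashMap<>();

    void add(Extent e) {
        byId.put(e.metadataId, e);
    }

    /** Returns true if an extent with this id existed and was removed. */
    boolean delete(String metadataId) {
        return byId.remove(metadataId) != null;
    }

    int size() {
        return byId.size();
    }
}
```

With tens of thousands of extents, every delete in the scan model touches the whole store, so bulk operations such as harvesting or massive delete grow roughly quadratically in the number of records, whereas the keyed model stays close to linear.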
    47 47 The net result of these two fixes is much faster load, harvest, reindex and massive operations in !GeoNetwork. For example, in one case doing a file system harvest of 20,000 records was taking 10-12 hours without these modifications. With the modifications described in this proposal, the same harvest now takes approx 30 minutes. 
     
    50 50
    51 51 === Backwards Compatibility Issues ===
    52  * Single (or a few records?) transaction in CSW needs to be examined to make sure its not slower
     52 * None known
    53 53
    54 54 == Risks ==