Proposal title
Date | 2010/03/06
Contact(s) | simonp
Last edited | Timestamp
Status | in progress, complete
Assigned to release | 2.5
Resources | Not applicable
Overview
A number of performance issues have been found when GeoNetwork is applied to catalogs containing tens of thousands of records:
- search speed: search speed degrades badly with the size of the result set, i.e. searches returning large result sets took much longer to process than searches returning small result sets
- indexing speed: importing tens of thousands of records via batch import, harvesters, massive operations and CSW transactions was taking too long, e.g. a local file system harvest of 20,000 records was taking 10-12 hours
Proposal Type
- Type: Core Change
- App: GeoNetwork
- Module: Harvester, Kernel, Data Manager, Metadata Import, Lucene Index, Search results, Massive Operations and CSW Transaction support
Links
- Documents:
- Email discussions:
- Other wiki discussions:
Voting History
- Vote proposed by X on Y, result was +/-n (m non-voting members).
Motivations
Make GeoNetwork perform better for catalogs containing tens of thousands of records.
Proposal
Addressing the search speed issue: investigation found that the Lucene component of searches in GeoNetwork is in fact very fast. The delay in returning search results came from processing all results to gather the most frequently used keywords for the search summary. Since every hit was processed no matter how large the result set, large result sets could significantly delay the first page of search results reaching the user. A simple fix is to limit the number of hits processed when building the keyword frequency information for the search summary. This limit is specified on search services as the maxRecordsInKeywordSummary parameter and has been set to 1000; the LuceneSearcher.java code then examines at most that many hits when building the keyword frequency information for the search summary. 1000 is an arbitrary default that sites can change according to the number of keywords used in their metadata records and the delay considered acceptable for returning search results to the user.
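To illustrate the approach (this is a minimal sketch, not the actual LuceneSearcher.java code), the fragment below caps the number of hits examined when building the keyword frequency summary. The maxRecordsInKeywordSummary limit comes from the proposal; the "keyword" field name, class name and surrounding Lucene setup are assumptions for illustration only:
{{{
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.document.Document;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;

public class KeywordSummaryExample {

    /**
     * Builds a keyword frequency map from at most maxRecordsInKeywordSummary
     * hits, so large result sets no longer delay the first page of results.
     * The "keyword" field name is hypothetical.
     */
    static Map<String, Integer> buildKeywordSummary(IndexSearcher searcher,
            TopDocs hits, int maxRecordsInKeywordSummary) throws IOException {
        Map<String, Integer> frequencies = new HashMap<String, Integer>();
        int limit = Math.min(hits.scoreDocs.length, maxRecordsInKeywordSummary);
        for (int i = 0; i < limit; i++) {
            Document doc = searcher.doc(hits.scoreDocs[i].doc);
            for (String keyword : doc.getValues("keyword")) {
                Integer count = frequencies.get(keyword);
                frequencies.put(keyword, count == null ? 1 : count + 1);
            }
        }
        return frequencies;
    }
}
}}}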
Addressing the indexing/loading issue: investigation of this problem revealed two issues, each with a solution:
- Speeding up Lucene indexing: GeoNetwork was opening and closing the Lucene IndexWriter every time it wrote a document to the search index. This is a very safe way to handle the Lucene IndexWriter, as only one IndexWriter can be open at a time. However, the IndexWriter class in Lucene is now much more sophisticated than it was when GeoNetwork was first written. In particular, it is thread safe and can buffer documents in RAM before writing them out to the index file on disk with its own thread. To use the IndexWriter in this way without forcing major changes in the GeoNetwork code, an IndexWriter facade was written (see the sketch after this list) that allows:
- code that intends to write a number of documents to the Lucene index to keep the IndexWriter open, and thus take advantage of the more sophisticated IndexWriter implementation
- multiple threads to schedule documents that need to be written to the index without blocking
- Speeding up spatial indexing: GeoNetwork uses a shapefile to hold spatial extents for searches that contain spatial queries, e.g. touch, intersect, contains, overlap etc. At present only the CSW service uses the spatial index for these queries; the web search interface uses bounding boxes and Lucene.
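A minimal sketch of the facade idea follows. It is illustrative only, written against a Lucene 2.9-era API; the class name, buffer size and constructor details are assumptions, not the actual GeoNetwork implementation. The point is that a single long-lived IndexWriter is shared, so callers avoid the open/close cost per document and multiple threads can add documents concurrently:
{{{
import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

/**
 * Hypothetical facade around a single long-lived IndexWriter. IndexWriter is
 * thread safe, so multiple threads may call addDocument() concurrently while
 * the writer buffers documents in RAM and flushes them to disk itself.
 */
public class IndexWriterFacade {

    private final IndexWriter writer;

    public IndexWriterFacade(File indexDir) throws IOException {
        Directory directory = FSDirectory.open(indexDir);
        writer = new IndexWriter(directory, new StandardAnalyzer(),
                IndexWriter.MaxFieldLength.UNLIMITED);
        // Buffer documents in RAM before flushing; 48 MB is an arbitrary choice.
        writer.setRAMBufferSizeMB(48.0);
    }

    /** Safe to call from multiple threads; no external locking needed. */
    public void addDocument(Document doc) throws IOException {
        writer.addDocument(doc);
    }

    /** Make buffered documents visible to searchers. */
    public void commit() throws IOException {
        writer.commit();
    }

    /** Close once, when all indexing work is finished. */
    public void close() throws IOException {
        writer.close();
    }
}
}}}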
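To make the spatial predicates concrete, the sketch below uses the JTS geometry library to evaluate the kinds of relationships a spatial query asks of each stored extent. The polygons are invented for illustration and this is not the GeoNetwork spatial index code itself:
{{{
import com.vividsolutions.jts.geom.Geometry;
import com.vividsolutions.jts.io.ParseException;
import com.vividsolutions.jts.io.WKTReader;

public class SpatialPredicateExample {

    public static void main(String[] args) throws ParseException {
        WKTReader reader = new WKTReader();
        // A record's stored extent and a search region, both as WKT polygons.
        Geometry recordExtent = reader.read(
                "POLYGON ((0 0, 10 0, 10 10, 0 10, 0 0))");
        Geometry searchRegion = reader.read(
                "POLYGON ((5 5, 15 5, 15 15, 5 15, 5 5))");

        // The predicates a spatial query can apply to each stored extent:
        System.out.println("intersects: " + recordExtent.intersects(searchRegion)); // true
        System.out.println("touches:    " + recordExtent.touches(searchRegion));    // false
        System.out.println("contains:   " + recordExtent.contains(searchRegion));   // false
        System.out.println("overlaps:   " + recordExtent.overlaps(searchRegion));   // true
    }
}
}}}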
Backwards Compatibility Issues
Risks
Participants
- Doug Nebert, Archie Warnock
- Timo Proescholdt (speed analysis report)
Attachments (1)
- svn_5814_patch.txt (117.5 KB) - added 15 years ago.