Opened 14 years ago

Closed 14 years ago

#336 closed enhancement (fixed)

Lucene / improve configuration

Reported by: Fxp
Owned by: geonetwork-devel@…
Priority: minor
Milestone:
Component: General
Version: v2.6.1
Keywords:
Cc:

Description

Currently, the only Lucene settings GeoNetwork exposes in a configuration file are the tokenized fields; many other parameters are hard-coded in Java. This could be improved by externalizing some of them (e.g. RAMBufferSizeMB, the standard analyzer, per-field analyzers, ...).

Draft of configuration file for discussion:

<?xml version="1.0"?>
<config>
	<index>
		<!-- 
			The amount of memory to be used for buffering documents in memory.
			48MB seems to be plenty for running at least two long 
			 indexing jobs (eg. importing 20,000 records) and keeping disk 
			 activity for lucene index writing to a minimum.
		-->
		<RAMBufferSizeMB>48.0d</RAMBufferSizeMB>
		
		<!-- Determines how often segment indices are merged by addDocument(). -->
		<MergeFactor>10</MergeFactor>
		
		<!-- Default Lucene version to use (mainly for Analyzer creation). -->
		<luceneVersion>29</luceneVersion>
	</index>

	
    <!-- Search parameters are applied at search time and do not need
    an index rebuild in order to be taken into account. -->
    <search>
        <!--
            By default, Lucene computes scores from the search criteria,
            the corresponding result set and the index content.
            For a search with no criteria, Lucene returns the top docs
            in index order (because none is more relevant than the others).

            To change the score computation, a boost function can be
            defined. The boosting query class must be on the classpath.
            * RecencyBoostingQuery promotes recently modified documents

        <boostQuery name="org.fao.geonet.kernel.search.function.RecencyBoostingQuery">
            <Param name="multiplier" type="double" value="2.0"/>
            <Param name="maxDaysAgo" type="int" value="365"/>
            <Param name="dayField" type="java.lang.String" value="_changeDate"/>
        </boostQuery>
    	-->
	</search>


	<!-- Default analyzer to use for all fields not defined in the fieldSpecificAnalyzer section.
		If not set, GeoNetwork uses a default per-field analyzer (i.e. fieldSpecificAnalyzer is not
		taken into account).
		
		Example:
		org.apache.lucene.analysis.fr.FrenchAnalyzer
	-->
	<defaultAnalyzer name="org.apache.lucene.analysis.standard.StandardAnalyzer">
		<Param name="version" type="org.apache.lucene.util.Version"/>		
	</defaultAnalyzer>
	
	<!-- TODO: Add a language specific analyzer -->
	
	
	<!-- Field analyzer
		Define here a specific analyzer for each field stored in the index.
		
		For example, using a different analyzer for "any" (i.e. full text search)
		could be better than a standard analyzer, which has a particular way of
		creating tokens.
		
		For instance, StandardAnalyzer tokenizes "mission AD-T" to "mission", "ad" & "t",
		whereas a WhitespaceTokenizer produces "mission" and "AD-T", which could be
		better in some situations. However, StandardAnalyzer tokenizes "mission AD-34T"
		to "mission" and "ad-34t" because of the number.
		
		doeleman: UUIDs must be case insensitive, as their parts are hexadecimal numbers, which
		are not case sensitive. StandardAnalyzer is recommended for UUIDs.
		
		A list of analyzers is available at http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/Analyzer.html
		Common analyzers:
		* org.apache.lucene.analysis.standard.StandardAnalyzer
		* org.apache.lucene.analysis.WhitespaceAnalyzer
		
		The analyzer must be on the classpath.
		
	-->
	<fieldSpecificAnalyzer>
		<Field name="_uuid" analyzer="org.apache.lucene.analysis.SimpleAnalyzer"/>
		<Field name="parentUuid" analyzer="org.apache.lucene.analysis.SimpleAnalyzer"/>
		<Field name="operatesOn" analyzer="org.apache.lucene.analysis.SimpleAnalyzer"/>
		<Field name="operatesOnIdentifier" analyzer="org.apache.lucene.analysis.SimpleAnalyzer"/>
		
		<Field name="any" analyzer="org.apache.lucene.analysis.standard.StandardAnalyzer">
			<Param name="version" type="org.apache.lucene.util.Version"/>
			<!--<Param name="stopWords" type="java.io.File" value="/path/to/stopwords/stopwords.txt"/>-->
		</Field>
		<Field name="subject" analyzer="org.apache.lucene.analysis.KeywordAnalyzer"/>
	</fieldSpecificAnalyzer>

	<!-- All tokenized Lucene fields must be listed here. Unfortunately, the
		Lucene API cannot tell which fields are tokenized unless documents
		are read, and there may not be an index to read them from. Since
		most fields are not tokenized, a list of the tokenized ones is
		kept here.
	-->
	<tokenized>
		<Field name="any"/>
		<Field name="abstract"/>
		<Field name="title"/>
		<Field name="altTitle"/>
		<Field name="inspiretheme"/>
		<Field name="keywordType"/>
		<Field name="orgName"/>
		<Field name="specificationTitle"/>
		<Field name="levelName"/>
		<!-- from SearchManager/static -->
		<Field name="_uuid"/>
		<Field name="parentUuid"/>
		<Field name="operatesOn"/>
		<Field name="subject"/>
	</tokenized>

</config>
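
As a rough illustration of how the settings in the draft above could be consumed, here is a minimal Python sketch. It is not the Java code from the patch: the function names `parse_java_double` and `load_lucene_config` and the layout of the returned dictionary are assumptions made for this example, and the trailing `d` in `48.0d` is assumed to be a Java-style double suffix. The sample input is an abridged copy of the draft.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the draft configuration above, used as sample input.
SAMPLE = """<config>
  <index>
    <RAMBufferSizeMB>48.0d</RAMBufferSizeMB>
    <MergeFactor>10</MergeFactor>
    <luceneVersion>29</luceneVersion>
  </index>
  <defaultAnalyzer name="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
  <fieldSpecificAnalyzer>
    <Field name="_uuid" analyzer="org.apache.lucene.analysis.SimpleAnalyzer"/>
    <Field name="subject" analyzer="org.apache.lucene.analysis.KeywordAnalyzer"/>
  </fieldSpecificAnalyzer>
  <tokenized>
    <Field name="any"/>
    <Field name="title"/>
  </tokenized>
</config>
"""


def parse_java_double(text):
    """Parse a number that may carry a Java-style 'd'/'D' suffix, e.g. '48.0d'."""
    return float(text.strip().rstrip("dD"))


def load_lucene_config(xml_text):
    """Return the settings of a draft-style config as a plain dictionary."""
    root = ET.fromstring(xml_text)
    default = root.find("defaultAnalyzer")
    return {
        "ram_buffer_size_mb": parse_java_double(root.findtext("index/RAMBufferSizeMB")),
        "merge_factor": int(root.findtext("index/MergeFactor")),
        "lucene_version": root.findtext("index/luceneVersion"),
        # None stands for "no default analyzer set" in this sketch.
        "default_analyzer": default.get("name") if default is not None else None,
        "field_analyzers": {
            field.get("name"): field.get("analyzer")
            for field in root.findall("fieldSpecificAnalyzer/Field")
        },
        "tokenized": [field.get("name") for field in root.findall("tokenized/Field")],
    }


config = load_lucene_config(SAMPLE)
print(config["ram_buffer_size_mb"], config["merge_factor"], config["tokenized"])
```

Keeping the result as a plain dictionary is only a convenience for the example; the point is that every value shown comes from the file rather than from code.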

The proposed patch has the same level of functionality as the current version, plus:

  • RecencyBoostingQuery (disabled by default): to promote recently modified records
  • Reload configuration option: to reload a modified Lucene configuration file (no restart required)
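
For readers unfamiliar with recency boosting, the Python sketch below shows one plausible scoring rule behind the RecencyBoostingQuery item, using the `multiplier` and `maxDaysAgo` parameters from the draft configuration. It assumes that the boost decays linearly with the age of the record, as in the "Lucene in Action" example of the same name; the class in the patch may compute the boost differently. The function name `recency_boosted_score` is hypothetical.

```python
def recency_boosted_score(score, days_ago, multiplier=2.0, max_days_ago=365):
    """Boost `score` for documents changed less than `max_days_ago` days ago.

    Assumed rule: the boost decays linearly from `multiplier` (changed today)
    to 0 (changed `max_days_ago` days ago). Older documents, and documents
    with a date in the future, keep their original score.
    """
    if 0 <= days_ago < max_days_ago:
        boost = multiplier * (max_days_ago - days_ago) / max_days_ago
        return score * (1.0 + boost)
    return score


# Compare the boosted score of a document with base score 1.0 at several ages.
for age in (0, 73, 365, 1000):
    print(age, recency_boosted_score(1.0, age))
```

With the draft values (multiplier 2.0, maxDaysAgo 365), a record changed today would have its score tripled under this rule, while a record older than a year would be unaffected.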

Review and comments welcome.

Attachments (3)

lucene-config.patch (57.0 KB) - added by Fxp 14 years ago.
lucene-config-r6632.patch (58.6 KB) - added by mcoudert 14 years ago.
lucene-config-r6632-with-score.patch (64.7 KB) - added by Fxp 14 years ago.


Change History (7)

by Fxp, 14 years ago

Attachment: lucene-config.patch added

comment:1 by Fxp, 14 years ago

Such a configuration file would make it easier to add and configure new analyzers, such as the recently added (and now default) GeoNetworkAnalyzer: http://osgeo-org.1803224.n2.nabble.com/SF-net-SVN-geonetwork-6615-trunk-web-src-main-java-org-fao-geonet-kernel-search-td5592349.html#a5592349
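
To make the effect of choosing one analyzer over another more tangible, here is a small Python sketch that approximates three of the analyzers named in the draft configuration. These are simplified stand-ins written for illustration, not Lucene code: SimpleAnalyzer is modeled as "split on non-letters and lowercase" (ASCII letters only, a simplification), WhitespaceAnalyzer as "split on whitespace", and KeywordAnalyzer as "keep the whole value as one token". StandardAnalyzer is left out because its grammar is more involved.

```python
import re


def simple_analyzer(text):
    """Approximate SimpleAnalyzer: runs of letters, lowercased (ASCII only here)."""
    return [token.lower() for token in re.findall(r"[A-Za-z]+", text)]


def whitespace_analyzer(text):
    """Approximate WhitespaceAnalyzer: split on whitespace, keep case."""
    return text.split()


def keyword_analyzer(text):
    """Approximate KeywordAnalyzer: the whole value is a single token."""
    return [text]


# Show how the same field values are tokenized by each stand-in.
for value in ("mission AD-T", "mission AD-34T"):
    print(value)
    print("  simple    :", simple_analyzer(value))
    print("  whitespace:", whitespace_analyzer(value))
    print("  keyword   :", keyword_analyzer(value))
```

Note that in this approximation the letter-only rule drops digits, so "AD-34T" and "AD-T" yield the same tokens.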

by mcoudert, 14 years ago

Attachment: lucene-config-r6632.patch added

comment:2 by Fxp, 14 years ago

Add an option for scoring (see #341).

 <search>
    	<!-- Score parameters. Setting these parameters to true affects performance. -->
    	<!-- Set trackDocScores to true if the score needs to be displayed in results using
    		the geonet:info/score element -->
    	<trackDocScores>false</trackDocScores>
    	<trackMaxScore>false</trackMaxScore>
    	
    	<!-- Not used because no Scorer defined -->
    	<docsScoredInOrder>false</docsScoredInOrder>

comment:4 by Fxp, 14 years ago

Resolution: fixed
Status: new → closed