Multilingual element indexing mechanism
Date | 2008/08/13 |
Contact(s) | fxprunayre |
Last edited | Timestamp |
Status | draft |
Assigned to release | to be determined |
Resources | Resource available |
Overview
This proposal describes a multilingual indexing mechanism that indexes information with language-specific parameters (stopword list, analyzer). One index is created per language.
A metadata record has:
- one main language (for ISO, FGDC, DC)
- some elements in multiple languages (ISO only)
To improve search results, a filter replaces accented characters in the ISO Latin 1 character set (ISO-8859-1) with their unaccented equivalents. Case is not altered (see org.apache.lucene.analysis.ISOLatin1AccentFilter).
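The effect of such a filter can be illustrated without Lucene. The sketch below is not the ISOLatin1AccentFilter implementation; it approximates the same idea with java.text.Normalizer from the JDK. The class name AccentFolding is hypothetical, and the sample strings use Unicode escapes (\u00e8 is è, \u00c9 and \u00e9 are É and é) to avoid source-encoding issues.

```java
import java.text.Normalizer;

final class AccentFolding {
    // Decompose each character into base letter + combining marks (NFD),
    // then drop the combining marks. Letter case is left untouched.
    static String fold(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(fold("cours d'eau du canton de Gen\u00e8ve"));
        System.out.println(fold("\u00c9t\u00e9"));
    }
}
```

Characters that have no canonical decomposition, such as ß or æ, pass through this sketch unchanged. The same folding has to be applied both when indexing and when parsing the query, otherwise accented and unaccented forms will not match.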
Proposal Type
- Type: Index
- App: GeoNetwork
- Module: Lucene Index
Links
- Documents:
- Email discussions:
- Other wiki discussions:
Voting History
- Vote proposed by X on Y, result was +/-n (m non-voting members).
Motivations
Index multilingual catalogues and multilingual metadata with Lucene, and improve the indexing mechanism and search results.
Proposal
Index structure options:
- 1. one index for all languages
- 2. one index for all languages, with an extra language field so searches can be constrained to a particular language
- 3. separate indices for each language
Option 3 is chosen for the following reasons:
- it allows a specific configuration to be used for each language (stopword list, tokenizer, analyzer)
- it allows the catalogue to hold content in both European and non-European languages
One index per language is created in WEB-INF/lucene/nonspatial_{iso3 language code}.
When indexing a document:
- determine the main language (i.e. for ISO, the one defined in gmd:language)
- apply index-fields to extract non-language-specific information (e.g. date, codelist) and information in the default language
- apply language-index-fields to extract non-language-specific information (e.g. date, codelist) and, for each other language declared in gmd:locale, the language-specific information
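The steps above can be sketched as a function that produces one set of fields per language index for a single record. This is only an illustration under stated assumptions: the class MultilingualIndexPlan, its method, and the field names are hypothetical, and plain maps stand in for the output of the index-fields and language-index-fields stylesheets.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

final class MultilingualIndexPlan {
    /**
     * Returns index name -> fields to write for one record.
     * common:    language-neutral fields (dates, codelists), copied into every index.
     * localized: iso3 language code -> language-specific fields (title, keyword, ...).
     */
    static Map<String, Map<String, String>> plan(String mainLanguage,
                                                 Map<String, String> common,
                                                 Map<String, Map<String, String>> localized) {
        Map<String, Map<String, String>> perIndex = new LinkedHashMap<>();
        // Main language first: language-neutral fields plus default-language content.
        Map<String, String> mainFields =
                localized.getOrDefault(mainLanguage, Collections.<String, String>emptyMap());
        perIndex.put("nonspatial_" + mainLanguage, merge(common, mainFields));
        // Then one entry per additional language declared for the record.
        for (Map.Entry<String, Map<String, String>> entry : localized.entrySet()) {
            if (!entry.getKey().equals(mainLanguage)) {
                perIndex.put("nonspatial_" + entry.getKey(), merge(common, entry.getValue()));
            }
        }
        return perIndex;
    }

    private static Map<String, String> merge(Map<String, String> common,
                                             Map<String, String> specific) {
        Map<String, String> fields = new LinkedHashMap<>(common);
        fields.putAll(specific);
        return fields;
    }
}
```

Because the language-neutral fields are copied into every index, a search on a date or codelist value can match the same record in several indices, which is why duplicate filtering is needed at search time.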
The same document can be stored in several indices. In some situations a search could return duplicates, so a DuplicateFilter is added to the search.
In general, if a user searches from the English user interface, the English index is boosted so that results from that index are returned on top.
For French and German, FrenchAnalyzer and GermanAnalyzer are used to write information to the index. Otherwise a StandardAnalyzer is used.
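The per-language choice of analyzer and index directory can be sketched as a small lookup. To stay runnable without Lucene on the classpath, the sketch returns the analyzer class name as a string instead of an instance. The class LanguageIndexSettings and its methods are hypothetical, and the codes "fra" and "deu" are assumed to be the iso3 codes used for French and German, as in the directory names of this proposal.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

final class LanguageIndexSettings {
    // Analyzer class name for an iso3 language code; StandardAnalyzer is the fallback.
    static String analyzerFor(String iso3) {
        switch (iso3) {
            case "fra":
                return "org.apache.lucene.analysis.fr.FrenchAnalyzer";
            case "deu":
                return "org.apache.lucene.analysis.de.GermanAnalyzer";
            default:
                return "org.apache.lucene.analysis.standard.StandardAnalyzer";
        }
    }

    // Index directory for a language: WEB-INF/lucene/nonspatial_{iso3 language code}.
    static Path indexDirectory(Path webInf, String iso3) {
        return webInf.resolve("lucene").resolve("nonspatial_" + iso3);
    }

    public static void main(String[] args) {
        for (String lang : new String[] {"eng", "fra", "deu"}) {
            System.out.println(indexDirectory(Paths.get("WEB-INF"), lang)
                    + " uses " + analyzerFor(lang));
        }
    }
}
```

The analyzer used to write a given index should also be used to parse queries against that index, so that stopword removal and stemming are applied consistently on both sides.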
Known issue
- The sort by title option will not work properly when a record is found in an index that does not contain its title.
Backwards Compatibility Issues
Risks
Participants
- Jesse
- Francois
More info
Option 1 is the option used by GeoNetwork :
- <language>fra
- <title>cours d'eau du canton de Genève
- <keyword>cours d'eau
- <any>fiume
- <any>rivers
- <any>cours d'eau
- <any>Geneva
- <any>canton
Requirement: add an index field for the language.
Option 2 could allow term search in one language. The index field structure for one document could be:
- <language>fra
- <title>cours d'eau du canton de Genève
- ...
- <title_eng>rivers in Geneva canton
- <title_ita>fiume en Geneva canton
- <keyword>cours d'eau
- <keyword_ita>fiume
- <any>fiume
- <any>rivers
- <any>cours d'eau
- <any>Geneva
- <any>canton
But for option 2, European and non-European languages (Chinese, Japanese, Korean) could be mixed in the index, and search results could be inconsistent because these languages need different analyzers. Storing fields with a language tag complicates BooleanQuery creation, and advanced Lucene functionality (e.g. stopword lists) could not be used.
With option 3, the main language defines in which index the record is stored. Index content is similar to option 1, but a specific Analyzer and language-specific options can be set up. From a community perspective we should probably focus on an implementation of option 3 with MultiSearcher support, based on one index per language, in order to improve support for multilingual catalogues. Option 3 will help the implementation of the « narrow your search » functionality (stopword list, ...).
Advanced functionality (stop words, scoring over multiple languages, multi-analyzer support) is not used in GeoNetwork for now.
Index
One index per language is created, and a language-specific analyzer can be defined (e.g. FrenchAnalyzer, GermanAnalyzer provided by Lucene). The Lucene index is stored in the WEB-INF/lucene directory:
lucene
 +-- nonspatial_eng
 +-- nonspatial_fra
 +-- nonspatial_deu
Metadata is indexed in the default language using index-fields.xsl, and multilingual content is indexed using language-index-fields.xsl, which extracts all fragments to be stored in the index.
More details on the indexing mechanism: MultilingualIndexMechanism.
Search
Search is done using a MultiSearcher (i.e. across all indices), and the index corresponding to the GUI language is boosted so that its results come out on top. A duplicate filter removes duplicates from the search results, since one record can appear in more than one index.
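The combined effect of boosting and duplicate filtering can be sketched on plain lists of hits. This is not how MultiSearcher or DuplicateFilter are implemented in Lucene; it is a library-free illustration, and the class MultiIndexSearch, the Hit type, and the boost factor passed in are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class MultiIndexSearch {
    static final class Hit {
        final String recordId;
        final String index;
        final double score;

        Hit(String recordId, String index, double score) {
            this.recordId = recordId;
            this.index = index;
            this.score = score;
        }
    }

    // Boost hits from the GUI-language index, keep the best hit per record,
    // and return the survivors ordered by descending score.
    static List<Hit> merge(List<Hit> hits, String guiLanguage, double boost) {
        String preferredIndex = "nonspatial_" + guiLanguage;
        Map<String, Hit> bestPerRecord = new LinkedHashMap<>();
        for (Hit hit : hits) {
            double score = hit.index.equals(preferredIndex) ? hit.score * boost : hit.score;
            Hit current = bestPerRecord.get(hit.recordId);
            if (current == null || score > current.score) {
                bestPerRecord.put(hit.recordId, new Hit(hit.recordId, hit.index, score));
            }
        }
        List<Hit> result = new ArrayList<>(bestPerRecord.values());
        result.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return result;
    }
}
```

Keeping the best-scoring copy of each record means that, for a record present in several indices, the copy from the GUI-language index tends to be the one shown. This also relates to the known issue with sorting by title: the surviving copy may come from an index that does not hold the title.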