Multilingual element indexing mechanism
Date | 2008/08/13 |
Contact(s) | fxprunayre |
Last edited | Timestamp |
Status | draft |
Assigned to release | to be determined |
Resources | Resource available |
Overview
Multilingual element indexing mechanism.
A metadata record has:
- one main language (for ISO, FGDC, DC)
- some elements in multiple languages (ISO only)
- some elements stored as references using XLink (which could be multilingual, like keywords)
To improve search results, a filter replaces accented characters in the ISO Latin 1 character set (ISO-8859-1) with their unaccented equivalents. Case is not altered (see org.apache.lucene.analysis.ISOLatin1AccentFilter).
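The effect of such an accent filter can be illustrated with a minimal sketch using only the JDK. This is not the Lucene filter itself: it assumes Unicode decomposition via java.text.Normalizer as a stand-in, the class and method names (AccentFolding, fold) are hypothetical, and the result may differ from ISOLatin1AccentFilter for characters such as ligatures.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of GeoNetwork or Lucene.
final class AccentFolding {
    // Decompose each character into base letter + combining marks (NFD),
    // then drop the combining marks. Letter case is left untouched.
    static String fold(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```

Applying the same folding to both the indexed text and the search terms is what lets a query for "Geneve" match a title containing "Genève".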
Proposal Type
- Type: Index
- App: GeoNetwork
- Module: Lucene Index
Links
- Documents:
- Email discussions:
- Other wiki discussions:
Voting History
- Vote proposed by X on Y, result was +/-n (m non-voting members).
Motivations
Proposal
Index structure options:
- 1. one index for all languages
- 2. one index for all languages, with an extra language field so searches can be constrained to a particular language
- 3. separate indices for each language
Option 1 is the option currently used by GeoNetwork:
- <language>fra
- <title>cours d'eau du canton de Genève
- <keyword>cours d'eau
- <any>fiume
- <any>rivers
- <any>cours d'eau
- <any>Geneva
- <any>canton
Requirement: add an index field for the language.
Option 2 could allow a term search in one language. The index field structure for one document could be:
- <language>fra
- <title>cours d'eau du canton de Genève
- ...
- <title_eng>rivers in Geneva canton
- <title_ita>fiume en Geneva canton
- <keyword>cours d'eau
- <keyword_ita>fiume
- <any>fiume
- <any>rivers
- <any>cours d'eau
- <any>Geneva
- <any>canton
With option 2, however, European and non-European languages (Chinese, Japanese, Korean) could be mixed in the same index, and search results could be inconsistent because each language needs a different analyzer. Storing fields with a language tag will cause trouble when building a BooleanQuery, and advanced Lucene functionality (e.g. stop word lists) could not be used.
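To make the option 2 field layout more concrete, here is a minimal sketch of the naming rule suggested by the example (title, title_eng, title_ita). The rule itself is an assumption inferred from that listing, and the class and method names (LanguageFields, fieldName, fieldsToSearch) are hypothetical. It also shows why query creation gets heavier: a search on one logical field has to cover every language-suffixed variant.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not existing GeoNetwork code.
final class LanguageFields {
    // Assumed rule: the main language keeps the bare field name,
    // every other language gets a "_<lang>" suffix.
    static String fieldName(String base, String lang, String mainLang) {
        return lang.equals(mainLang) ? base : base + "_" + lang;
    }

    // All index fields a query on `base` would have to cover
    // when it should match in any of the given languages.
    static List<String> fieldsToSearch(String base, List<String> langs, String mainLang) {
        List<String> fields = new ArrayList<>();
        for (String lang : langs) {
            String name = fieldName(base, lang, mainLang);
            if (!fields.contains(name)) {
                fields.add(name);
            }
        }
        return fields;
    }
}
```

For a record whose main language is fra, a title search across fra, eng and ita would need one clause per field: title, title_eng and title_ita.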
With option 3, the main language defines which index stores the record. Index content will be similar to option 1, but a specific Analyzer and language-specific options could be set up. From a community perspective we should probably focus on a basic implementation of option 3 (no advanced Lucene functionality) with MultiSearcher support based on one index per language, in order to improve support for multilingual catalogues. Option 3 will help the implementation of « narrow your search » functionality (stopword list, ...).
Advanced functionality (stop words, scoring over multiple languages, multi-analyzer support) is not used in GeoNetwork for now.
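The routing idea behind option 3 can be sketched without Lucene. The following is an illustration only: plain in-memory lists stand in for the per-language Lucene indexes, a loop over them stands in for MultiSearcher, matching is a simple case-insensitive substring test, and all names (PerLanguageIndexes, add, searchOne, searchAll) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for one Lucene index per main language.
final class PerLanguageIndexes {
    // Language code -> texts of the records stored in that language's index.
    private final Map<String, List<String>> indexes = new LinkedHashMap<>();

    // The record's main language decides which index receives it.
    void add(String mainLanguage, String text) {
        indexes.computeIfAbsent(mainLanguage, k -> new ArrayList<>()).add(text);
    }

    // Search a single language index (case-insensitive substring match).
    List<String> searchOne(String language, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (String text : indexes.getOrDefault(language, Collections.<String>emptyList())) {
            if (text.toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(text);
            }
        }
        return hits;
    }

    // Search every language index in turn and merge the hits.
    List<String> searchAll(String term) {
        List<String> hits = new ArrayList<>();
        for (String language : indexes.keySet()) {
            hits.addAll(searchOne(language, term));
        }
        return hits;
    }
}
```

In a real implementation each index would be opened with the Analyzer suited to its language, which is the main benefit this layout is meant to provide.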
Backwards Compatibility Issues
Risks
Participants
- List of participants and role (if necessary) in current GIP