wiki:MultilingualIndexMechanism

Version 1 (modified by Fxp, 16 years ago)

--

Proposal number : Proposal title

Date: 2008/08/13
Contact(s): fxprunayre
Last edited: Timestamp
Status: draft
Assigned to release: to be determined
Resources: Resource available

Overview

Multilingual element indexing mechanism.

A metadata record has:

  • one main language (for ISO, FGDC, DC)
  • some elements in multiple languages (ISO only)
  • some elements stored as references using XLink (these can be multilingual, e.g. keywords)

To improve search results, a filter is used that replaces accented characters in the ISO Latin 1 character set (ISO-8859-1) with their unaccented equivalents. Case is not altered (see org.apache.lucene.analysis.ISOLatin1AccentFilter).
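The effect of such a filter can be sketched as follows. This is a minimal Python approximation based on Unicode decomposition, not a port of ISOLatin1AccentFilter. The function name fold_latin1_accents is hypothetical, and characters without a canonical decomposition (for example Æ or ß) are left unchanged here.

```python
import unicodedata


def fold_latin1_accents(text: str) -> str:
    """Replace accented Latin-1 letters with their base letter, keeping case."""
    out = []
    for ch in text:
        # Only characters in the accented part of ISO-8859-1 are considered.
        if "\u00c0" <= ch <= "\u00ff":
            # NFD splits a letter into its base letter plus combining marks.
            decomposed = unicodedata.normalize("NFD", ch)
            # Drop the combining marks; non-decomposable characters pass through.
            out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
        else:
            out.append(ch)
    return "".join(out)
```

Applied at both index time and query time, a search for "Geneve" would then match a title containing "Genève".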

Proposal Type

  • Type: Index
  • App: GeoNetwork
  • Module: Lucene Index

Voting History

  • Vote proposed by X on Y, result was +/-n (m non-voting members).

Motivations

Proposal

Index structure options:

  • 1. one index for all languages
  • 2. one index for all languages, with an extra language field so searches can be constrained to a particular language
  • 3. separate indices for each language

Option 1 is the option used by GeoNetwork. The index fields for one document are:

  • <language>fra
  • <title>cours d'eau du canton de Genève
  • <keyword>cours d'eau
  • <any>fiume
  • <any>rivers
  • <any>cours d'eau
  • <any>Geneva
  • <any>canton

Requirement: add an index field for the language.

Option 2 would allow term searches within one language. The index field structure for one document could be:

  • <language>fra
  • <title>cours d'eau du canton de Genève
  • ...
  • <title_eng>rivers in Geneva canton
  • <title_ita>fiume en Geneva canton
  • <keyword>cours d'eau
  • <keyword_ita>fiume
  • <any>fiume
  • <any>rivers
  • <any>cours d'eau
  • <any>Geneva
  • <any>canton
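The field layout above can be sketched in a few lines of Python. This is only an illustration of the naming scheme, not GeoNetwork code: build_fields and field_name are hypothetical names, and copying every value into the any field is an assumption made for brevity.

```python
def field_name(name, lang, main_lang):
    """Elements in the main language keep their name; others get a _<lang> suffix."""
    return name if lang == main_lang else f"{name}_{lang}"


def build_fields(main_lang, elements):
    """Build (field, value) pairs from (name, lang, value) tuples."""
    fields = [("language", main_lang)]
    for name, lang, value in elements:
        fields.append((field_name(name, lang, main_lang), value))
    # Catch-all field: every value is also searchable regardless of language.
    for _, _, value in elements:
        fields.append(("any", value))
    return fields
```

A search constrained to Italian would then query title_ita and keyword_ita, while an unconstrained search would query any.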

With option 2, however, European and non-European languages (Chinese, Japanese, Korean) would be mixed in the same index, and search results could be inconsistent because these languages require different analyzers. Storing fields with a language tag would also complicate BooleanQuery creation, and advanced Lucene functionality (e.g. stop word lists) could not be used.

With option 3, the main language of a record determines which index it is stored in. Index content would be similar to option 1, but a specific Analyzer and language-specific options could be set up for each index. From a community perspective, we should probably focus on a basic implementation of option 3 (no advanced Lucene functionality) with MultiSearcher support based on one index per language, in order to improve support for multilingual catalogues. Option 3 would also help implement « narrow your search » functionality (stop word lists, ...).
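The idea behind option 3 can be illustrated with a toy in-memory model: each record is routed to the index of its main language, each index applies its own analysis, and a search runs over all indices or over a chosen subset. This is a sketch only and does not use the Lucene API. LanguageIndexes, analyze and the stop word lists are hypothetical, and multi-word queries are treated as an OR of their tokens.

```python
from collections import defaultdict

# Hypothetical per-language stop word lists (illustrative values only).
STOP_WORDS = {
    "fra": {"du", "de", "le", "la"},
    "eng": {"in", "the", "of"},
}


def analyze(text, lang):
    """Lowercase, split on whitespace, and drop the stop words of the given language."""
    stop = STOP_WORDS.get(lang, set())
    return [token for token in text.lower().split() if token not in stop]


class LanguageIndexes:
    """One inverted index per language: lang -> token -> set of record ids."""

    def __init__(self):
        self.indexes = defaultdict(lambda: defaultdict(set))

    def add(self, record_id, main_lang, text):
        # The main language of the record selects both the index and the analysis.
        for token in analyze(text, main_lang):
            self.indexes[main_lang][token].add(record_id)

    def search(self, term, langs=None):
        """Search the given language indices (all of them by default) and merge the hits."""
        hits = set()
        for lang in (langs or list(self.indexes)):
            index = self.indexes.get(lang)
            if index is None:
                continue
            # The query is analyzed with the same rules as the index it targets.
            for token in analyze(term, lang):
                hits |= index.get(token, set())
        return hits
```

Restricting langs to a single language corresponds to searching one index directly, while the default corresponds to the multi-index search described above.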

Advanced functionality (stop words, scoring over multiple languages, multi-analyzer support) is not used in GeoNetwork for now.

Backwards Compatibility Issues

Risks

Participants

  • List of participants and role (if necessary) in current GIP