Changes between Initial Version and Version 1 of MultilingualIndexMechanism


Timestamp:
Aug 13, 2008, 1:04:40 AM (16 years ago)
Author:
Fxp
Comment:

--

  • MultilingualIndexMechanism

     1= Proposal number : Proposal title =
     2
     3|| '''Date''' || 2008/08/13 ||
     4|| '''Contact(s)''' || fxprunayre ||
     5|| '''Last edited''' || [[Timestamp]] ||
     6|| '''Status''' || draft ||
     7|| '''Assigned to release''' || to be determined ||
     8|| '''Resources''' || Resource available ||
     9
     10== Overview ==
     11Multilingual element indexing mechanism.
     12
     13A metadata record has:
     14 * one main language (for ISO, FGDC, and Dublin Core)
     15 * some elements in multiple languages (ISO only)
     16 * some elements stored as references using XLink (which could be multilingual, like keywords)
     17
     18To improve search results, a filter could be applied that replaces accented characters in the ISO Latin 1 character set (ISO-8859-1) with their unaccented equivalents. Case is not altered (see org.apache.lucene.analysis.ISOLatin1AccentFilter).
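
As a rough illustration of the accent-folding idea, here is a minimal Python sketch. It is not the Lucene filter itself: it relies on Unicode decomposition from the standard library, and the function name `fold_accents` is made up for this example. The Lucene filter may treat some characters (ligatures, for instance) differently.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Replace accented characters with their unaccented base letter.

    Assumption: Unicode NFD decomposition is used as a stand-in for the
    character mapping of the Lucene filter. Case is left untouched.
    """
    # Split each accented character into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, keep everything else as-is.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


folded_title = fold_accents("cours d'eau du canton de Genève")
```

With this folding applied at both index and query time, a search for "Geneve" and a search for "Genève" would match the same terms.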
     19
     20=== Proposal Type ===
     21 * '''Type''': Index
     22 * '''App''': !GeoNetwork
     23 * '''Module''': Lucene Index
     24
     25=== Links ===
     26 * '''Documents''':
     27  * http://www.mail-archive.com/java-user@lucene.apache.org/msg01295.html
     28  * http://www.mail-archive.com/java-user@lucene.apache.org/msg02736.html
     29 * '''Email discussions''':
     30 * '''Other wiki discussions''':
     31
     32=== Voting History ===
     33 * Vote proposed by X on Y, result was +/-n (m non-voting members).
     34
     35----
     36
     37== Motivations ==
     38
     39
     40== Proposal ==
     41Index structure options:
     42 * 1. one index for all languages
     43 * 2. one index for all languages, with an extra language field so searches can be constrained to a particular language
     44 * 3. separate indices for each language
     45
     46Option 1 is the approach currently used by GeoNetwork:
     47 * <language>fra
     48 * <title>cours d'eau du canton de Genève
     49 * <keyword>cours d'eau
     50 * <any>fiume
     51 * <any>rivers
     52 * <any>cours d'eau
     53 * <any>Geneva
     54 * <any>canton
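
The field layout listed above can be sketched with a toy inverted index. This is only an illustration in plain Python, not GeoNetwork or Lucene code; `build_single_index`, `search`, and the record id `rec1` are hypothetical names.

```python
from collections import defaultdict


def tokenize(text):
    # Naive analyzer: lowercase and split on whitespace.
    return text.lower().split()


def build_single_index(records):
    """Map (field, token) -> set of record ids, all languages mixed."""
    index = defaultdict(set)
    for record_id, fields in records.items():
        for field, values in fields.items():
            for value in values:
                for token in tokenize(value):
                    index[(field, token)].add(record_id)
    return index


def search(index, field, term):
    return sorted(index.get((field, term.lower()), set()))


# One French record whose translations only appear in the "any" field.
records = {
    "rec1": {
        "language": ["fra"],
        "title": ["cours d'eau du canton de Genève"],
        "keyword": ["cours d'eau"],
        "any": ["fiume", "rivers", "cours d'eau", "Geneva", "canton"],
    }
}
index = build_single_index(records)
```

In this sketch, `search(index, "any", "fiume")` finds the record, but `search(index, "title", "rivers")` does not: translated values are reachable only through the catch-all field, and a query cannot say which language a term belongs to.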
     55
     56Requirement: add an index field for the language.
     57
     58Option 2 would allow term searches restricted to one language.
     59The index field structure for one document could be:
     60 * <language>fra
     61 * <title>cours d'eau du canton de Genève
     62 * ...
     63 * <title_eng>rivers in Geneva canton
     64 * <title_ita>fiume en Geneva canton
     65 * <keyword>cours d'eau
     66 * <keyword_ita>fiume
     67 * <any>fiume
     68 * <any>rivers
     69 * <any>cours d'eau
     70 * <any>Geneva
     71 * <any>canton
     72Option 2 has drawbacks, however.
     73European and non-European languages (e.g. Chinese, Japanese, Korean) could be mixed in the same index, but search results could be inconsistent because they require different analyzers. Storing fields with a language tag would also complicate BooleanQuery creation.
     74Advanced Lucene functionality (e.g. stop word lists) could not be used.
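
A minimal sketch of how the language-suffixed fields of option 2 might be derived for one record. The helper `build_fields` and its input shape are assumptions made for illustration. Whole values are copied to `any` here for brevity.

```python
def build_fields(main_lang, values_by_lang):
    """Return (field_name, text) pairs for one record.

    values_by_lang maps an element name to {language code: text}.
    Values in the main language keep the plain field name; other
    languages get a "_<lang>" suffix. Every value also goes to "any".
    """
    fields = [("language", main_lang)]
    for element, translations in values_by_lang.items():
        for lang, text in translations.items():
            name = element if lang == main_lang else f"{element}_{lang}"
            fields.append((name, text))
            fields.append(("any", text))
    return fields


fields = build_fields(
    "fra",
    {
        "title": {
            "fra": "cours d'eau du canton de Genève",
            "eng": "rivers in Geneva canton",
        },
        "keyword": {"fra": "cours d'eau", "ita": "fiume"},
    },
)
```

The sketch also hints at the query-side cost: a search on "title" in any language has to enumerate `title`, `title_eng`, `title_ita`, and so on, which is where building the boolean query becomes awkward.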
     75
     76With option 3, the main language of a record determines which index it is stored in. Index content would be similar to option 1, but a specific Analyzer and language-specific options could be configured for each index.
     77From a community perspective, we should probably focus on a basic implementation of option 3 (no advanced Lucene functionality), with MultiSearcher support based on one index per language, in order to improve support for multilingual catalogues. Option 3 would also help implement « narrow your search » functionality (stop word lists, ...).
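
The following toy sketch shows the shape of option 3: one index per main language, each with its own analysis step, and a search that fans out over several indexes in the spirit of a multi-searcher. The class `LanguageIndexes`, the `analyze` helper, and the tiny stop word lists are hypothetical and only stand in for real Lucene analyzers.

```python
from collections import defaultdict

# Illustrative stop word lists, one per language (assumed values).
STOP_WORDS = {
    "fra": {"du", "de", "le", "la"},
    "eng": {"in", "the", "of"},
}


def analyze(text, lang):
    # Language-specific analysis: lowercase, split, drop stop words.
    stop = STOP_WORDS.get(lang, set())
    return [tok for tok in text.lower().split() if tok not in stop]


class LanguageIndexes:
    def __init__(self):
        # lang -> token -> set of record ids
        self.indexes = defaultdict(lambda: defaultdict(set))

    def add(self, record_id, main_lang, text):
        # The record's main language selects the index it is stored in.
        for tok in analyze(text, main_lang):
            self.indexes[main_lang][tok].add(record_id)

    def search(self, term, langs=None):
        # Query every requested index with that index's own analyzer.
        hits = set()
        for lang in (langs or list(self.indexes)):
            for tok in analyze(term, lang):
                hits |= self.indexes[lang].get(tok, set())
        return sorted(hits)


idx = LanguageIndexes()
idx.add("rec1", "fra", "cours d'eau du canton de Genève")
idx.add("rec2", "eng", "rivers in Geneva canton")
```

Here `idx.search("canton")` queries both indexes, while `idx.search("canton", langs=["eng"])` narrows the search to the English index. Because each index has its own analysis, the French stop word "du" is never indexed for `rec1`.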
     78
     79Advanced functionality (stop words, scoring across multiple languages, multi-analyzer support) is not currently used in GeoNetwork.
     80
     81=== Backwards Compatibility Issues ===
     82
     83== Risks ==
     84
     85== Participants ==
     86 * List of participants and role (if necessary) in current GIP
     87