= Proposal number : Proposal title =

|| '''Date''' || 2008/08/13 ||
|| '''Contact(s)''' || fxprunayre ||
|| '''Last edited''' || [[Timestamp]] ||
|| '''Status''' || draft ||
|| '''Assigned to release''' || to be determined ||
|| '''Resources''' || Resource available ||

== Overview ==

Multilingual element indexing mechanism.

Metadata records have:
 * one main language (for ISO, FGDC, DC)
 * some elements in multiple languages (ISO only)
 * some elements stored as references using XLink (which could be multilingual, like keywords)

To improve search results, a filter will replace accented characters in the ISO Latin 1 character set (ISO-8859-1) with their unaccented equivalents. Case will not be altered (see org.apache.lucene.analysis.ISOLatin1AccentFilter).

=== Proposal Type ===
 * '''Type''': Index
 * '''App''': !GeoNetwork
 * '''Module''': Lucene Index

=== Links ===
 * '''Documents''':
   * http://www.mail-archive.com/java-user@lucene.apache.org/msg01295.html
   * http://www.mail-archive.com/java-user@lucene.apache.org/msg02736.html
 * '''Email discussions''':
 * '''Other wiki discussions''':

=== Voting History ===
 * Vote proposed by X on Y, result was +/-n (m non-voting members).

----

== Motivations ==

== Proposal ==

Index structure options:
 1. one index for all languages
 2. one index for all languages, with an extra language field so searches can be constrained to a particular language
 3. separate indices for each language

Option 1 is the option used by GeoNetwork:
 * fra
 * cours d'eau du canton de Genève
 * <keyword>cours d'eau
 * <any>fiume
 * <any>rivers
 * <any>cours d'eau
 * <any>Geneva
 * <any>canton

Option 2 requires adding an index field for the language. It could allow term searches in a single language. The index field structure for one document could be:
 * <language>fra
 * <title>cours d'eau du canton de Genève
 * ...
 * <title_eng>rivers in Geneva canton
 * <title_ita>fiume en Geneva canton
 * <keyword>cours d'eau
 * <keyword_ita>fiume
 * <any>fiume
 * <any>rivers
 * <any>cours d'eau
 * <any>Geneva
 * <any>canton

Option 2 has drawbacks, however:
 * European and non-European languages (Chinese, Japanese, Korean) would be mixed in the same index, and search results could be inconsistent because these languages require different analyzers.
 * Storing fields with a language tag will complicate BooleanQuery creation.
 * Advanced Lucene functionality (e.g. stop word lists) could not be used.

With option 3, the main language of a record determines which index stores it. Index content will be similar to option 1, but a specific Analyzer and language-specific options could be set up for each index.

From a community perspective, we should probably focus on a basic implementation of option 3 (no advanced Lucene functionality) with MultiSearcher support based on one index per language, in order to improve support for multilingual catalogues.

Option 3 will help the implementation of "narrow your search" functionality (stop word lists, ...). Advanced functionality (stop words, scoring over multiple languages, multi-analyzer support) is not used in GeoNetwork for now.

=== Backwards Compatibility Issues ===

== Risks ==

== Participants ==
 * List of participants and role (if necessary) in current GIP
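
As an illustration of the accent folding described in the Overview, the following is a minimal sketch of the idea, not the Lucene filter itself. It uses the JDK's `java.text.Normalizer` rather than `org.apache.lucene.analysis.ISOLatin1AccentFilter`; the class name `AccentFolding` and the method `fold` are hypothetical names chosen for this example. Characters without a canonical decomposition (for instance ß or Æ) pass through unchanged here, so it only approximates what an index-time filter would do.

```java
import java.text.Normalizer;

// Hypothetical helper for illustration only; not part of GeoNetwork or Lucene.
class AccentFolding {

    // Decompose each accented character into base letter + combining marks (NFD),
    // then drop the combining marks. Case is left untouched.
    static String fold(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // \u00e8 is "è", written as an escape to avoid source-encoding issues.
        String title = "cours d'eau du canton de Gen\u00e8ve";
        System.out.println(fold(title)); // prints: cours d'eau du canton de Geneve
    }
}
```

Applying the same folding to both the indexed text and the query string would let a search for "Geneve" match a record titled "Genève", which is the behaviour the Overview aims for.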