Changes between Version 2 and Version 3 of MultilingualIndexMechanism

Timestamp:: 11/25/08 23:13:08
Author:: Fxp
Comment:: --

MultilingualIndexMechanism

-              v2
+              v3
 == Overview ==
 Multilingual element indexing mechanism.
+Multilingual element indexing mechanism that indexes information with language-specific parameters (stopword list, analyzer). One index is created per language.
 A metadata record has:
  * one main language (for iso, fgdc, dc)
  * some elements in multiple languages (only iso)
- * some elements stored as ref using Xlink (which could be multilingual like keywords)
 To improve search results, a filter replaces accented characters in the ISO Latin 1 character set (ISO-8859-1) with their unaccented equivalents. Case is not altered (see org.apache.lucene.analysis.ISOLatin1AccentFilter).
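To illustrate the accent folding described above, here is a minimal Python sketch. It is not the Lucene ISOLatin1AccentFilter code: the function name `fold_latin1_accents` is hypothetical, and the sketch only handles characters that have a canonical Unicode decomposition, so ligatures such as Æ or ß pass through unchanged here.

```python
import unicodedata


def fold_latin1_accents(text: str) -> str:
    """Replace accented Latin-1 letters with unaccented ones, keeping case.

    Hypothetical helper, for illustration only.
    """
    out = []
    for ch in text:
        # Only touch the accented range of ISO-8859-1 (U+00C0..U+00FF).
        if "\u00c0" <= ch <= "\u00ff":
            # NFD splits "é" into "e" plus a combining acute accent.
            decomposed = unicodedata.normalize("NFD", ch)
            # Drop the combining marks, keep the base letter.
            out.append("".join(c for c in decomposed
                               if not unicodedata.combining(c)))
        else:
            out.append(ch)
    return "".join(out)
```

With this sketch, a search for "donnees" and one for "données" would reduce to the same term, while upper and lower case stay distinct.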
 …
 == Motivations ==
+Index multilingual catalogues and multilingual metadata with Lucene, and improve the indexing mechanism and search results.
 == Proposal ==
 …
  * 2.one index for all languages, with an extra language field so searches can be constrained to a particular language
  * 3.separate indices for each language
+Option 3 is chosen for the following reasons:
+ * it allows a specific analyzer to be used for each language (stopword list, tokenizer, analyzer)
+ * it allows catalogue content in both European and non-European languages
+One index for each language is created in WEB-INF/lucene/nonspatial_{iso3 language code}.
+When indexing a document:
+ * determine the main language (i.e. for ISO, the one defined in gmd:language)
+ * apply index-fields to extract non-language-specific information (e.g. date, codelist) and information in the default language.
+ * apply language-index-fields to extract non-language-specific information (e.g. date, codelist) and, for each other language declared in gmd:locale, extract language-specific information.
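The indexing steps above can be sketched as follows. This is a Python illustration under stated assumptions, not the GeoNetwork implementation: the function names, the dictionary-based record representation, and the field names are all hypothetical. Only the directory pattern WEB-INF/lucene/nonspatial_{iso3 language code} comes from the text.

```python
def index_dir(lang_iso3: str) -> str:
    """Directory of the per-language index (pattern taken from the text)."""
    return f"WEB-INF/lucene/nonspatial_{lang_iso3}"


def plan_indexing(main_lang, other_langs, common_fields, text_by_lang):
    """Return {index directory: fields to write} for one metadata record.

    main_lang      -- main language (for ISO, from gmd:language)
    other_langs    -- other languages declared in gmd:locale
    common_fields  -- non-language-specific fields (date, codelist, ...)
    text_by_lang   -- hypothetical mapping of language -> translated fields
    """
    plan = {}
    for lang in [main_lang] + list(other_langs):
        # Every index entry carries the non-language-specific fields...
        fields = dict(common_fields)
        # ...plus the text available in that index's language.
        fields.update(text_by_lang.get(lang, {}))
        plan[index_dir(lang)] = fields
    return plan
```

For a record with main language `eng` and one additional locale `fre`, the plan contains two entries, one per index directory, each holding the shared date plus the title in its own language.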
+The same document can be stored in several indices. In some situations a search could therefore return duplicates, so a DuplicateFilter is added to the search.
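The effect of duplicate filtering on merged results can be sketched like this. It is a plain Python illustration, not the Lucene DuplicateFilter API; the hit structure and the `uuid` key are assumptions made for the example.

```python
def drop_duplicates(hits, key="uuid"):
    """Keep the first hit for each record identifier, preserving order.

    `hits` is a hypothetical list of dicts merged from several
    per-language indices; `key` names the field identifying a record.
    """
    seen = set()
    unique = []
    for hit in hits:
        if hit[key] not in seen:
            seen.add(hit[key])
            unique.append(hit)
    return unique
```

Because the first occurrence wins, the order in which the per-language results are merged decides which language version of a record is shown.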
+In general, if a user searches from a user interface in English, the English index is boosted so that information from that index is returned at the top.
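A rough sketch of this boosting idea follows. The hit structure, the function name, and the boost factor of 2.0 are assumptions chosen for illustration; the text does not say how the boost is actually applied in Lucene.

```python
def rank_hits(hits, ui_lang, boost=2.0):
    """Sort hits by score, favouring the index of the interface language.

    `hits` is a hypothetical list of dicts with "score" and "index_lang";
    `boost` is an assumed multiplier, not a value from the proposal.
    """
    def boosted_score(hit):
        factor = boost if hit["index_lang"] == ui_lang else 1.0
        return hit["score"] * factor

    return sorted(hits, key=boosted_score, reverse=True)
```

With an English interface, a hit from the English index with score 1.0 would rank above a French-index hit with score 1.5, since its boosted score is 2.0.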
+For French and German, FrenchAnalyzer and GermanAnalyzer are used to write information to the index. Otherwise a StandardAnalyzer is used.
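The analyzer selection rule can be summarised as a lookup with a fallback. In this Python sketch the analyzers are represented only by their Lucene class names as strings, and the language codes `fre` and `ger` are assumed ISO 639-2 codes; neither the mapping name nor the function name comes from the source.

```python
# Assumed iso3 codes mapped to the analyzer class names cited in the text.
ANALYZER_BY_LANG = {
    "fre": "FrenchAnalyzer",
    "ger": "GermanAnalyzer",
}


def analyzer_for(lang_iso3: str) -> str:
    """Return the analyzer name for a language, defaulting to StandardAnalyzer."""
    return ANALYZER_BY_LANG.get(lang_iso3, "StandardAnalyzer")
```

Supporting another language-specific analyzer would then amount to adding one entry to the mapping.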
+=== Known issue ===
+ * The sort-by-title option will not work properly when a record is found in an index that does not contain its title.
+=== Backwards Compatibility Issues ===
+== Risks ==
+== Participants ==
+ * Jesse
+ * Francois
+== More info ==
 Option 1 is the option used by GeoNetwork :
 …
 Using option 3, the main language defines the index in which the record is stored. Index content will be similar to option 1, but a specific Analyzer and language-specific options can be set up.
 From a community perspective we should probably focus on a basic implementation (no advanced Lucene functionality) of option 3 with MultiSearcher support based on one index per language, in order to improve support for multilingual catalogues. Option 3 will help the implementation of « narrow your search » functionality (stopword list, ...).
+From a community perspective we should probably focus on an implementation of option 3 with MultiSearcher support based on one index per language, in order to improve support for multilingual catalogues. Option 3 will help the implementation of « narrow your search » functionality (stopword list, ...).
 Advanced functionality (stop words, scoring over multiple languages, multi-analyzer support) is not used for now in GeoNetwork.
-=== Backwards Compatibility Issues ===
-== Risks ==
-== Participants ==
- * List of participants and role (if necessary) in current GIP