Changes between Version 2 and Version 3 of MultilingualIndexMechanism


Timestamp:
11/25/08 23:13:08 (16 years ago)
Author:
Fxp
Comment:

--

  • MultilingualIndexMechanism

    v2 v3  
    99
    1010== Overview ==
    11 Multilingual element indexing mechanism.
     11Multilingual element indexing mechanism that indexes information with language-specific parameters (stopword list, analyzer). One index is created per language.
    1212
    1313Metadata records have:
    1414 * one main language (for iso, fgdc, dc)
    1515 * some elements in multiple languages (only iso)
    16  * some elements stored as ref using Xlink (which could be multilingual like keywords)
    1716
    1817In order to improve search results, a filter is applied that replaces accented characters in the ISO Latin 1 character set (ISO-8859-1) with their unaccented equivalents. Case is not altered (see org.apache.lucene.analysis.ISOLatin1AccentFilter).
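
The effect of such an accent filter can be illustrated with a minimal sketch. This is not the Lucene filter itself: it uses Unicode decomposition from the Python standard library, and the function name `fold_accents` is made up for the example. Results may differ from ISOLatin1AccentFilter for characters that have no canonical decomposition (for instance ß or Æ).

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Replace accented characters with their unaccented base, keeping case."""
    # NFD splits "é" into "e" followed by a combining accent (category Mn).
    decomposed = unicodedata.normalize("NFD", text)
    # Dropping the combining marks leaves the base letters, case included.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


if __name__ == "__main__":
    print(fold_accents("Données Géographiques"))
```

With folding applied both at index time and at query time, a search for "donnees" can match a title containing "Données".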
     
    3635
    3736== Motivations ==
    38 
     37Index multilingual catalogues and multilingual metadata with Lucene, and improve the indexing mechanism and search results.
    3938
    4039== Proposal ==
     
    4342 * 2. one index for all languages, with an extra language field so searches can be constrained to a particular language
    4443 * 3. separate indices for each language
     44
     45Option 3 is chosen for the following reasons:
     46 * it allows a specific analyzer configuration to be used for each language (stopword list, tokenizer, analyzer)
     47 * it allows catalogue content in both European and non-European languages
     48
     49One index for each language is created in WEB-INF/lucene/nonspatial_{iso3 language code}.
     50
     51When indexing a document:
     52 * determine the main language (i.e. for ISO, the one defined in gmd:language)
     53 * apply index-fields to extract non-language-specific information (e.g. date, codelist) and the information in the default language.
     54 * apply language-index-fields, for each other language declared in gmd:locale, to extract non-language-specific information (e.g. date, codelist) and the information specific to that language.
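
The indexing steps above can be sketched as a small routing function. This is only an illustration under stated assumptions: the record layout (`main_language`, `locales`, `common`, `localized`) and the function names are hypothetical, and the real extraction is done by the index-fields and language-index-fields stylesheets, not by Python code. Only the directory pattern comes from the proposal.

```python
def index_dir(lang: str) -> str:
    """Directory of the index for one ISO3 language code (pattern from the proposal)."""
    return f"WEB-INF/lucene/nonspatial_{lang}"


def route_record(record: dict) -> dict:
    """Return {index directory: fields to write} for one metadata record.

    Hypothetical record layout:
      main_language: ISO3 code of the main language (gmd:language)
      locales:       other ISO3 codes declared in gmd:locale
      common:        non-language-specific fields (date, codelist, ...)
      localized:     {ISO3 code: language-specific fields}
    """
    common = record["common"]
    main = record["main_language"]
    # Main language index: common fields plus default-language fields.
    plan = {index_dir(main): {**common, **record["localized"].get(main, {})}}
    # Every other declared locale gets the common fields plus its own text.
    for lang in record["locales"]:
        if lang != main:
            plan[index_dir(lang)] = {**common, **record["localized"].get(lang, {})}
    return plan


if __name__ == "__main__":
    record = {
        "main_language": "eng",
        "locales": ["fre"],
        "common": {"date": "2008-11-25"},
        "localized": {"eng": {"title": "Roads"}, "fre": {"title": "Routes"}},
    }
    for directory, fields in route_record(record).items():
        print(directory, fields)
```

Because the common fields are written to every index, a date or codelist search can match the same record in several indices.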
     55
     56The same document can be stored in several indices. In some situations a search could therefore return duplicates, so a DuplicateFilter is added to the search.
     57
     58In general, if a user searches from the English user interface, the English index is boosted so that results from that index are returned first.
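
The combined effect of boosting the user interface language and filtering duplicates can be sketched as follows. This is a plain Python illustration of the intended result, not the Lucene MultiSearcher or DuplicateFilter API. The function name `merge_hits`, the `(record id, score)` hit format and the boost factor of 2.0 are assumptions made for the example.

```python
def merge_hits(hits_by_lang: dict, ui_lang: str, boost: float = 2.0) -> list:
    """Merge per-language hits, favouring the UI language and dropping duplicates.

    hits_by_lang maps an ISO3 code to a list of (record_id, score) pairs.
    Returns (record_id, language) pairs, best match first.
    """
    scored = []
    for lang, hits in hits_by_lang.items():
        # Scores from the index matching the UI language are multiplied.
        factor = boost if lang == ui_lang else 1.0
        for record_id, score in hits:
            scored.append((score * factor, record_id, lang))
    scored.sort(key=lambda hit: hit[0], reverse=True)

    seen = set()
    merged = []
    for _score, record_id, lang in scored:
        # Keep only the best-ranked occurrence of each record.
        if record_id not in seen:
            seen.add(record_id)
            merged.append((record_id, lang))
    return merged


if __name__ == "__main__":
    hits = {
        "eng": [("a", 0.5), ("b", 0.4)],
        "fre": [("a", 0.9), ("c", 0.7)],
    }
    print(merge_hits(hits, ui_lang="eng"))
```

In this example record "a" is found in both indices; only its occurrence from the boosted English index is kept.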
     59
     60For French and German, FrenchAnalyzer and GermanAnalyzer are used to write information to the index. Otherwise a StandardAnalyzer is used.
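
The analyzer choice amounts to a lookup with a fallback. A minimal sketch, assuming the ISO3 codes `fre` and `ger` are the ones used for French and German, and returning the analyzer class names as strings rather than instantiating Lucene classes:

```python
# Language-specific analyzers named in the proposal; the ISO3 keys are assumed.
ANALYZER_BY_LANGUAGE = {
    "fre": "FrenchAnalyzer",
    "ger": "GermanAnalyzer",
}


def analyzer_for(lang: str) -> str:
    """Name of the analyzer used to write to the index of the given language."""
    # Any language without a dedicated analyzer falls back to StandardAnalyzer.
    return ANALYZER_BY_LANGUAGE.get(lang, "StandardAnalyzer")


if __name__ == "__main__":
    for code in ("fre", "ger", "eng"):
        print(code, analyzer_for(code))
```

Supporting another language-specific analyzer would then only require one more entry in the mapping.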
     61
     62
     63=== Known issue ===
     64 * The sort-by-title option will not work properly when a record is found in an index that does not contain its title.
     65
     66
     67=== Backwards Compatibility Issues ===
     68
     69== Risks ==
     70
     71== Participants ==
     72 * Jesse
     73 * Francois
     74
     75
     76
     77== More info ==
    4578
    4679Option 1 is the option used by GeoNetwork :
     
    75108
    76109Using option 3, the main language defines the index in which the record is stored. Index content will be similar to option 1, but a specific Analyzer and language-specific options can be set up.
    77 From a community perspective we should probably focus on a basic implementation (no advanced Lucene functionality) of option 3 with MultiSearcher support, based on one index per language, in order to improve support for multilingual catalogues. Option 3 will help implement the « narrow your search » functionality (stopword list, ...).
     110From a community perspective we should probably focus on an implementation of option 3 with MultiSearcher support, based on one index per language, in order to improve support for multilingual catalogues. Option 3 will help implement the « narrow your search » functionality (stopword list, ...).
    78111
    79112Advanced functionality (stop words, scoring over multiple languages, multi-analyzer support) is not used for now in GeoNetwork.
    80113
    81 === Backwards Compatibility Issues ===
    82 
    83 == Risks ==
    84 
    85 == Participants ==
    86  * List of participants and role (if necessary) in current GIP
    87