Changes between Version 2 and Version 3 of MultilingualIndexMechanism
- Timestamp: 11/25/08 23:13:08
MultilingualIndexMechanism
v2 → v3 ("-" marks removed lines, "+" marks added lines)

  == Overview ==
- Multilingual element indexing mechanism .
+ Multilingual element indexing mechanism to allow indexing information with specific language parameters (stopword list, analyzer). One index per language is created.

  Metadata records have:
   * one main language (for iso, fgdc, dc)
   * some elements in multiple languages (only iso)
-  * some elements stored as ref using Xlink (which could be multilingual like keywords)

  To improve search results, a filter replaces accented characters in the ISO Latin 1 character set (ISO-8859-1) with their unaccented equivalents. The case is not altered (see org.apache.lucene.analysis.ISOLatin1AccentFilter).
  …

  == Motivations ==
  Index multilingual catalogue and multilingual metadata with Lucene and improve index mechanism and search results.

  == Proposal ==
  …
   * 2.one index for all languages, with an extra language field so searches can be constrained to a particular language
   * 3.separate indices for each language
+
+ Option 3 is chosen for the following reasons:
+  * allow a specific analyzer to be used for each language (stopword list, tokenizer, analyzer)
+  * allow catalogue content in both European and non-European languages
+
+ One index for each language is created in WEB-INF/lucene/nonspatial_{iso3 language code}.
+
+ When indexing a document:
+  * define which is the main language (i.e. for iso, the one defined in gmd:language)
+  * apply index-fields to extract non-language-specific information (e.g. date, codelist) and information in the default language.
+  * apply language-index-fields to extract non-language-specific information (e.g. date, codelist) and, for each other language declared in gmd:locale, extract language-specific information.
+
+ The same document can be stored in several indices. In some situations a search could return duplicates, so a DuplicateFilter is added to the search.
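The per-language index layout and the duplicate handling described in the proposal can be sketched as follows. This is a minimal illustration that deliberately avoids the Lucene API: the class `LanguageIndexSketch`, its method names, and the language codes used are hypothetical, and `removeDuplicates` only stands in for the effect a DuplicateFilter is meant to have, not for how Lucene implements it.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for illustration; not part of GeoNetwork or Lucene. */
class LanguageIndexSketch {

    /** Directory of the index for one three-letter language code. */
    static String indexDirectory(String iso3Code) {
        return "WEB-INF/lucene/nonspatial_" + iso3Code;
    }

    /**
     * Index directories a record is written to: the main language first,
     * then every other declared locale, each language listed once.
     */
    static List<String> targetDirectories(String mainLanguage, List<String> otherLocales) {
        Set<String> languages = new LinkedHashSet<>();
        languages.add(mainLanguage);
        languages.addAll(otherLocales);
        List<String> directories = new ArrayList<>();
        for (String language : languages) {
            directories.add(indexDirectory(language));
        }
        return directories;
    }

    /** Keeps the first hit per record id, preserving hit order. */
    static List<String> removeDuplicates(List<String> recordIdsInHitOrder) {
        return new ArrayList<>(new LinkedHashSet<>(recordIdsInHitOrder));
    }
}
```

A record with main language `eng` and an additional `fre` locale would therefore be written to two directories, and a search across both indices could return its id twice before duplicates are removed.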
+
+ In general, if the user searches from an English user interface, the English index is boosted so that results from that index are returned first.
+
+ For French and German, FrenchAnalyzer and GermanAnalyzer are used to write information to the index. Otherwise a StandardAnalyzer is used.
+
+
+ === Known issue ===
+  * The sort by title option will not work properly when a record is found in an index and its title is not in that index.
+
+
+ === Backwards Compatibility Issues ===
+
+ == Risks ==
+
+ == Participants ==
+  * Jesse
+  * Francois
+
+
+
+ == More info ==

  Option 1 is the option used by GeoNetwork :
  …

  Using option 3, the main language defines which index the record is stored in. Index content will be similar to option 1, but a specific Analyzer and language-specific options could be set up.
- From a community perspective we should probably focus on a basic implementation (no advanced Lucene functionality) of option 3 with MultiSearcher support based on one index per language in order to improve support of multilingual catalogue. Option 3 will help implementation of « narrow your search » functionality (stopword list, ...).
+ From a community perspective we should probably focus on an implementation of option 3 with MultiSearcher support based on one index per language in order to improve support of multilingual catalogue. Option 3 will help implementation of « narrow your search » functionality (stopword list, ...).

  Advanced functionality (stop words, scoring over multiple languages, multi-analyzer support) is not used for now in GeoNetwork.

- === Backwards Compatibility Issues ===
-
- == Risks ==
-
- == Participants ==
-  * List of participants and role (if necessary) in current GIP
-
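The analyzer choice and the boost for the user interface language described in the added text can be sketched as follows. This is an illustration under stated assumptions, not the GeoNetwork implementation: `LanguageSearchSketch`, `Hit`, and `rank` are hypothetical names, the language codes `fre` and `ger` and the boost factor are assumed values, and analyzers are represented by their simple class names so that the sketch runs without Lucene on the classpath.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper for illustration; not part of GeoNetwork or Lucene. */
class LanguageSearchSketch {

    /** Simple name of the analyzer used to write the index of a language. */
    static String analyzerFor(String iso3Code) {
        switch (iso3Code) {
            case "fre": return "FrenchAnalyzer";   // assumed code for French
            case "ger": return "GermanAnalyzer";   // assumed code for German
            default:    return "StandardAnalyzer"; // fallback for every other language
        }
    }

    /** One search hit: record id, language of the index it came from, raw score. */
    static final class Hit {
        final String recordId;
        final String indexLanguage;
        final double score;

        Hit(String recordId, String indexLanguage, double score) {
            this.recordId = recordId;
            this.indexLanguage = indexLanguage;
            this.score = score;
        }
    }

    /**
     * Multiplies the score of hits coming from the index of the UI language
     * by the given boost, then returns all hits sorted best first.
     */
    static List<Hit> rank(List<Hit> hits, String uiLanguage, double boost) {
        List<Hit> ranked = new ArrayList<>();
        for (Hit hit : hits) {
            double factor = hit.indexLanguage.equals(uiLanguage) ? boost : 1.0;
            ranked.add(new Hit(hit.recordId, hit.indexLanguage, hit.score * factor));
        }
        ranked.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return ranked;
    }
}
```

With an English interface and a boost greater than 1, a hit from the English index can overtake a hit from another index that had a slightly higher raw score, which is the "return on top" behaviour the text describes.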