Hibernate Search
author: Heikki Doeleman
This page describes GeoNetwork's usage of the Hibernate Search library.
Introduction
Hibernate Search is a library that combines the strengths of full text search using Lucene with Hibernate's O/R mapping capabilities. Queries in Hibernate Search are expressed as wrappers around Lucene queries. Hibernate Search appears to offer two principal advantages over directly using Lucene plus a database: (1) Lucene information (about the index, the analyzers to be used, etc.) is expressed using annotations on the domain objects involved; and (2) synchronization (re-indexing) is triggered automatically when Hibernate makes a change to the database.
Directory
A Lucene index is represented by a Directory. We will use a file system directory provider to store the Lucene index persistently, and an in-memory directory provider for unit tests.
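As a minimal sketch, the two directory providers could be selected through Hibernate Search configuration properties. The property keys and provider class names below are those of Hibernate Search 3.x and should be checked against the version actually in use. The class name IndexDirectoryConfig and the index path are hypothetical and only serve the illustration.

```java
import java.util.Properties;

public class IndexDirectoryConfig {

    /** Properties for a persistent, file-system based index (indexBase is a placeholder path). */
    public static Properties persistent(String indexBase) {
        Properties p = new Properties();
        p.setProperty("hibernate.search.default.directory_provider",
                "org.hibernate.search.store.FSDirectoryProvider");
        p.setProperty("hibernate.search.default.indexBase", indexBase);
        return p;
    }

    /** Properties for an in-memory index, intended for unit tests only. */
    public static Properties inMemory() {
        Properties p = new Properties();
        p.setProperty("hibernate.search.default.directory_provider",
                "org.hibernate.search.store.RAMDirectoryProvider");
        return p;
    }

    public static void main(String[] args) {
        // Print both property sets; the path is an assumption, not a GeoNetwork default.
        persistent("/var/lucene/indexes").list(System.out);
        inMemory().list(System.out);
    }
}
```

The same keys can be placed in the Hibernate configuration file instead; building them in code only makes it easy to switch providers between the application and its unit tests.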
Analyzer
Lucene offers many ways to make searching more precise, through the choice of tokenizers and filters. We have selected the tokenizers and filters that make the most sense for our use. Using all available filters would not be sensible: there are too many of them and performance would suffer. Each installation of the Ebrim application can easily be configured with its own set of tokenizers and filters.
These are the tokenizers and filters used:
StandardTokenizer : The StandardTokenizer should support most needs for text in English (and most European languages). It splits words at punctuation characters and removes the punctuation, with a couple of exception rules.
StandardFilter : The StandardFilter removes the possessive 's from the end of words and removes dots from acronyms.
LowerCaseFilter : The LowerCaseFilter changes all characters to lower case.
ISOLatin1AccentFilterFactory : Replaces accented characters in the ISO Latin 1 character set with their unaccented equivalents.
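To give a feeling for what this chain does to a piece of text, here is a rough approximation written with the Java standard library only. It is not Lucene code and does not reproduce Lucene's exact rules (acronyms, for example, are not handled); the class AnalysisChainSketch and its method are hypothetical.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class AnalysisChainSketch {

    /** Approximates tokenizing, possessive removal, lower-casing and accent folding. */
    public static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Crude stand-in for StandardTokenizer: split on anything that is
        // not a letter, a digit or an apostrophe.
        for (String raw : text.split("[^\\p{L}\\p{N}']+")) {
            // Stand-in for StandardFilter: drop a trailing possessive 's.
            String token = raw.endsWith("'s") ? raw.substring(0, raw.length() - 2) : raw;
            // Stand-in for LowerCaseFilter.
            token = token.toLowerCase(Locale.ROOT);
            // Stand-in for the accent filter: decompose characters, then
            // strip the combining marks.
            token = Normalizer.normalize(token, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // \u00e9 is e-acute and \u00c9 is E-acute, escaped to avoid source encoding issues.
        System.out.println(analyze("Caf\u00e9's Donn\u00e9es, \u00c9T\u00c9"));
        // prints [cafe, donnees, ete]
    }
}
```

The point of the sketch is that a query for "cafe" and a document containing "Café's" end up as the same term, which is the behaviour the filter selection above is meant to achieve.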
Implementation Analyzer
By default Hibernate Search is configured with Lucene's StandardAnalyzer. A single default analyzer can be set via its configuration. Any other combination has to be defined via the Hibernate Search annotations. An example of an annotation combining the StandardTokenizer with the StandardFilter, LowerCaseFilter and ISOLatin1AccentFilter (each referenced through its Solr factory class) is this:
@AnalyzerDef(name = "customanalyzer", // placeholder name; @AnalyzerDef requires one
    tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class), filters = {
    @TokenFilterDef(factory = StandardFilterFactory.class),
    @TokenFilterDef(factory = LowerCaseFilterFactory.class),
    @TokenFilterDef(factory = ISOLatin1AccentFilterFactory.class)
})
This requires an extra library, which is added to the application with the following Maven dependency:
<dependency>
    <groupId>org.apache.solr</groupId>
    <artifactId>solr-lucene-analyzers</artifactId>
    <version>1.3.0</version>
</dependency>
For some reason Eclipse keeps complaining that org.apache.solr.analysis.TokenFilterFactory should be added to the classpath, whereas running the artifact from the command line gives no problem. For now this change has not been committed. It remains to be investigated whether there is a real problem or whether this is an Eclipse bug (most likely).
Indexing
It seems straightforward to use an asynchronous thread in Hibernate Search to do the indexing. Will we use that approach?
We should use transparent indexing, except at application start-up, where we must define some strategy. The current approach in GeoNetwork is to check whether a Lucene index is present and, if not, to build it.
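The start-up check could look roughly like the sketch below. It assumes that a file-system index is in use and that "index present" can be approximated by "the index directory exists and is not empty". The class StartupIndexCheck and the rebuild callback are hypothetical; what the callback does (for instance re-indexing all entities through Hibernate Search) is left open here.

```java
import java.io.File;

public class StartupIndexCheck {

    /** Returns true when the directory exists and holds at least one entry. */
    public static boolean indexPresent(File indexDir) {
        // list() returns null when the path does not exist or is not a directory.
        String[] entries = indexDir.list();
        return entries != null && entries.length > 0;
    }

    /** Runs the rebuild action only when no index is found; returns whether it ran. */
    public static boolean ensureIndex(File indexDir, Runnable rebuild) {
        if (indexPresent(indexDir)) {
            return false;
        }
        rebuild.run();
        return true;
    }

    public static void main(String[] args) {
        // The path is a placeholder, not a GeoNetwork default.
        File indexDir = new File("/var/lucene/indexes");
        boolean rebuilt = ensureIndex(indexDir,
                () -> System.out.println("No index found, rebuilding..."));
        System.out.println("Rebuilt at start-up: " + rebuilt);
    }
}
```

A non-empty directory is only a heuristic: it does not show that the index is complete or in sync with the database, so a stricter strategy may still be needed.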
Lucene performance
Lucene is known to be very fast. Even so, we can and should do a performance analysis of our own, using Search Quality Benchmarking.