Lucene-Only search mode
Date | 2011-10-31 |
Contact(s) | François Prunayre |
Last edited | 2011-10-31T18:25:00 |
Status | Being discussed, in progress, initial implementation in progress |
Assigned to release | 2.7.x |
Resources | Available (funded by BRGM) |
Ticket # | #652 |
Overview
The current search relies on the following steps:
- Lucene Search
- Retrieve Metadata from Database (according to search criteria and paging)
- XSL presentation.
The aim of this proposal is to add a new search mode relying only on Lucene. This mode requires that all information needed to present the results be structured and/or added in the index.
The Lucene-only search is available through the "q" service, which returns an XML response. The widgets interface can be configured to use this service instead of xml.search (i.e. the classic user interface will not change).
Proposal Type
- Type: Lucene, Core Change
- App: GeoNetwork
- Module: LuceneSearcher, Widgets
Links
- Documents:
Voting History
- Vote proposed by François on 25/11/2011.
- +1 from Simon, Emanuele, Jeroen, Francois
- Comments in favor from Douglas Nebert, Andrew Walsh, Jesse Eichar
Motivations
- Performance improvements: search could be 10 to 20 times faster and support concurrent users better.
The following charts compare 3 services:
- xml.search in default mode (i.e. with DB, with XSL)
- xml.search in fast mode (i.e. with DB, no XSL)
- q service (#485), which dumps all index fields (i.e. no DB, no XSL)
Tests were run:
- with geocat.ch content (i.e. 4000 records)
- with 10 iterations
Charts:
- Chart 1: number of concurrent users increasing
- Chart 2: number of records returned per page
Proposal
The main requirement is to store all information available in the "brief" format in the index so that it can be retrieved at search time. The brief format is the pivot format used by GeoNetwork to display search results via the xml.search or main.search.embeded services. The brief format fields are the following:
- id
- uuid
- title
- abstract
- keyword
- parentId
- datasetcreationdate
- geoBox
  - westBL
  - eastBL
  - southBL
  - northBL
- Constraints (not supported/used in widget GUI, complex XML or as CData ?)
- SecurityConstraints (not supported/used in widget GUI, complex XML or as CData ?)
- LegalConstraints (not supported/used in widget GUI, complex XML or as CData ?)
- temporalExtent (not supported/used in widget GUI)
  - begin
  - end
- image type="unknown|thumbnail|overview"
- responsibleParty role="{$role}" appliesTo="resource" logo=""
- link title="" href="" name="" protocol="" type=""
- category
- + geonet:info/*
The q service (#485) needs to retrieve this information directly from the index instead of dumping all index fields. The configuration is stored in config-lucene.xsl. Every field dumped in the response MUST be stored in the index. The other fields do not need to be stored (Note: the total index size should remain similar after this change, because many fields that are currently stored are never used).
<!-- List of fields to dump when using q service. Fields must be stored in the index. -->
<dumpFields>
  <field name="_isTemplate" tagName="isTemplate"/>
  <field name="_isHarvested" tagName="isHarvested"/>
  <field name="_popularity" tagName="popularity"/>
  <field name="_rating" tagName="rating"/>
  <field name="_displayOrder" tagName="displayOrder"/>
  <field name="_view" tagName="view"/>
  <field name="_notify" tagName="notify"/>
  <field name="_download" tagName="download"/>
  <field name="_dynamic" tagName="dynamic"/>
  <field name="_featured" tagName="featured"/>
  <field name="_owner" tagName="owner"/>
  <field name="_isPublishedToAll" tagName="isPublishedToAll"/>
  <field name="_ownername" tagName="ownername"/>
  <field name="_cat" tagName="category"/>
  <field name="_valid" tagName="valid"/>
  <field name="_valid_schematron-rules-geonetwork" tagName="valid_schematron-rules-geonetwork"/>
  <field name="_valid_schematron-rules-iso" tagName="valid_schematron-rules-iso"/>
  <field name="_valid_schematron-rules-inspire" tagName="valid_schematron-rules-inspire"/>
  <field name="_valid_xsd" tagName="valid_xsd"/>
  <field name="_selected" tagName="selected"/>
  <field name="_source" tagName="source"/>
  <field name="_edit" tagName="edit"/>
  <field name="title" tagName="title"/>
  <field name="abstract" tagName="abstract"/>
  <field name="keyword" tagName="keyword"/>
  <field name="parentUuid" tagName="parentId"/>
  <field name="image" tagName="image"/>
  <field name="link" tagName="link"/>
  <field name="responsibleParty" tagName="responsibleParty"/>
  <field name="accessConstr" tagName="Constraints"/>
  <field name="otherConstr" tagName="Constraints"/>
  <field name="classif" tagName="SecurityConstraints"/>
  <field name="conditionApplyingToAccessAndUse" tagName="Constraints"/>
  <field name="datasetLang" tagName="datasetLang"/>
  <field name="language" tagName="language"/>
  <field name="spatialRepresentationType" tagName="spatialRepresentationType"/>
  <field name="serviceType" tagName="serviceType"/>
  <field name="geoBox" tagName="geoBox"/>
</dumpFields>
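To illustrate how such a configuration could drive the response, here is a minimal Python sketch. It is not the GeoNetwork implementation (which is written in Java); it only assumes that the stored values of one index document are available as a mapping from field name to a list of strings. The function names `load_dump_fields` and `dump_document`, the shortened configuration and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the <dumpFields> configuration shown above.
DUMP_FIELDS_XML = """
<dumpFields>
  <field name="_isTemplate" tagName="isTemplate"/>
  <field name="_cat" tagName="category"/>
  <field name="title" tagName="title"/>
  <field name="parentUuid" tagName="parentId"/>
  <field name="geoBox" tagName="geoBox"/>
</dumpFields>
"""


def load_dump_fields(config_xml):
    """Return (index field name, response tag name) pairs in configuration order."""
    root = ET.fromstring(config_xml)
    return [(f.get("name"), f.get("tagName")) for f in root.findall("field")]


def dump_document(stored_fields, dump_fields):
    """Build a <metadata> element from the stored values of one index document.

    stored_fields is a dict mapping an index field name to a list of stored
    string values (a hypothetical stand-in for a Lucene document).
    Fields that are not listed in dump_fields are ignored.
    """
    metadata = ET.Element("metadata")
    for name, tag_name in dump_fields:
        # A multi-valued field produces one element per value.
        for value in stored_fields.get(name, []):
            ET.SubElement(metadata, tag_name).text = value
    return metadata


if __name__ == "__main__":
    doc = {
        "_isTemplate": ["n"],
        "_cat": ["maps", "datasets"],
        "title": ["Hydrological Basins in Africa"],
        "geoBox": ["-17.3|-34.6|51.1|38.2"],
        "any": ["not listed in dumpFields, so it is not dumped"],
    }
    element = dump_document(doc, load_dump_fields(DUMP_FIELDS_XML))
    print(ET.tostring(element, encoding="unicode"))
```

In this sketch the index field name and the response tag name are decoupled, which is how several index fields (for example accessConstr and otherConstr) can end up under the same Constraints tag.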
Complex fields such as image, link and geoBox, which are composed of child elements, are stored as "|"-separated values. Those fields are structured in index-fields.xsl. Example:
<Field name="geoBox"
       string="{concat(gmd:westBoundLongitude/gco:Decimal, '|',
                       gmd:southBoundLatitude/gco:Decimal, '|',
                       gmd:eastBoundLongitude/gco:Decimal, '|',
                       gmd:northBoundLatitude/gco:Decimal)}"
       store="true" index="false"/>
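For readers less familiar with XSLT, the following Python sketch mirrors what the concat() expression above computes for a bounding box. It is only an illustration, not part of GeoNetwork: the function name `encode_geobox` and the sample fragment are hypothetical, and the namespace URIs are the standard ISO 19139 ones for the gmd and gco prefixes.

```python
import xml.etree.ElementTree as ET

# Standard ISO 19139 namespaces for the gmd and gco prefixes.
NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}

# Same order as the concat() call: west, south, east, north.
GEOBOX_PATHS = [
    "gmd:westBoundLongitude/gco:Decimal",
    "gmd:southBoundLatitude/gco:Decimal",
    "gmd:eastBoundLongitude/gco:Decimal",
    "gmd:northBoundLatitude/gco:Decimal",
]

# Hypothetical sample fragment of an ISO 19139 record.
BBOX_XML = """
<gmd:EX_GeographicBoundingBox
    xmlns:gmd="http://www.isotc211.org/2005/gmd"
    xmlns:gco="http://www.isotc211.org/2005/gco">
  <gmd:westBoundLongitude><gco:Decimal>-17.3</gco:Decimal></gmd:westBoundLongitude>
  <gmd:eastBoundLongitude><gco:Decimal>51.1</gco:Decimal></gmd:eastBoundLongitude>
  <gmd:southBoundLatitude><gco:Decimal>-34.6</gco:Decimal></gmd:southBoundLatitude>
  <gmd:northBoundLatitude><gco:Decimal>38.2</gco:Decimal></gmd:northBoundLatitude>
</gmd:EX_GeographicBoundingBox>
"""


def encode_geobox(bbox):
    """Join the four bounds with '|'; a missing bound becomes an empty string."""
    values = [
        bbox.findtext(path, default="", namespaces=NS).strip()
        for path in GEOBOX_PATHS
    ]
    return "|".join(values)


if __name__ == "__main__":
    print(encode_geobox(ET.fromstring(BBOX_XML)))
```

Because the field is declared with store="true" and index="false", the encoded string is only kept for display and is not itself searchable.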
The client DataStore splits the value to extract the individual pieces of information. Example of a response from the q service:
<?xml version="1.0" encoding="UTF-8"?>
<response from="1" to="1" selected="0">
  <summary count="1" type="local" hitsusedforsummary="1"> ... </summary>
  <metadata>
    <popularity>3</popularity>
    <source>2f788e36-ca8e-4eeb-adc6-4d0c7da6eaf1</source>
    <owner>1</owner>
    <link>|Online link to the 'Water Resources and Irrigation in Africa'- website|http://www.fao.org/ag/AGL/aglw/aquastat/watresafrica/index.stm|WWW:LINK-1.0-http--link|text/html</link>
    <link>basins.zip|Hydrological basins in Africa (Shapefile Format)|http://localhost:8080/geonetwork/srv/en/resources.get?id=10&amp;fname=basins.zip&amp;access=private|WWW:DOWNLOAD-1.0-http--download|application/zip</link>
    <link>hydrological_basins|Hydrological basins in Africa|http://geonetwork3.fao.org/ows/296|OGC:WMS-1.1.1-http-get-map|application/vnd.ogc.wms_xml</link>
    <responsibleParty>pointOfContact|metadata|FAO - NRCW|</responsibleParty>
    <title>Hydrological Basins in Africa (Sample record, please remove!)</title>
    <isTemplate>n</isTemplate>
    <valid>-1</valid>
    <rating>0</rating>
    <category>maps</category>
    <category>datasets</category>
    <category>interactiveResources</category>
    <abstract>Major hydrological basins and their sub-basins. This dataset ... assigned respectively to internal sub-basins and to sub-basins draining into the sea)</abstract>
    <keyword>watersheds</keyword>
    <keyword>river basins</keyword>
    <keyword>water resources</keyword>
    <keyword>hydrology</keyword>
    <keyword>AQUASTAT</keyword>
    <keyword>AWRD</keyword>
    <keyword>Africa</keyword>
    <keyword>inlandWaters</keyword>
    <image>thumbnail|../../srv/en/resources.get?uuid=da165110-88fd-11da-a88f-000d939bc5d8&amp;fname=thumbnail_s.gif&amp;access=public</image>
    <datasetLang>eng</datasetLang>
    <geoBox>-17.3|-34.6|51.1|38.2</geoBox>
    <isHarvested>n</isHarvested>
    <spatialRepresentationType>vector</spatialRepresentationType>
    <language>eng</language>
    <geonet:info xmlns:geonet="http://www.fao.org/geonetwork">
      <id>1625</id>
      <uuid>da165110-88fd-11da-a88f-000d939bc5d8</uuid>
      <schema>iso19139</schema>
      <createDate>2007-07-19T14:45:07</createDate>
      <changeDate>2007-11-06T12:13:00</changeDate>
      <source>2f788e36-ca8e-4eeb-adc6-4d0c7da6eaf1</source>
      <edit>true</edit>
      <owner>true</owner>
      <selected>false</selected>
    </geonet:info>
  </metadata>
</response>
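The splitting done on the client side can be sketched as follows. This is a hedged Python illustration, not the actual widget DataStore code (which is JavaScript). The geoBox order follows the concat() call in index-fields.xsl; the sub-field orders for image, link and responsibleParty are assumptions inferred from the sample response, and the key names and the `split_field` function are hypothetical.

```python
from typing import Dict, List

# Sub-field order of each "|"-separated field.
# geoBox follows the concat() order used in index-fields.xsl;
# the other layouts are assumptions inferred from the sample response.
FIELD_LAYOUT: Dict[str, List[str]] = {
    "geoBox": ["westBL", "southBL", "eastBL", "northBL"],
    "image": ["type", "href"],
    "link": ["name", "title", "href", "protocol", "type"],
    "responsibleParty": ["role", "appliesTo", "organisation", "logo"],
}


def split_field(tag_name: str, value: str) -> Dict[str, str]:
    """Split a "|"-separated value into named parts.

    Missing trailing parts are returned as empty strings; extra parts
    beyond the known layout are dropped.
    """
    keys = FIELD_LAYOUT[tag_name]
    parts = value.split("|")
    parts += [""] * (len(keys) - len(parts))
    return dict(zip(keys, parts))


if __name__ == "__main__":
    print(split_field("geoBox", "-17.3|-34.6|51.1|38.2"))
    print(split_field("responsibleParty", "pointOfContact|metadata|FAO - NRCW|"))
```

One limitation of this encoding is visible in the sketch: a value that itself contains a "|" character (for example in a link title) would be split in the wrong place, so the indexing side would need to escape or strip that character.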
Backwards Compatibility Issues
None.
New libraries added
None.
Further improvements
- Use a similar mechanism for:
  - xml.relation service
  - OpenSearch
  - CSW search when the complete ISO record is not needed (e.g. dublin-core output)
  - Z39.50 ?
- Support multilingual metadata: the classic mechanism, which uses the GUI language and falls back to the main language, is not implemented for this service. It should be part of the multilingual metadata indexing proposal (http://trac.osgeo.org/geonetwork/wiki/MultilingualIndexMechanism) and requires more work.
- Create a Jeeves service with JSON output instead of XML+XSL to speed up client-side JS processing
- Keep the Searcher in the session to avoid re-running the search on every request (as main.search does with its search and present steps)
(Funding required)
Risks
Participants
- François Prunayre
- Others?
Attachments (3)
- geonetwork-pure-lucene-search.png (151.3 KB)
- geonetwork-pure-lucene-search-perf.png (96.0 KB)
- geonetwork-pure-lucene-search-perf-with-larger-number-of-results.png (86.1 KB)