XPATH
author: Heikki Doeleman
This page describes the current ideas about implementing support for XPATH in the ebRIM project.
Introduction
XPATH is a language to precisely select a set of nodes in an XML document. It is also sometimes used to address certain parts of Java object graphs. One of the requirements for the ebRIM project is to support XPATH queries in OGC Filters.
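To make "selecting a set of nodes" concrete, here is a minimal sketch that uses only the XPath support bundled with the JDK (javax.xml.xpath). The XML snippet, the class name XPathSelectSketch and the method selectIds are made up for illustration; the snippet is a namespace-free stand-in, not real ebRIM data.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathSelectSketch {
    // Hypothetical, simplified document; real ebRIM documents use the rim: namespace.
    static final String XML =
        "<RegistryObjectList>"
      + "<ExtrinsicObject id='a' objectType='Dataset'><Slot name='language'/></ExtrinsicObject>"
      + "<ExtrinsicObject id='b' objectType='Service'/>"
      + "</RegistryObjectList>";

    // Evaluates the expression against XML and returns the id attribute
    // of every selected element, in document order.
    static List<String> selectIds(String xpathExpression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                xpathExpression, new InputSource(new StringReader(XML)), XPathConstants.NODESET);
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                ids.add(((Element) nodes.item(i)).getAttribute("id"));
            }
            return ids;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("invalid XPATH: " + xpathExpression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(selectIds("//ExtrinsicObject[@objectType='Dataset']"));
    }
}
```

This is exactly the "evaluate against a complete document" case: it only works because the whole document is available in memory.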
Issue: XPATH on Lucene
It would be straightforward to evaluate XPATH against complete ebRIM documents (domain object graphs). However, we do not hold all ebRIM data in memory, so that is not an option. This suggests that the best option is to evaluate XPATH queries against the Lucene index. How should we go about doing that?
A Google search does not return many insightful results. One implementation of XPATH against Lucene can be found in the Alfresco CMS, albeit with severe limitations: "does not currently support the attribute axis or predicates".
A hint as to their implementation is the description of Alfresco's Lucene index. It appears they use a dedicated field in the index for XPATH to operate on:
PATH
 * An XPATH expression used to select nodes
 * This should only be accessed via a phrase query (i.e. in "") as it requires special tokenisation
I'm currently looking into their sources to learn more about their implementation, specifically org.alfresco.util.SearchLanguageConversion.
UPDATE: the class mentioned above does not exist (or no longer exists). I've looked at Alfresco's alfresco-community-sdk-2.9.0B source, as they indicate that their version 3.0 is broken. In this source package there is a class org.alfresco.repo.search.impl.lucene.LuceneXPathHandler. However, just from reading the code I can see an inevitable NullPointerException, so I don't think this works all that well either. And of course the sources have very little inline documentation. Hmmmph...
An exchange on the Solr mailing list points in the same direction (i.e. storing XPATH info in the Lucene index).
Then there's Nux, which also seems to support XPATH on Lucene: "Arbitrary Lucene fulltext queries can be run from Java or from XQuery/XPath/XSLT via a simple extension function." I have no clue yet how they do it, but I'm taking a peek at their source very soon.
Storing XPATH information in the index vs. mapping XPATH queries directly to structured Lucene queries
Erik and I have started a highly polemical debate about which is the better approach: storing XPATH info in Lucene, or mapping XPATH queries to more structured Lucene queries.
Stored XPATH approach
 * use an extra field in the Lucene index to store XPATH information
 * the information stored at index time could be a LocationPath to the object being indexed, which could be matched at search time against XPATH queries
 * Erik thinks this approach would 'pollute' our domain-driven development model, as there is no intrinsic justification for this extra index field in our domain.
 * Heikki thinks that doesn't matter, because the structure of a Lucene index is inherently not domain-driven (for example, some fields are repeated as TOKENIZED and NON-TOKENIZED to allow for both full text search and ordering of results).
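A rough sketch of the stored-path idea, to make the discussion concrete. It deliberately avoids the Lucene API: a plain map stands in for the extra index field, and matching is done with a regular expression rather than a phrase query. The class StoredPathSketch, its methods and the example paths are all hypothetical, and only predicate-free paths with "/", "//" and "*" are handled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

class StoredPathSketch {
    // Stand-in for the extra index field: document id -> location path stored at index time.
    private final Map<String, String> pathField = new LinkedHashMap<>();

    void index(String docId, String locationPath) {
        pathField.put(docId, locationPath);
    }

    // Translates a predicate-free XPATH into a regex over stored paths:
    // "//" allows any number of intermediate steps, "*" matches one step name.
    static Pattern toPattern(String xpath) {
        StringBuilder regex = new StringBuilder();
        int i = 0;
        while (i < xpath.length()) {
            if (xpath.startsWith("//", i)) {
                regex.append("/(?:[^/]+/)*");
                i += 2;
            } else if (xpath.charAt(i) == '/') {
                regex.append('/');
                i++;
            } else if (xpath.charAt(i) == '*') {
                regex.append("[^/]+");
                i++;
            } else {
                int end = i;
                while (end < xpath.length() && xpath.charAt(end) != '/' && xpath.charAt(end) != '*') {
                    end++;
                }
                regex.append(Pattern.quote(xpath.substring(i, end)));
                i = end;
            }
        }
        return Pattern.compile(regex.toString());
    }

    // Returns the ids of all documents whose stored path matches the query, in insertion order.
    List<String> search(String xpath) {
        Pattern pattern = toPattern(xpath);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> entry : pathField.entrySet()) {
            if (pattern.matcher(entry.getValue()).matches()) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }
}
```

Note that this sketch scans every stored path. In a real index the stored path would presumably need its own tokenisation so that Lucene can do the matching, as the Alfresco description hints.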
Structured Lucene query approach
 * use Lucene queries that refer to properties of the queried object using dot notation
 * example: ExtrinsicObject.classificationList.Classification.id=someClassificationId
 * this would obviate the need for an extra field in the index
 * it is not known whether such Lucene queries actually work with our index structure. Jose?
 * it might incur severe performance penalties: consider the XPATH //*[@classificationId='someClassificationId']. It seems we'd need to look at all index entries to see what matches?
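The mapping in the structured approach could look roughly like the sketch below. It is an assumption-laden illustration, not a worked-out design: the class DottedFieldQuerySketch and its method are hypothetical, no Lucene API is involved, and it only accepts an absolute path of plain (unprefixed) child steps ending in a single attribute predicate. Anything else is rejected, which is where the performance concern about // and * shows up: there is no single field name to query.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DottedFieldQuerySketch {
    // Accepts only: /step/step/...[@attribute='value']
    private static final Pattern SIMPLE_PATH = Pattern.compile(
        "((?:/[A-Za-z_][\\w.-]*)+)\\[@([A-Za-z_][\\w.-]*)\\s*=\\s*'([^']*)'\\]");

    // Maps the XPATH to "dotted.field.name=value", in the notation used on this page.
    static String toFieldQuery(String xpath) {
        Matcher m = SIMPLE_PATH.matcher(xpath);
        if (!m.matches()) {
            // e.g. "//" or "*": the path does not identify exactly one index field
            throw new IllegalArgumentException("no single index field for: " + xpath);
        }
        String field = m.group(1).substring(1).replace('/', '.') + "." + m.group(2);
        return field + "=" + m.group(3);
    }

    public static void main(String[] args) {
        System.out.println(toFieldQuery(
            "/ExtrinsicObject/classificationList/Classification[@id='someClassificationId']"));
    }
}
```

Whether the resulting field names exist in our index is still the open question raised above.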
We should decide on this matter very soon.
Other considerations
The above design decision is about how to relate an XPATH LocationPath to a valid result set from the Lucene index. XPATH is a much larger language, though: it supports many constructs that combine LocationPaths (such as disjuncts), a set of built-in functions (for example arithmetic ones, like round()), a set of operators (for example arithmetic ones, like mod), and a set of axes which refer to other parts of the document in relation to the current node (such as following-sibling).
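For reference, a small sketch of what these constructs look like when evaluated with the JDK's built-in XPath engine. The XML snippet, the class XPathFeaturesSketch and the helper eval are invented for this illustration and have nothing to do with the ebRIM model.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class XPathFeaturesSketch {
    // Hypothetical document, just enough to exercise functions, operators and axes.
    static final String XML = "<list><item>a</item><item>b</item><item>c</item></list>";

    // Evaluates the expression against XML and returns the result converted to a string.
    static String eval(String expression) {
        try {
            return XPathFactory.newInstance().newXPath()
                .evaluate(expression, new InputSource(new StringReader(XML)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("invalid XPATH: " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(eval("round(2.5)"));                            // built-in function
        System.out.println(eval("7 mod 3"));                               // arithmetic operator
        System.out.println(eval("count(//item[1] | //item[3])"));          // two LocationPaths combined with |
        System.out.println(eval("//item[1]/following-sibling::item[1]"));  // axis relative to the current node
    }
}
```

None of these have an obvious counterpart in a Lucene query, which is why they need to be handled by an interpreter on top of whatever LocationPath matching we choose.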
I have already created code for an XPATH interpreter that recursively processes an XPATH expression to the desired level, so it seems it won't be too difficult to support these 'advanced' features. The interpreter uses the JXPath library. An example is:
input XPATH:
//rim:ExtrinsicObject[@objectType='urn:x-ogc:specification:csw-ebrim-cim:ObjectType:MetadataInformation']/rim:Slot[@name='http://purl.org/dc/elements/1.1/language' and @slotType='urn:ogc:def:dataType:RFC-4646:Language']

output interpreter:
'locationpath': //rim:ExtrinsicObject[@objectType = 'urn:x-ogc:specification:csw-ebrim-cim:ObjectType:MetadataInformation']/rim:Slot[@name = 'http://purl.org/dc/elements/1.1/language' and @slotType = 'urn:ogc:def:dataType:RFC-4646:Language']
step:
xpath axis: descendant-or-self
step: rim:ExtrinsicObject[@objectType = 'urn:x-ogc:specification:csw-ebrim-cim:ObjectType:MetadataInformation']
xpath axis: child
operation 'compare': @objectType = 'urn:x-ogc:specification:csw-ebrim-cim:ObjectType:MetadataInformation'
symbol: =
'locationpath': @objectType
step: @objectType
xpath axis: attribute
'constant': 'urn:x-ogc:specification:csw-ebrim-cim:ObjectType:MetadataInformation'
step: rim:Slot[@name = 'http://purl.org/dc/elements/1.1/language' and @slotType = 'urn:ogc:def:dataType:RFC-4646:Language']
xpath axis: child
operation 'and': @name = 'http://purl.org/dc/elements/1.1/language' and @slotType = 'urn:ogc:def:dataType:RFC-4646:Language'