MapGuide RFC 123 - Feature Join optimization shortcut using FDO Join APIs
This page contains a change request (RFC) for the MapGuide Open Source project. More MapGuide RFCs can be found on the RFCs page.
|RFC Template Version||(1.0)|
|Submission Date||24 October 2011|
|Last Modified||8 November 2011|
|Assigned PSC guide(s)||(when determined)|
|Voting History||(vote date)|
This RFC proposes to take advantage of recently introduced FDO join APIs to provide an optimization path for Feature Joins in MapGuide under certain Feature Source configurations.
Feature Joins has been a traditionally underperforming and buggy aspect of MapGuide Open Source / MGE / AIMS. A look at the number of tickets in trac related to Feature Joins can attest to this fact.
Performance workarounds exist for SQL-based feature sources involving datastore-level joins encapulsated as a view (and possibly some metadata hacks in the datastore to be recognised as a feature class). This approach, while addressing the performance problem, presents its own set of issues, namely constraints imposed by the data store as a result of using a view.
As of FDO 3.6, new APIs were introduced allowing support for native joins at the data store level, which is much more efficent and better performing than the costly in-memory joins performed by the GWS Query Engine component. With some minor modifications in the feature query logic, MapGuide can take advantage of these new FDO Join APIs if the extended feature class being queried satisfies some conditions (outlined in the Proposed Solution).
By using FDO join APIs, we can tackle the performance and bugginess of Feature Joins simultaneously because both aspects are now delegated to the underlying FDO data store.
It is hoped with the implementation of this RFC that we end up with a more positive user story with regards to Feature Joins.
The solution consists of 4 parts:
- Testing for the FDO join optimization path when performing a SelectFeatures() call against an extended feature class
- Setting up the FDO join query.
- Handling the iteration logic for the results of a FDO join query.
- Handling aggregates
Each part is explained in detail below.
Testing for the optimization
When performing a feature query against an extended feature class that consists of joins, use the FDO join APIs if the extended feature class satisfies the following criteria:
- The Feature Source this extended feature class belongs to supports joins (ie. The SupportJoins capability returns true)
- The Feature Source this extended feature class belongs to joins with another feature class from the same feature source (this is because FDO Joins work within the context of the same connection). The feature class being joined on cannot itself be an extended feature class.
- The query options cannot contain computed expressions based on secondary class properties.
- The extended feature class only contains one join. Supporting chained or multiple joins is beyond the scope of this RFC.
- If this extended feature class is being queried with a filter, the filter does not contain any references to properties on the feature class to be joined on. This is because there is no way I am aware of in the FDO Join API to apply filters to the non-primary sides of the join.
- The type of join being performed is supported by the underlying FDO provider.
- All the above conditions
- The extended class is being queried with only one aggregate function: SpatialExtents(). See the section Aggregates for an explanation of why we're only supporting this one function.
If any of the above criteria is not met, the extended feature class in question is considered not to meet the FDO join requirements and is delegated off to the GWS Query Engine as it currently is.
As of writing this RFC, the following FDO providers support datastore-level joins:
- SQLite (3.7/Trunk)
- SQL Server Spatial (3.7/Trunk)
Feature Sources using the above providers that contain extended feature classes configured in the above fashion will stand to take advantage of this optimization path. Any other FDO providers that do implement the required FDO Join APIs in the future, feature sources using these providers will be automatically eligible for this optimization path if they meet the same criteria outlined above.
Setting up the FDO join query
We use the regular FdoISelect interface to perform the FDO join query. The extended feature class already uses an optional prefix on the secondary class as a means of disambiguating identically named properties on both sides of the join. We will use this same prefix to alias the secondary property names to prevent ambiguous names during query preparation.
Properties from the primary class will be specified as the following FDO computed property:
primary.[PropertyName] AS [PropertyName]
primary will be specified as the alias for the FdoISelect::SetAlias() method
Properties from the secondary class will be specified as the following FDO computed property:
secondary.[PropertyName] AS [Prefix][PropertyName]
secondary will be used as the alias for the join criteria that is added to the FdoISelect's join criteria collection.
Because the Extended Class Definition in a feature source does not explicity specify the list of properties from the secondary class to include, we include all properties from any secondary class that we are joining on by default.
Through this setup, the returned feature reader will present the same property list as a reader returned by the GWS Query Engine.
The FDO join feature reader
We will include a custom server-only implementation of FdoIFeatureReader (MgFdoForcedOneToOneFeatureReader) which wraps an existing FdoIFeatureReader to incorporate iteration logic to handle forcing 1:1 cardinality on the iterated results.
This reader behaves like a normal feature reader. All FdoIFeatureReader calls pass through to the underlying reader, except for ReadNext()
To force 1:1 cardinality, the ReadNext() method employs the following logic:
- Read the current identity property values. Hash these values into a string and check if this hashed string exists in an internal std::set.
- If this value exists, keep reading until we either get a hashed string that does not yet exist in the internal set, or until we reached the end of the feature reader, in which case we return false
- Store this hashed string into the internal set for future comparisons.
Depending on whether 1:1 cardinality is forced, the final reader that is returned to the web tier will either be:
MgServerFeatureReader (containing...) FdoIFeatureReader (from FdoISelect::Execute())
Or if 1:1 is forced
MgServerFeatureReader (containing...) MgFdoForcedOneToOneFeatureReader (containing...) FdoIFeatureReader (from FdoISelect::Execute())
Performance-wise, support for FDO joins has already been implemented in mg-desktop and the benefits there are clear. The performance times for a 17000 feature by 16000 feature join (of this dataset) are outlined below.
|Left Outer Join||479.4s||4.2s|
|Inner Join (1:1)||595.2s||4.4s|
|Left Outer Join (1:1)||320.9s||4.6s|
As can be seen from this table, FDO joins performed on average, 100 times faster than the traditional feature join counterpart! The SQLite provider implements the FDO join APIs. The SDF provider does not and is delegated to the GwsQueryEngine.
In the current implementation for the MapGuide Server, the SQLite numbers are even faster (in this particular case, up to a second faster) due to connection pooling and other caching mechanisms that are absent from the mg-desktop implementation.
If a SelectAggregates operation is invoked on an extended feature class, the same FDO join optimization condition checks are done against the extended feature class and its feature source and the same extended select command is executed.
The reason for supporting aggregates is to ensure that we can use SpatialExtents() on extended feature classes. This is what the Feature Source Previews uses to obtain a usable MBR for previewing. Thus aggregate function support in this RFC is only limited to supporting SpatialExtents() and nothing more.
Another special case with aggregate functions is the need to fetch distinct values. The UNIQUE() aggregate function is used by Maestro/AIStudio to fetch distinct values to generate themes. As performance with this function is generally respectable in its current form, if this function is detected in the aggregate query options, and the extended class being queried is capable of FDO joins, we will take the regular query path instead of the FDO join one. This is mainly to avoid the currently un-handled case of computed expressions on prefixed secondary properties in the FDO join path.
NOTE: It was originally planned to use FdoISelectAggregates with the new join criteria APIs (so the final query can be executed natively at the data store level), but it has come to our attention that support for SpatialExtents() in a select aggregates command is a case that is not yet covered by the providers that currently support these APIs. Thus the Expression Engine serves as a fallback approach until such cases are supported commonplace.
No public APIs are affected. This is a server-side modification to take advantage of new APIs introduced by an external component (FDO). At the WebTier level, processing FDO join results is still done through the existing MgProxyFeatureReader. No changes are required on the Web Tier.
No schema modifications are required for the Feature Source Schema. The extended class definition already provides enough information to construct an equivalent FDO join query.
Though not necessary, it would be nice for the join editors in Maestro and Infrastructure Studio to notify the user of such optimization availability if the edited feature source in question is configured correctly.
Add unit tests against some sample SQLite feature sources configured to take advantage of these optimizations. Verify that the FDO join optimization path is taken when selecting from these feature sources. Also exercise the SpatialExtents() aggregate function against SQLite extended feature classes which can take advantage of this optimization.
Funding / Resources