Changes between Initial Version and Version 1 of RFC64-Draft


Ignore:
Timestamp:
Oct 31, 2010, 9:39:52 AM (14 years ago)
Author:
sdlime
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • RFC64-Draft

    v1 v1  
     1== RFC 64 - MapServer Expression Parser Overhaul ==
     2
     3
     4== Overview ==
     5
     6This is a draft RFC addressing 1) how the Bison/Yacc parser for logical expressions is implemented and 2) where in the MapServer code the parser can be used. This RFC could have broader impacts on query processing depending on additional changes at the driver level, specifically the RDBMS ones. Those changes don't have to occur for this to be a useful addition.
     7
     8A principle motivation for this work is to support OGC filter expressions in a single pass in a driver-independent manner.
     9
     10All of the work detailed here is being prototyped in a sandbox, visit:
     11
     12  [http://svn.osgeo.org/mapserver/sandbox/sdlime/common-expressions/mapserver]
     13
     14== Existing Expression Parsing ==
     15
     16The existing logical expression handling in MapServer works like so:
     17
     18  1) duplicate expression string[[BR]]
     19  2) substitute shape attributes into string (e.g. '[name]' => 'Anoka')[[BR]]
     20  3) parse with yyparse()
     21
     22The parser internally calls yylex() for its tokens. Tokens are the smallest pieces of an expression.
     23
     24'''Advantages'''
     25
     26  - it's simple and it works
     27
     28'''Disadvantages'''
     29
     30  - limited by substitution to strings, no complex types can be handled
     31  - have to perform the substitution and tokenize the resulting string for every feature
     32
     33== Proposed Technical Changes ==
     34
     35This RFC proposes a number of technical changes. The core change, however, involved updating the way logical expressions work. Additional features capitalize on this core change to bring additional capabilities to MapServer.
     36
     37'''Core Parser Update'''
     38
     39I propose moving to a setup where a logical expression is tokenized once (via our Flex-generated lexer) and then Bison/Yacc parser works through tokens (via a custom version of yylex() defined in mapparser.y) as necessary for each feature. This eliminates the substitution and tokenize steps currently necessary and opens up possibilities for supporting more complex objects in expressions. Basically we'd hang a list/array of tokens off an expressionObj, populate it in msLayerWhichItems() and leverage the tokens as needed in the parser. The following new structs and enums are added to mapserver.h:
     40
     41{{{
     42enum MS_TOKEN_LOGICAL_ENUM { MS_TOKEN_LOGICAL_AND=100, MS_TOKEN_LOGICAL_OR, MS_TOKEN_LOGICAL_NOT };
     43enum MS_TOKEN_LITERAL_ENUM { MS_TOKEN_LITERAL_NUMBER=110, MS_TOKEN_LITERAL_STRING, MS_TOKEN_LITERAL_TIME, MS_TOKEN_LITERAL_SHAPE };
     44enum MS_TOKEN_COMPARISON_ENUM {
     45  MS_TOKEN_COMPARISON_EQ=120, MS_TOKEN_COMPARISON_NE, MS_TOKEN_COMPARISON_GT, MS_TOKEN_COMPARISON_LT, MS_TOKEN_COMPARISON_LE, MS_TOKEN_COMPARISON_GE, MS_TOKEN_COMPARISON_IEQ,
     46  MS_TOKEN_COMPARISON_RE, MS_TOKEN_COMPARISON_IRE,
     47  MS_TOKEN_COMPARISON_IN, MS_TOKEN_COMPARISON_LIKE,
     48  MS_TOKEN_COMPARISON_INTERSECTS, MS_TOKEN_COMPARISON_DISJOINT, MS_TOKEN_COMPARISON_TOUCHES, MS_TOKEN_COMPARISON_OVERLAPS, MS_TOKEN_COMPARISON_CROSSES, MS_TOKEN_COMPARISON_WITHIN, MS_TOKEN_COMPARISON_CONTAINS,
     49  MS_TOKEN_COMPARISON_BEYOND, MS_TOKEN_COMPARISON_DWITHIN
     50};
     51enum MS_TOKEN_FUNCTION_ENUM { MS_TOKEN_FUNCTION_LENGTH=140, MS_TOKEN_FUNCTION_TOSTRING, MS_TOKEN_FUNCTION_COMMIFY, MS_TOKEN_FUNCTION_AREA, MS_TOKEN_FUNCTION_ROUND, MS_TOKEN_FUNCTION_FROMTEXT };
     52enum MS_TOKEN_BINDING_ENUM { MS_TOKEN_BINDING_DOUBLE=150, MS_TOKEN_BINDING_INTEGER, MS_TOKEN_BINDING_STRING, MS_TOKEN_BINDING_TIME, MS_TOKEN_BINDING_SHAPE };
     53
     54typedef union {
     55  double dblval;
     56  int intval;
     57  char *strval;
     58  struct tm tmval;
     59  shapeObj *shpval;
     60  attributeBindingObj bindval;
     61} tokenValueObj;
     62
     63typedef struct {
     64  int token;
     65  tokenValueObj tokenval;
     66} tokenObj;
     67}}}
     68
     69Some of these definitions hint at other features that will be detailed later. When we convert an expression string into a series of tokens we also store away the value associated with that token (if necessary). In many cases the token value is a literal (string or number), in other cases its a reference to a feature attribute. In the latter case we use the attributeBindingObj already in use by MapServer to encapsulate the information necessary to quickly access the correct data (typically an item index value).
     70
     71We always have had to make expression data available to the parser (and lexer) via temporary global variables. That would continue to be the case here, although different data are shared in this context. One thing to note is that once an expression is tokenized we no longer have to rely on the flex-generated lexer so, in theory, it should be easier to implement a thread-safe parser should we choose to do so.
     72
     73'''Extending the Yacc grammar to support spatial operators'''
     74
     75The mapserver.h definitions above allow for using shapeObj's within the Yacc grammar (we also define a shapeObj as a new base token type within mapparser.y). There are two types of shape-related tokens: 1) a shape binding, that is, a reference to the geometry being evaluated and 2) shape literals, shapes described as WKT within the expression string. For example:
     76
     77{{{
     78In the expression:
     79  EXPRESSION (fromText('POINT(500000 5000000)') Intersects [shape])
     80
     81  1) fromText('POINT(500000 5000000)') defines a shape literal (the WKT to shapeObj conversion is done only once)
     82  2) [shape] is a shape binding
     83}}}
     84
     85We can use these tokens in the grammar to implement all of the MapServer supported (via GEOS) logical operators. Note that in the above example fromText() appears as a function operating on a string. This is handled as a special case when tokenizing the string since we only want to do this once. So we create a shape literal based on the enclosed WKT string at this time.
     86
     87'''Extending the grammar to support spatial functions'''
     88
     89By supporting the use of more complex objects we can support functions on those objects. We could write:
     90
     91{{{
     92  EXPRESSION (area([shape]) < 100000)
     93}}}
     94
     95or rely on even more of the GEOS operators. (Note: only the area function is present in the sandbox.) To do this we need to somehow store a shapeObj's scope so that working copies can be free'd as appropriate. I would propose adding a ''int scope'' to the shapeObj structure definition. Shapes created in the course of expression evaluation would be tagged as having limited scope while literals or bindings would be left alone and presumably destroyed later. This saves having to make copies of shapes which can be expensive.
     96
     97'''Context Sensitive Parsing '''
     98
     99We could have done this all along but this would be an opportune time to implement context sensitive parser use. Presently we expect the parser to produce a true or false result, but certainly aren't limited to that. The idea is to use the parser to compute values in other situations. Two places are working in the sandbox and are detailed below.
     100
     101''Class TEXT Parameter''
     102
     103Currently you can write:
     104
     105{{{
     106CLASS
     107  ...
     108  TEXT ([area] acres)
     109END
     110}}}
     111
     112It looks as if the TEXT value is an expression (and it is stored as such) but it's not evaluated as one. It would be very useful to treat this as a true expression. This would open up a world of formatting options for drivers that don't support it otherwise (e.g. shapefiles). Ticket [http://trac.osgeo.org/mapserver/ticket/2950 2950] is an example where this would come in handy. Within the sandbox I've added ''toString'', ''round'' and ''commify'' functions so that you can write:
     113
     114{{{
     115TEXT (commify(toString([area]*0.000247105381,"%.2f")) + ' ac')
     116}}}
     117
     118Which converts area from sq. meters to acres, truncates the result to two decimal places adds commas (213234.123455 => 213,234.12) for crowd pleasing display. To add this support, in addition to grammar changes, we add these declarations to mapserver.h:
     119
     120{{{
     121enum MS_PARSE_RESULT_TYPE_ENUM { MS_PARSE_RESULT_BOOLEAN, MS_PARSE_RESULT_STRING, MS_PARSE_RESULT_SHAPE };
     122
     123typedef union {
     124  int intval;
     125  char *strval;
     126  shapeObj *shpval;
     127} parseResultObj;
     128}}}
     129
     130Then in the grammar we set a parse result type and set the result accordingly. One side effect is that we have to define a standard way to convert numbers to strings when in the string context and simply using "%g" as a format string for snprintf does wonders to output.
     131
     132''Style GEOMTRANSFORM Parameter''
     133
     134Within the sandbox I've implemented GEOMTRANSFORMs as expressions as opposed to the original implementation. The parser has also been extended to support the GEOS buffer operator. So you can write:
     135
     136{{{
     137STYLE
     138  GEOMTRANSFORM (buffer([shape], -3))
     139  ...
     140END
     141}}}
     142
     143This does executes a buffer on the geometry before rendering (see test.buffer.png attachment). Because the GEOMTRANSFORM processing occurs this transformation happens '''after''' the feature is converted from map to image coordinates, but the effect is still valuable and the buffer value is given in pixels. In the future we might consider implementing a GEOMTRANSFORM at the layer level so that the transformation is available to all classes and/or styles (and consequently in query modes too).
     144
     145== Expression Use Elsewhere ==
     146
     147Currently the logical expression syntax is also used with REQUIRES/LABELREQUIRES and with rasters. In the REQUIRES/LABELREQUIRES case the code would remain mostly "as is" we'd still do the substitutions bases on layer status, then explicitly tokenize and parse. Since this done at most once per layer there's really no need to do anything more.
     148
     149Rasters present more of a challenge. We'd need to handle them as a special case when tokenizing an expression by defining pixel bindings and then pass a pixel to the parser when evaluating the expression. This should be '''much, much''' faster than the current method where each pixel value is converted to a string representation (and then back to a number), especially given the number of pixels often evaluated. Some interesting drawing effects are also possible if you could expose a pixel location to the GEOS operators. For example, one could create a mask showing only pixels within a particular geometry.
     150
     151== Query Impact ==
     152
     153This is where things get interesting. I'm proposing adding a new query, msQueryByFilter() that would work off an expression string (this is working in the sandbox). The expression string would still have to be accompanied by an extent parameter. Drivers like shapefiles still need to first apply a bounding box before applying a secondary filter. Other drivers could choose to combine the extent and expression string (more likely the tokens) if they so choose. msQueryByFilter() works much like msQueryByAttributes(). The layer API has been extended to include a prototype, msLayerSupportsCommonFilters(), that allows the driver to say if it could process this common expression format natively somehow (e.g. via a FILTER and msLayerWhichShapes()/ msLayerNextShape()) or if the expression would need to be applied after msLayerNextShape() is called repeatedly. The shapefile and tiled shapefile drivers would work natively, as would any driver that uses msEvalExpression(). My hope is that the RDBMS drivers could somehow translate (via the tokens) an expression into their native SQL but if not, we would still be able to use those sources.
     154
     155== Backwards Compatibility Issues ==
     156
     157Surprisingly few. The parser changes would all be transparent to the user. Truly handling TEXT expressions as expressions is a regression, albeit a positive one IMHO. I would also propose a few expression level changes especially around case-insensitive string and regex comparisons within logical expression. I think it makes more sense and is more user friendly to simply define case-insensitive operators (e.g. EQ and IEQ for straight string equality, and ~ and ~* for regex (modeled after Postgres).
     158
     159The additional operators, functions and parsing contexts are new functionality.
     160
     161A great deal of code would be made obsolete if this were pursued. Much of the OWS filter evaluation would be handled by the msQueryByFilter() function and numerous associated enums, defines, etc... could go away.
     162
     163Grammar summary ('''bold means new'''):
     164
     165Logical operators: AND, OR, NOT[[BR]]
     166Comparison operators: =, '''=*''', !=, >, <, >=, <=, ~, '''~*''', in[[BR]]
     167Spatial comparison operators: '''intersects''', '''disjoint''', '''touches''', '''overlaps''', '''crosses''', '''within''', '''contains''', '''beyond''', '''dwithin'''[[BR]]
     168Functions: length, '''commify''', '''round''', '''tostring'''[[BR]]
     169Spatial functions: '''fromtext''', '''area''', '''distance''', '''buffer'''
     170
     171== Security Issues ==
     172
     173While the bulk of the work is in the bowels of MapServer any change of this magnitude could have unintended consequences. In this case I think the largest risks are buffer overflows associated with string operators in the parser and memory leaks. Care would need to be taken in developing a comprehensive set of test cases.
     174
     175== Todo's ==
     176
     177  1) The ''IN'' operator is in dire need of optimization.[[BR]]
     178  2) All OGC filter operations need to be supported. Bounding box filters in particular have not been looked at.[[BR]]
     179  3) Need ''LIKE'' operator code. (e.g. Dr. Dobbs, 9/08, ''Matching Wildcards: An Algorithm'', pp. 37-39)[[BR]]
     180  4) How to handle layer tolerances in msQueryByFilter()?[[BR]]
     181  5) Best way to manage tokens, array, list, tree? Bison/Yacc needs array or list, but both Frank and Paul have referred to trees.[[BR]]
     182  6) Thread safety...[[BR]]
     183  7) Parser error handling, any errors have basically always been silently ignored.