Opened 19 years ago

Closed 19 years ago

Last modified 19 years ago

#1381 closed defect (fixed)

Support for case-insensitive searches

Reported by: dmorissette Owned by: dmorissette
Priority: high Milestone: 4.8 release
Component: MapServer C Library Version: 4.6
Severity: normal Keywords:
Cc: sgillies@…, bartvde@…, mathieuparent@…, BEN@…

Description

Bart asked us to implement support for case-insensitive searches, this would be
used primarily via the PropertyIsLike operator in WFS, but should also work for
MapServer in general.

The intial idea that was explored was the possibility of using a PROCESSING
directive to tell MapServer to switch to case-insensitive expression matching,
but this was not practical in the end. Here is an excerpt from an exchange I had
with Bart on this and describing our options:

-----

The main issue is that the code that deals with all 3 types of 
expressions is isolated inside msEvalExpression() in maputil.c, but this 
code has no reference to the layer or no easy way to fetch the value of 
a processing directive, ruling out the possibility of using a processing 
directive shortcut. Anyway, a processing directive may not be the best 
way to go for general MapServer use because inside a given layer, you 
may sometimes want some expressions to be case-sensitive and others 
case-insensitive. So you really need control at the expression-level for 
general MapServer use, including a proper way to set the 
case-sensitivity flag at the expression level in a mapfile. Then we'd 
also need some flags that would apply to specific entry points such as 
WFS PropertyIsLike (likely a layer-level metadata) and MapScript's 
queryByAttribute (likely a new arg to the function). Those flags would 
tell those entry points that they need to generate case-insensitive 
expressions that will then be treated as such by msEvalExpression().

I'm not clear yet on how we would encode the case-insensitivity at the 
mapfile level, we could possibly add a suffix to the EXPRESSION token in 
the mapfile if the parser supports that, e.g.

For constants:
  EXPRESSION "someconstant"i

For regex:
  EXPRESSION /someregex/i

For Logical:
  EXPRESSION ([attr1] ~= 'somestring')

Or we could try to turn EXPRESSION into a mapfile object (it's already 
an object internally), e.g.

  CLASS
    EXPRESSION
      "somestring"
      CASESENSITIVE FALSE
    END
    ...
  END


The second may be the preferred option, assuming it is possible to 
continue to support the old EXPRESSION syntax for backwards 
compatibility. We'd have to see. We'd also need to consult with Steve 
and the other developers to get their opinions on the best way to handle 
this.

Change History (22)

comment:1 by dmorissette, 19 years ago

Steve, do you have any suggestion or preference for the way this should be handled?

comment:2 by dmorissette, 19 years ago

Milestone: 4.8 release

comment:3 by sdlime, 19 years ago

For logical expressions this would be easy, just add another operator (the ~= is
already used for regex). I'd look to SQL to see how they handle this. Lemme
think on it.

Steve

comment:4 by BEN@…, 19 years ago

Cc: BEN@… added
With database systems it seems that it depends on the sort order defined. This
is a database setting most of the time. Ofcourse you could handle this yourself
by adding UPPER or LOWER functions to the comparisons, but I guess this could be
costly on performance.

SQL server seems to perform case-insensitive searches by default.

http://vyaskn.tripod.com/case_sensitive_search_in_sql_server.htm

For Oracle there is also a setting (_CI):

http://www.oracle-training.cc/cbo_194.htm

comment:5 by BEN@…, 19 years ago

Cc: pramsey@… added
Added Paul to the cc to get his opinion on the best way to achieve SQL
case-insensitive searches for databases.

comment:6 by mathieuparent@…, 19 years ago

Cc: mathieuparent@… added
op_sys: LinuxAll
rep_platform: PCAll

comment:7 by refractions, 19 years ago

For SQL, you can use "ILIKE" and it should be just fine. However, since all the
WFS processing is done in EXPRESSION blocks, it is irrelevant: mapserver does
not currently push anything except bounding box processing into the backend. 

comment:8 by dmorissette, 19 years ago

Status: newassigned
Steve, and all, it is easy to create a new case-insensitive operator for logical
expressions, but that doesn't help for constant and regex expressions. You know
the mapfile parser better than I do. Which one of the following option would
work best, or do you have any better suggestion?

For constants:
  EXPRESSION "someconstant"i

For regex:
  EXPRESSION /someregex/i

For Logical:
  EXPRESSION ([attr1] ILIKE 'somestring')

Or we could try to turn EXPRESSION into a mapfile object (it's already 
an object internally), e.g.

  CLASS
    EXPRESSION
      "somestring"
      CASESENSITIVE FALSE
    END
    ...
  END


comment:9 by sdlime, 19 years ago

The former would be far less disruptive from a mapfile parsing perspective. We 
could update the lexer to recognize a new type of string or regex, basically a 
new pattern identical to what's there but with the trailing 'i'. I'm not as 
sure if this is a huge deal with MapScript or URL-based configs. It's not for 
regex that's for sure, but with simple strings the quotes are required. I 
suppose this would work:

  $var = "'foo'i";
  $class->setExpression($var);

Steve

comment:10 by dmorissette, 19 years ago

Sounds good. I'll give that a try then. Thanks for your input.

comment:11 by jlacroix, 19 years ago

Seem feasible.

Proposition of implementation:
In the parser, I will add 2 new return values, MS_IREGEX and MS_ISTRING. If they
are return to the loadExpression function, the type will be reset to MS_STRING
or MS_REGEX and I will add a new integer value to the expressionObj to hold the
"case-sensitive" value.

The new parameter will be exp->flag. It will be an integer on which we will use
boolean operator to see if some flag are set. For now only case-insensitive will
be available, but this way allow addition for the future.

For the logical expression, the new operator will work exactly like the EQ
operator, but will work ignore the case of characters in string / string
comparison. I propose to use IEQ instead of ILIKE. ILIKE imply to be able to use
wildcards which is not available.

This will gives us
For constants:
  EXPRESSION "someconstant"i

For regex:
  EXPRESSION /someregex/i

For Logical:
  EXPRESSION ([attr1] IEQ 'somestring')

Any objections/suggestions/amendments?

comment:12 by dmorissette, 19 years ago

Let's call the new expressionObj member 'flags' (with a 's'), and the value for
case-insensitive would be:

/* expressionObj->flags values (Can be OR'ed) */
#define MS_EXPR_CASEINSENSITIVE  1

comment:13 by sdlime, 19 years ago

I wondering if for logical expressions we should instead implement some string 
functions like uc or lc rather than a new operator. You'd do:

  EXPRESSION (uc('[attr1]') EQ uc('somestring'))

Those functions could have additional uses then particularly in labeling...

Steve

comment:14 by jlacroix, 19 years ago

IMHO, uc() and lc() could be very useful, but the IEQ too. So I would keep IEQ
anyway.

I think there will be some performance improvement in using IEQ instead of uc().
With IEQ, we will only use strcasecmp() instead of strcmp(uc(),uc()), it may
make a difference. And if there's a difference, with large datafiles it can
worth to use IEQ.  However, I did not test this, so I may be wrong.

Wadayathink?

PS: substring() could also be useful like uc() and lc(). :)

comment:15 by sdlime, 19 years ago

I just hate to add yet another operator that's all. Most languages with 
expression support do not provide an equivalent of IEQ instead providing 
something like uc. We have no alternative with straight string or regex 
comparisons. I'd be curious on the speed issue...

I agree on the substring function and would like to add a number of others like 
ucfirst, and a replace-like function. But that's another bug...

Steve

comment:16 by dmorissette, 19 years ago

I would tend to agree with Julien that strcasecmp() would be faster than
strcmp(uc(),uc()), he will make some tests to verify how important the
difference is.

I also find that IEQ makes the expressions much more compact and readable, plus
there has to be a difference in processing cost between parsing "(uc('[attr1]')
EQ uc('somestring'))"  and  "([attr1] IEQ 'somestring')"


comment:17 by jlacroix, 19 years ago

I made a very quick test to see if strcasecmp is faster or not than the uc()
method. It seems more than 2 times faster. For 1000000 comparison, with always
the same strings ("AbCdEfGhI" and "abcdefghi"), it takes around 0.241 seconds to
run the upper casing routine and 0.034 seconds with strcasecmp (2.232 and 0.847
for 10000000 comparisons).

Note that one of the uppercasing was outside the loop which is not necessary the
case in mapserver. So it's not a real life test. The two strings that match is
the worst case one. Also, there may be some caching involved with the processor
that helps the strcasecmp function since we do always the exact same operation
in a row. And I think that in MapServer environment, we rarelly, I hope, compare
10M strings with logical expressions. So Some may not care about this kind of
speed difference.

But still, it's faster. :)

Why don't you like to add new operators? Does it slow the parser or something?

comment:18 by jlacroix, 19 years ago

Here's the code I used to test (main, include and variable declaration striped):

strcmp.c:

    char *s1=NULL;
    s1 = (char*)strdup("AbCdEfGhI");
    char *s2=NULL;
    s2 = (char*)strdup("abcdefghi");

    j=0;
    while(s2[j]!='\0')
    {
        s2[j] = toupper(s2[j]);
        j++;
    }
    for(i=0; i<1000000; i++)
    {
        j=0;
        while(s1[j]!='\0')
        {
            s1[j] = toupper(s1[j]);
            j++;
        }
        res = strcmp(s1, s2);
    }

strcasecmp.c:

    char *s1=NULL;
    s1 = (char*)strdup("AbCdEfGhI");
    char *s2=NULL;
    s2 = (char*)strdup("abcdefghi");

    for(i=0; i<1000000; i++)
    {
        res = strcmp(s1, s2);
    }

comment:19 by sgillies@…, 19 years ago

Cc: sgillies@… added

comment:20 by sdlime, 19 years ago

Ok, I'm convinced. Go ahead with the new operator...

Steve

comment:21 by jlacroix, 19 years ago

Resolution: fixed
Status: assignedclosed
I commited my changes to the parser fr ""i, //i and IEQ. With the modification
to the parser, the change work in PHP/MapScript without any modification.

Sean, can you test this in SWIG mapscript too?

Marking as FIXED
I opened bug 1416 for the PropertyIsLike in WFS. This one is already quite long.

comment:22 by jlacroix, 19 years ago

blocked: 1416
Note: See TracTickets for help on using tickets.