#1381 closed defect (fixed)
Support for case-insensitive searches
Reported by: | dmorissette | Owned by: | dmorissette |
---|---|---|---|
Priority: | high | Milestone: | 4.8 release |
Component: | MapServer C Library | Version: | 4.6 |
Severity: | normal | Keywords: | |
Cc: | sgillies@…, bartvde@…, mathieuparent@…, BEN@… |
Description
Bart asked us to implement support for case-insensitive searches, this would be used primarily via the PropertyIsLike operator in WFS, but should also work for MapServer in general. The intial idea that was explored was the possibility of using a PROCESSING directive to tell MapServer to switch to case-insensitive expression matching, but this was not practical in the end. Here is an excerpt from an exchange I had with Bart on this and describing our options: ----- The main issue is that the code that deals with all 3 types of expressions is isolated inside msEvalExpression() in maputil.c, but this code has no reference to the layer or no easy way to fetch the value of a processing directive, ruling out the possibility of using a processing directive shortcut. Anyway, a processing directive may not be the best way to go for general MapServer use because inside a given layer, you may sometimes want some expressions to be case-sensitive and others case-insensitive. So you really need control at the expression-level for general MapServer use, including a proper way to set the case-sensitivity flag at the expression level in a mapfile. Then we'd also need some flags that would apply to specific entry points such as WFS PropertyIsLike (likely a layer-level metadata) and MapScript's queryByAttribute (likely a new arg to the function). Those flags would tell those entry points that they need to generate case-insensitive expressions that will then be treated as such by msEvalExpression(). I'm not clear yet on how we would encode the case-insensitivity at the mapfile level, we could possibly add a suffix to the EXPRESSION token in the mapfile if the parser supports that, e.g. For constants: EXPRESSION "someconstant"i For regex: EXPRESSION /someregex/i For Logical: EXPRESSION ([attr1] ~= 'somestring') Or we could try to turn EXPRESSION into a mapfile object (it's already an object internally), e.g. CLASS EXPRESSION "somestring" CASESENSITIVE FALSE END ... END The second may be the preferred option, assuming it is possible to continue to support the old EXPRESSION syntax for backwards compatibility. We'd have to see. We'd also need to consult with Steve and the other developers to get their opinions on the best way to handle this.
Change History (22)
comment:2 by , 19 years ago
Milestone: | → 4.8 release |
---|
comment:3 by , 19 years ago
For logical expressions this would be easy, just add another operator (the ~= is already used for regex). I'd look to SQL to see how they handle this. Lemme think on it. Steve
comment:4 by , 19 years ago
Cc: | added |
---|
With database systems it seems that it depends on the sort order defined. This is a database setting most of the time. Ofcourse you could handle this yourself by adding UPPER or LOWER functions to the comparisons, but I guess this could be costly on performance. SQL server seems to perform case-insensitive searches by default. http://vyaskn.tripod.com/case_sensitive_search_in_sql_server.htm For Oracle there is also a setting (_CI): http://www.oracle-training.cc/cbo_194.htm
comment:5 by , 19 years ago
Cc: | added |
---|
Added Paul to the cc to get his opinion on the best way to achieve SQL case-insensitive searches for databases.
comment:6 by , 19 years ago
Cc: | added |
---|---|
op_sys: | Linux → All |
rep_platform: | PC → All |
comment:7 by , 19 years ago
For SQL, you can use "ILIKE" and it should be just fine. However, since all the WFS processing is done in EXPRESSION blocks, it is irrelevant: mapserver does not currently push anything except bounding box processing into the backend.
comment:8 by , 19 years ago
Status: | new → assigned |
---|
Steve, and all, it is easy to create a new case-insensitive operator for logical expressions, but that doesn't help for constant and regex expressions. You know the mapfile parser better than I do. Which one of the following option would work best, or do you have any better suggestion? For constants: EXPRESSION "someconstant"i For regex: EXPRESSION /someregex/i For Logical: EXPRESSION ([attr1] ILIKE 'somestring') Or we could try to turn EXPRESSION into a mapfile object (it's already an object internally), e.g. CLASS EXPRESSION "somestring" CASESENSITIVE FALSE END ... END
comment:9 by , 19 years ago
The former would be far less disruptive from a mapfile parsing perspective. We could update the lexer to recognize a new type of string or regex, basically a new pattern identical to what's there but with the trailing 'i'. I'm not as sure if this is a huge deal with MapScript or URL-based configs. It's not for regex that's for sure, but with simple strings the quotes are required. I suppose this would work: $var = "'foo'i"; $class->setExpression($var); Steve
comment:11 by , 19 years ago
Seem feasible. Proposition of implementation: In the parser, I will add 2 new return values, MS_IREGEX and MS_ISTRING. If they are return to the loadExpression function, the type will be reset to MS_STRING or MS_REGEX and I will add a new integer value to the expressionObj to hold the "case-sensitive" value. The new parameter will be exp->flag. It will be an integer on which we will use boolean operator to see if some flag are set. For now only case-insensitive will be available, but this way allow addition for the future. For the logical expression, the new operator will work exactly like the EQ operator, but will work ignore the case of characters in string / string comparison. I propose to use IEQ instead of ILIKE. ILIKE imply to be able to use wildcards which is not available. This will gives us For constants: EXPRESSION "someconstant"i For regex: EXPRESSION /someregex/i For Logical: EXPRESSION ([attr1] IEQ 'somestring') Any objections/suggestions/amendments?
comment:12 by , 19 years ago
Let's call the new expressionObj member 'flags' (with a 's'), and the value for case-insensitive would be: /* expressionObj->flags values (Can be OR'ed) */ #define MS_EXPR_CASEINSENSITIVE 1
comment:13 by , 19 years ago
I wondering if for logical expressions we should instead implement some string functions like uc or lc rather than a new operator. You'd do: EXPRESSION (uc('[attr1]') EQ uc('somestring')) Those functions could have additional uses then particularly in labeling... Steve
comment:14 by , 19 years ago
IMHO, uc() and lc() could be very useful, but the IEQ too. So I would keep IEQ anyway. I think there will be some performance improvement in using IEQ instead of uc(). With IEQ, we will only use strcasecmp() instead of strcmp(uc(),uc()), it may make a difference. And if there's a difference, with large datafiles it can worth to use IEQ. However, I did not test this, so I may be wrong. Wadayathink? PS: substring() could also be useful like uc() and lc(). :)
comment:15 by , 19 years ago
I just hate to add yet another operator that's all. Most languages with expression support do not provide an equivalent of IEQ instead providing something like uc. We have no alternative with straight string or regex comparisons. I'd be curious on the speed issue... I agree on the substring function and would like to add a number of others like ucfirst, and a replace-like function. But that's another bug... Steve
comment:16 by , 19 years ago
I would tend to agree with Julien that strcasecmp() would be faster than strcmp(uc(),uc()), he will make some tests to verify how important the difference is. I also find that IEQ makes the expressions much more compact and readable, plus there has to be a difference in processing cost between parsing "(uc('[attr1]') EQ uc('somestring'))" and "([attr1] IEQ 'somestring')"
comment:17 by , 19 years ago
I made a very quick test to see if strcasecmp is faster or not than the uc() method. It seems more than 2 times faster. For 1000000 comparison, with always the same strings ("AbCdEfGhI" and "abcdefghi"), it takes around 0.241 seconds to run the upper casing routine and 0.034 seconds with strcasecmp (2.232 and 0.847 for 10000000 comparisons). Note that one of the uppercasing was outside the loop which is not necessary the case in mapserver. So it's not a real life test. The two strings that match is the worst case one. Also, there may be some caching involved with the processor that helps the strcasecmp function since we do always the exact same operation in a row. And I think that in MapServer environment, we rarelly, I hope, compare 10M strings with logical expressions. So Some may not care about this kind of speed difference. But still, it's faster. :) Why don't you like to add new operators? Does it slow the parser or something?
comment:18 by , 19 years ago
Here's the code I used to test (main, include and variable declaration striped): strcmp.c: char *s1=NULL; s1 = (char*)strdup("AbCdEfGhI"); char *s2=NULL; s2 = (char*)strdup("abcdefghi"); j=0; while(s2[j]!='\0') { s2[j] = toupper(s2[j]); j++; } for(i=0; i<1000000; i++) { j=0; while(s1[j]!='\0') { s1[j] = toupper(s1[j]); j++; } res = strcmp(s1, s2); } strcasecmp.c: char *s1=NULL; s1 = (char*)strdup("AbCdEfGhI"); char *s2=NULL; s2 = (char*)strdup("abcdefghi"); for(i=0; i<1000000; i++) { res = strcmp(s1, s2); }
comment:19 by , 19 years ago
Cc: | added |
---|
comment:21 by , 19 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
I commited my changes to the parser fr ""i, //i and IEQ. With the modification to the parser, the change work in PHP/MapScript without any modification. Sean, can you test this in SWIG mapscript too? Marking as FIXED I opened bug 1416 for the PropertyIsLike in WFS. This one is already quite long.
comment:22 by , 19 years ago
blocked: | → 1416 |
---|
Note:
See TracTickets
for help on using tickets.