Opened 13 years ago

Closed 13 years ago

#3950 closed enhancement (fixed)

[PATCH]: Add CPLRecode() implementation based on iconv() function

Reported by: dron Owned by: dron
Priority: normal Milestone:
Component: default Version: unspecified
Severity: normal Keywords: iconv
Cc: warmerdam, jdenisgiguere

Description

RPC 23: Unicode support in OGR suggests to introduce the extended CPLRecode() implementation which can do various conversions using iconv() function. There is a patch implementing this. I have added a separate module cpl_recode_iconv.cpp doing the conversion and frontend module cpl_recode.cpp which falls back to cpl_recode_stub.cpp if iconv() is not available at compile time.

What is NOT done in this patch:

  1. Unix configure stuff (pretty trivial);
  1. Windows configure stuff (not so trivial);
  1. Workarounds for iconv() differences on various platforms (could be done per further reports).

I am ready to proceed with applying the patch and implementing proper configuration if there will be no critics/objections.

Best regards, Andrey

Attachments (2)

cpl_recode.diff (26.7 KB ) - added by dron 13 years ago.
Patch adding the iconv()-based CPLRecode() implementation
gdal_recode_perf.c (1.9 KB ) - added by dron 13 years ago.
Benchmark for text encoding functions

Download all attachments as: .zip

Change History (10)

by dron, 13 years ago

Attachment: cpl_recode.diff added

Patch adding the iconv()-based CPLRecode() implementation

comment:1 by Even Rouault, 13 years ago

Andrey,

do you know if there is any performance impact on always using iconv rather than the stub methods for the few cases they already handle ? I mean utf8 <--> iso-8859-1 conversion is already coded : is there any negative (or positive !) performance impact on delegating it to iconv ? We could for example use iconv in the cases where we don't have a sensible implementation like in CPLRecode()

#ifdef HAVE_ICONV
        return CPLRecodeIconv(pszSource, pszSrcEncoding, pszDstEncoding);
#else
/* -------------------------------------------------------------------- */
/*      Everything else is treated as a no-op with a warning.           */
/* -------------------------------------------------------------------- */
    {
        static int bHaveWarned = FALSE;

        if( !bHaveWarned )
        {
            bHaveWarned = 1;
            CPLError( CE_Warning, CPLE_AppDefined, 
                      "Recode from %s to %s not supported, no change applied.", 
                      pszSrcEncoding, pszDstEncoding );
        }
        
        return CPLStrdup(pszSource);
    }
#endif

I somehow remember I started something like that in a working copy but I can't find it anymore.

As far as iconv difference, I remember that on some platforms, you must use libiconv_foo rather than iconv_foo.

comment:2 by dron, 13 years ago

Rouault,

That's an interesting question. There is indeed a performance penalty introduced by iconv() function (I am using the glibc's internal implementation now). I have created a simple benchmark (the source code is attached for reference) which reads some text file in memory and does the conversions on that string. I am using GCC manual as a test file which is large enough. There are results of multiple conversions:

$ gcc -g3 -O2 gdal_recode_perf.c -o gdal_recode_perf -lgdal

$ ls -l /tmp/gcc-4.4.info 
-rw-r--r-- 1 dron dron 1938398 May  3  2010 /tmp/gcc-4.4.info

$ ./gdal_recode_perf /tmp/gcc-4.4.info 10000
10000 iterations of CPLRecodeIconv() from UTF-8 to ISO-8859-1 took 119 seconds.
10000 iterations of CPLRecodeStub() from UTF-8 to ISO-8859-1 took 106 seconds.
10000 iterations of CPLRecodeIconv() from ISO-8859-1 to UTF-8 took 103 seconds.
10000 iterations of CPLRecodeStub() from ISO-8859-1 to UTF-8 took 84 seconds.

Subsequent runs produce the same results. So there is a ~10% penalty for UTF-8 to ISO-8859-1 conversion and about 20% for reverse operation. I can go the way you suggested: use existing implementation for appropriate conversions and iconv() for all other ones.

Best regards, Andrey

by dron, 13 years ago

Attachment: gdal_recode_perf.c added

Benchmark for text encoding functions

comment:3 by jdenisgiguere, 13 years ago

Cc: jdenisgiguere added

comment:4 by Even Rouault, 13 years ago

Andrey,

A few points on r21730:

  • You've removed port/cpl_config_extras.h from AC_CONFIG_HEADER(port/cpl_config.h:port/cpl_config.h.in:port/cpl_config_extras.h) in configure.in. Apparently cpl_config_extras.h was added in r18144 for MacOSX support, so I think there's no reason to remove it.
  • IMHO The new methods xxxIconv or xxxStub have no reason to be CPL_DLL exported. They should remain internal GDAL API
  • I think (untested, only quick code review) that the implementation of CPLRecodeFromWCharIconv() is incorrect. You cannot safely cast the const wchar_t *pwszSource to a const char*. In particularly CPLRecodeIconv() calls strlen() on that pointer, so any character 0 will make it stop. The Stub() implementation does a "while( pwszSource[nSrcLen] != 0 ) nSrcLen++;" to compute the length of a wide char string.

Best regards,

Even

in reply to:  4 comment:5 by dron, 13 years ago

Roualt, thanks for the code review.

Replying to rouault:

  • You've removed port/cpl_config_extras.h from AC_CONFIG_HEADER(port/cpl_config.h:port/cpl_config.h.in:port/cpl_config_extras.h) in configure.in. Apparently cpl_config_extras.h was added in r18144 for MacOSX support, so I think there's no reason to remove it.

Ah, actually it doesn't work with my autoheader. r21735 should restore its inclusion and to be slightly better solution.

  • IMHO The new methods xxxIconv or xxxStub have no reason to be CPL_DLL exported. They should remain internal GDAL API

Agreed. Should be fixed with r21736.

More to come.

Best regards, Andrey

in reply to:  4 comment:6 by dron, 13 years ago

Even,

Replying to rouault:

  • I think (untested, only quick code review) that the implementation of CPLRecodeFromWCharIconv() is incorrect. You cannot safely cast the const wchar_t *pwszSource to a const char*. In particularly CPLRecodeIconv() calls strlen() on that pointer, so any character 0 will make it stop. The Stub() implementation does a "while( pwszSource[nSrcLen] != 0 ) nSrcLen++;" to compute the length of a wide char string.

I think it should be fixed with r21738. We still need a proper test suite though.

Best regards, Andrey

comment:7 by Even Rouault, 13 years ago

r21755 /trunk/gdal/ (4 files in 2 dirs): Fix failure on gml_8 with Xerces (#3950)

comment:8 by dron, 13 years ago

Resolution: fixed
Status: newclosed

I have added a simple test case with r21786. Consider this ticked to be closed for now

Note: See TracTickets for help on using tickets.