Opened 6 years ago

Closed 6 years ago

#3950 closed enhancement (fixed)

[PATCH]: Add CPLRecode() implementation based on iconv() function

Reported by: dron Owned by: dron
Priority: normal Milestone:
Component: default Version: unspecified
Severity: normal Keywords: iconv
Cc: warmerdam, jdenisgiguere

Description

RPC 23: Unicode support in OGR suggests to introduce the extended CPLRecode() implementation which can do various conversions using iconv() function. There is a patch implementing this. I have added a separate module cpl_recode_iconv.cpp doing the conversion and frontend module cpl_recode.cpp which falls back to cpl_recode_stub.cpp if iconv() is not available at compile time.

What is NOT done in this patch:

  1. Unix configure stuff (pretty trivial);
  1. Windows configure stuff (not so trivial);
  1. Workarounds for iconv() differences on various platforms (could be done per further reports).

I am ready to proceed with applying the patch and implementing proper configuration if there will be no critics/objections.

Best regards, Andrey

Attachments (2)

cpl_recode.diff (26.7 KB) - added by dron 6 years ago.
Patch adding the iconv()-based CPLRecode() implementation
gdal_recode_perf.c (1.9 KB) - added by dron 6 years ago.
Benchmark for text encoding functions

Download all attachments as: .zip

Change History (10)

Changed 6 years ago by dron

Patch adding the iconv()-based CPLRecode() implementation

comment:1 Changed 6 years ago by rouault

Andrey,

do you know if there is any performance impact on always using iconv rather than the stub methods for the few cases they already handle ? I mean utf8 <--> iso-8859-1 conversion is already coded : is there any negative (or positive !) performance impact on delegating it to iconv ? We could for example use iconv in the cases where we don't have a sensible implementation like in CPLRecode()

#ifdef HAVE_ICONV
        return CPLRecodeIconv(pszSource, pszSrcEncoding, pszDstEncoding);
#else
/* -------------------------------------------------------------------- */
/*      Everything else is treated as a no-op with a warning.           */
/* -------------------------------------------------------------------- */
    {
        static int bHaveWarned = FALSE;

        if( !bHaveWarned )
        {
            bHaveWarned = 1;
            CPLError( CE_Warning, CPLE_AppDefined, 
                      "Recode from %s to %s not supported, no change applied.", 
                      pszSrcEncoding, pszDstEncoding );
        }
        
        return CPLStrdup(pszSource);
    }
#endif

I somehow remember I started something like that in a working copy but I can't find it anymore.

As far as iconv difference, I remember that on some platforms, you must use libiconv_foo rather than iconv_foo.

comment:2 Changed 6 years ago by dron

Rouault,

That's an interesting question. There is indeed a performance penalty introduced by iconv() function (I am using the glibc's internal implementation now). I have created a simple benchmark (the source code is attached for reference) which reads some text file in memory and does the conversions on that string. I am using GCC manual as a test file which is large enough. There are results of multiple conversions:

$ gcc -g3 -O2 gdal_recode_perf.c -o gdal_recode_perf -lgdal

$ ls -l /tmp/gcc-4.4.info 
-rw-r--r-- 1 dron dron 1938398 May  3  2010 /tmp/gcc-4.4.info

$ ./gdal_recode_perf /tmp/gcc-4.4.info 10000
10000 iterations of CPLRecodeIconv() from UTF-8 to ISO-8859-1 took 119 seconds.
10000 iterations of CPLRecodeStub() from UTF-8 to ISO-8859-1 took 106 seconds.
10000 iterations of CPLRecodeIconv() from ISO-8859-1 to UTF-8 took 103 seconds.
10000 iterations of CPLRecodeStub() from ISO-8859-1 to UTF-8 took 84 seconds.

Subsequent runs produce the same results. So there is a ~10% penalty for UTF-8 to ISO-8859-1 conversion and about 20% for reverse operation. I can go the way you suggested: use existing implementation for appropriate conversions and iconv() for all other ones.

Best regards, Andrey

Changed 6 years ago by dron

Benchmark for text encoding functions

comment:3 Changed 6 years ago by jdenisgiguere

  • Cc jdenisgiguere added

comment:4 follow-ups: Changed 6 years ago by rouault

Andrey,

A few points on r21730:

  • You've removed port/cpl_config_extras.h from AC_CONFIG_HEADER(port/cpl_config.h:port/cpl_config.h.in:port/cpl_config_extras.h) in configure.in. Apparently cpl_config_extras.h was added in r18144 for MacOSX support, so I think there's no reason to remove it.
  • IMHO The new methods xxxIconv or xxxStub have no reason to be CPL_DLL exported. They should remain internal GDAL API
  • I think (untested, only quick code review) that the implementation of CPLRecodeFromWCharIconv() is incorrect. You cannot safely cast the const wchar_t *pwszSource to a const char*. In particularly CPLRecodeIconv() calls strlen() on that pointer, so any character 0 will make it stop. The Stub() implementation does a "while( pwszSource[nSrcLen] != 0 ) nSrcLen++;" to compute the length of a wide char string.

Best regards,

Even

comment:5 in reply to: ↑ 4 Changed 6 years ago by dron

Roualt, thanks for the code review.

Replying to rouault:

  • You've removed port/cpl_config_extras.h from AC_CONFIG_HEADER(port/cpl_config.h:port/cpl_config.h.in:port/cpl_config_extras.h) in configure.in. Apparently cpl_config_extras.h was added in r18144 for MacOSX support, so I think there's no reason to remove it.

Ah, actually it doesn't work with my autoheader. r21735 should restore its inclusion and to be slightly better solution.

  • IMHO The new methods xxxIconv or xxxStub have no reason to be CPL_DLL exported. They should remain internal GDAL API

Agreed. Should be fixed with r21736.

More to come.

Best regards, Andrey

comment:6 in reply to: ↑ 4 Changed 6 years ago by dron

Even,

Replying to rouault:

  • I think (untested, only quick code review) that the implementation of CPLRecodeFromWCharIconv() is incorrect. You cannot safely cast the const wchar_t *pwszSource to a const char*. In particularly CPLRecodeIconv() calls strlen() on that pointer, so any character 0 will make it stop. The Stub() implementation does a "while( pwszSource[nSrcLen] != 0 ) nSrcLen++;" to compute the length of a wide char string.

I think it should be fixed with r21738. We still need a proper test suite though.

Best regards, Andrey

comment:7 Changed 6 years ago by rouault

r21755 /trunk/gdal/ (4 files in 2 dirs): Fix failure on gml_8 with Xerces (#3950)

comment:8 Changed 6 years ago by dron

  • Resolution set to fixed
  • Status changed from new to closed

I have added a simple test case with r21786. Consider this ticked to be closed for now

Note: See TracTickets for help on using tickets.