Changes between Initial Version and Version 1 of rfc23_ogr_unicode


Timestamp:
Apr 23, 2008, 11:21:23 AM (16 years ago)
Author:
warmerdam
Comment:

--

  • rfc23_ogr_unicode

     1= RFC 23: Unicode support in OGR =
     2
     3Authors: Frank Warmerdam[[BR]]
     4Contact: warmerdam@pobox.com[[BR]]
     5Status: Development
     6
     7== Summary ==
     8
     9This document proposes preliminary steps towards GDAL/OGR handling strings internally in UTF-8, and supporting conversion between different encodings. 
     10
     11== Main concepts ==
     12
     13GDAL should be modified to support the following three main ideas:
     14
     15 1. The CPLString class will be upgraded to support a variety of encoding conversions, including conversion between representations (i.e. UTF-8 to UCS-2/wchar_t).
     16 2. Character encodings will be identified by iconv() style strings.
     17 3. OFTString/OFTStringList feature attributes in OGR will be treated as being in UTF-8.
     18
     19This RFC specifically does not attempt to address issues of using non-ASCII filenames.  It also does not attempt to define the encoding of other strings used in GDAL/OGR (such as field names, metadata, etc).  These would presumably be addressed in a later RFC building on this one.
     20
     21== CPLString ==
     22
     23The CPLString class will now be assumed to be a potentially multi-byte
     24encoded string, but with no effort within the CPLString to keep track of
     25the encoding it is in.  This is left to the higher level code for now.
     26
     27However, the CPLString is extended with some convenient mechanisms for
     28recoding, and for conversion of UTF-8 to/from "wchar_t" (aka UCS-2).
     29
     30It is stressed that normal initialization of a CPLString from "const char *"
     31does not attempt to do any conversions to/from UTF-8.  This rule is kept,
     32in part to minimize string processing costs for the common case.  When encoding
     33is believed to be an issue the calling code must keep track of it.
     34
     35{{{
     36
     37// Convert the internal string to a new encoding.
     38char* CPLString::recode( const char *pszSrcEncoding, const char *pszDstEncoding );
     39
     40// Set value based on input encoded string with CPLString set to UTF-8.
     41// This is equivalent to normal setting, and then a recode() with a destination
     42// encoding of "UTF-8" and thus is just for convenience.
     43
     44CPLString &CPLString::SetAsUTF8( const char *pszInput, const char *pszEncoding = "" );
     45
     46// Set value based on input wide character string with CPLString set to UTF-8.
     47// This is equivalent to recoding from the given wide character encoding to
     48// "UTF-8" and thus is just for convenience.
     49
     50CPLString &CPLString::SetAsUTF8( const wchar_t *pszInput, const char *pszEncoding = "UCS-2" );
     51
     52// Construct UTF-8 string object from array of wchar_t elements.
     53CPLString::CPLString( const wchar_t *pszInput, const char *pszEncoding = "UCS-2" );
     54
     55// Returns a wchar_t string which becomes the ownership responsibility of
     56// the caller (free with CPLFree()).  It is assumed the CPLString is UTF-8.
     57wchar_t *CPLString::GetAsWChar( const char *pszDstEncoding = "UCS-2" );
     58}}}
     59
     60I have specifically avoided additional constructors or casting operators
     61to avoid any possible overloading ambiguities or complication in maintaining
     62extra state in the CPLString.  Such services can be added in the future based
     63on the above methods if desired.
     64
     65== Encoding Names ==
     66
     67It is proposed that the encoding names will be the same sorts of names used
     68by iconv().  Examples include "UTF-8", "LATIN5", "CP850" and "ISO_8859-1".  It
     69does not appear that these names for encodings are a 1:1 match with C library
     70locale names (like "en_CA.utf8" for instance) which may cause some issues.
     71
     72Some particular names of interest:
     73
     74 * "": The current locale.  Use this when converting from/to the user's locale.
     75 * "UTF-8": Unicode in multi-byte encoding.  Most of the time this will be
     76our internal lingua franca.
     77 * "POSIX": I think this is roughly ASCII (perhaps with some extended characters?). 
     78 * "UCS-2": Two-byte Unicode.  This is a wide character format and only suitable for use with the wchar_t methods.
     79
     80On some systems you can use "iconv --list" to get a list of supported encodings.
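
Because locale names and iconv() encoding names do not match 1:1, code that wants to resolve "" (the current locale) to a concrete encoding name needs some way to ask the C library which character set the locale uses. The following is a minimal sketch of one way to do that on POSIX systems, using nl_langinfo(CODESET). The helper name CurrentLocaleCodeset() is hypothetical and is not part of GDAL or of this proposal, and the returned name is not guaranteed to be one that a given iconv() implementation accepts.

```cpp
#include <clocale>
#include <string>
#include <langinfo.h>   // POSIX only: nl_langinfo() and CODESET

// Hypothetical helper (not a GDAL function): return the character set name
// of the user's locale, for example "UTF-8" or "ISO-8859-1".
std::string CurrentLocaleCodeset()
{
    // Adopt the character type locale from the environment (LANG, LC_ALL,
    // LC_CTYPE).  If this fails the process simply stays in the "C" locale.
    std::setlocale(LC_CTYPE, "");

    // Ask the C library for the codeset name of the active locale.
    const char *pszCodeset = nl_langinfo(CODESET);
    return pszCodeset != nullptr ? std::string(pszCodeset) : std::string();
}
```

The exact spelling of the result varies between C libraries, so a real implementation would probably need a small table of aliases before handing the name to iconv().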
     81
     82== iconv() ==
     83
     84It is proposed to implement the CPLString::recode() method using the iconv()
     85and related functions when available. 
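
As an illustration of what an iconv() based recode could look like, here is a minimal sketch written against std::string rather than CPLString. The function name RecodeWithIconv() and its error handling are assumptions made for this example; they are not the proposed CPLString::recode() signature. The sketch also assumes the POSIX/glibc prototype in which the input buffer argument is a char **; some older iconv implementations declare it as const char ** instead.

```cpp
#include <iconv.h>
#include <cerrno>
#include <stdexcept>
#include <string>

// Hypothetical helper (not the GDAL API): recode osInput from
// pszSrcEncoding to pszDstEncoding.  Throws std::runtime_error on failure.
std::string RecodeWithIconv(const std::string &osInput,
                            const char *pszSrcEncoding,
                            const char *pszDstEncoding)
{
    // Note the argument order of iconv_open(): destination first, then source.
    iconv_t hConv = iconv_open(pszDstEncoding, pszSrcEncoding);
    if (hConv == (iconv_t)-1)
        throw std::runtime_error("iconv_open() failed: unsupported conversion");

    std::string osOutput;
    char szBuf[256];
    // iconv() takes a non-const pointer but does not modify the input bytes.
    char *pszIn = const_cast<char *>(osInput.data());
    size_t nInLeft = osInput.size();

    while (nInLeft > 0)
    {
        char *pszOut = szBuf;
        size_t nOutLeft = sizeof(szBuf);
        const size_t nRet = iconv(hConv, &pszIn, &nInLeft, &pszOut, &nOutLeft);
        const int nErr = errno;   // save before any other call can change it
        osOutput.append(szBuf, sizeof(szBuf) - nOutLeft);

        // E2BIG only means the output buffer filled up, so loop again.
        // Anything else (EILSEQ, EINVAL) is invalid or truncated input.
        if (nRet == (size_t)-1 && nErr != E2BIG)
        {
            iconv_close(hConv);
            throw std::runtime_error("iconv() failed: invalid or incomplete input");
        }
    }

    // Flush any pending shift sequence for stateful destination encodings.
    char *pszOut = szBuf;
    size_t nOutLeft = sizeof(szBuf);
    iconv(hConv, nullptr, nullptr, &pszOut, &nOutLeft);
    osOutput.append(szBuf, sizeof(szBuf) - nOutLeft);

    iconv_close(hConv);
    return osOutput;
}
```

For example, RecodeWithIconv("caf\xE9", "ISO-8859-1", "UTF-8") turns the single Latin-1 byte 0xE9 into the two UTF-8 bytes 0xC3 0xA9. When building against a separate libiconv rather than a C library that provides iconv() itself, the program typically also has to be linked with -liconv.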
     86
     87There is an excellent implementation of this API in GNU libiconv, and the
     88GNU C library used on Linux provides iconv() itself.  Also some operating systems provide the
     89iconv() API as part of the C library (all unix?); however, the system iconv()
     90often has a restricted set of conversions supported so it may be desirable
     91to use libiconv in preference to the system iconv() even when it is available.
     92
     93If iconv() is not available, a stub implementation of the CPLString services
     94will be provided which:
     95
     96 * Implements UCS-2 / UTF-8 interconversion using either mbtowc/wctomb, or an implementation derived from http://www.cl.cam.ac.uk/~mgk25/unicode.html.
     97 * Implements recoding from "" to and from "UTF-8" by doing nothing, but issuing a warning on the first use if the current locale does not appear to be the "C" locale.
     98 * Implements recoding from "ASCII" to "UTF-8" as a null operation.
     99 * Implements recoding from "UTF-8" to "ASCII" by turning all non-ASCII multi-byte characters to '?'.
     100
     101This hopefully gives us a weak operational status when built without iconv(), but full operation when it is available.
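
Two of the stub behaviours listed above are simple enough to sketch without any library support. The following is an illustrative sketch only, not the stub that will ship with GDAL: the function names are hypothetical, std::u16string stands in for a two-byte wchar_t buffer (wchar_t is four bytes on many Unix systems), and the UCS-2 encoder assumes Basic Multilingual Plane input without surrogate pairs.

```cpp
#include <string>

// Hypothetical helper: degrade UTF-8 to ASCII, replacing each non-ASCII
// multi-byte character with a single '?'.
std::string Utf8ToAsciiLossy(const std::string &osUtf8)
{
    std::string osAscii;
    for (unsigned char ch : osUtf8)
    {
        if (ch < 0x80)
            osAscii += static_cast<char>(ch);  // plain ASCII byte, copy as is
        else if ((ch & 0xC0) != 0x80)
            osAscii += '?';                    // lead byte: one '?' per character
        // Continuation bytes (10xxxxxx) are skipped, so a stray continuation
        // byte in invalid input is silently dropped.
    }
    return osAscii;
}

// Hypothetical helper: encode UCS-2 code units as UTF-8 (one to three bytes
// per code unit).  Surrogates are not combined; BMP input is assumed.
std::string Ucs2ToUtf8(const std::u16string &osUcs2)
{
    std::string osUtf8;
    for (char16_t ch : osUcs2)
    {
        if (ch < 0x80)
        {
            osUtf8 += static_cast<char>(ch);
        }
        else if (ch < 0x800)
        {
            osUtf8 += static_cast<char>(0xC0 | (ch >> 6));
            osUtf8 += static_cast<char>(0x80 | (ch & 0x3F));
        }
        else
        {
            osUtf8 += static_cast<char>(0xE0 | (ch >> 12));
            osUtf8 += static_cast<char>(0x80 | ((ch >> 6) & 0x3F));
            osUtf8 += static_cast<char>(0x80 | (ch & 0x3F));
        }
    }
    return osUtf8;
}
```

With these definitions, Utf8ToAsciiLossy("caf\xC3\xA9") yields "caf?", and Ucs2ToUtf8(u"\u20AC") yields the three bytes 0xE2 0x82 0xAC.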
     102
     103The --with-iconv=<arg> option will be added to configure.  The argument can be
     104the path to a libiconv installation or the special value 'system' indicating
     105that the system lib should be used.  Alternatively, --without-iconv can be used
     106to avoid using iconv.
     107
     108== OFTString/OFTStringList Fields ==
     109
     110It is declared that OGR string attribute values will be in UTF-8.  This means
     111that OGR drivers are responsible for translating format specific representations
     112to UTF-8 when reading, and back to the format specific representation when
     113writing.  In many cases (of simple ASCII text) this requires no transformation.
     114
     115This implies that the arguments to methods like OGRFeature::SetField( int i, const char *) should be UTF-8, and that GetFieldAsString() will return UTF-8. 
     116
     117The same issues apply to OFTStringList lists of strings.  Each string will be assumed to be UTF-8.
     118
     119== OGR Driver Updates ==
     120
     121The following OGR drivers could benefit immediately from recoding to UTF-8
     122support in one way or another.
     123
     124 * ODBC (add support for wchar_t / NVARCHAR fields)
     125 * Shapefile
     126 * GML (I'm not sure how the XML encoding values all map to our concept of encoding)
     127 * Postgres
     128
     129I'm sure a number of the other drivers, particularly the RDBMS drivers, could
     130benefit from an update.
     131
     132== Implementation ==
     133
     134Frank Warmerdam will implement the core iconv() capabilities, the
     135CPLString additions and update the ODBC driver.  Other OGR drivers would be
     136updated by interested developers, as time and demand dictate, to conform to
     137the definitions in this RFC.
     138
     139The core work will be completed for the GDAL/OGR 1.6.0 release.
     140
     141== References ==
     142
     143 * [http://unicode.org/versions/Unicode4.0.0/ch05.pdf The Unicode Standard, Version 4.0 - Implementation Guidelines] - Chapter 5 (PDF)
     144 * FAQ on how to use Unicode in software: http://www.cl.cam.ac.uk/~mgk25/unicode.html
     145 * FLTK implementation of string conversion functions: http://svn.easysw.com/public/fltk/fltk/trunk/src/utf.c
     146 * http://www.easysw.com/~mike/fltk/doc-2.0/html/utf_8h.html
     147 * Ticket #1494 : UTF-8 encoding for GML output.
     148 * Libiconv: http://www.gnu.org/software/libiconv/