Changes between Initial Version and Version 1 of rfc23_ogr_unicode

Timestamp:: Apr 23, 2008, 11:21:23 AM (16 years ago)
Author:: warmerdam
Comment:: --

+= RFC 23: Unicode support in OGR =
+Authors: Frank Warmerdam[[BR]]
+Contact: warmerdam@pobox.com[[BR]]
+Status: Development
+== Summary ==
+This document proposes preliminary steps towards GDAL/OGR handling strings internally in UTF-8, and supporting conversion between different encodings.
+== Main concepts ==
+GDAL should be modified to support the following three main ideas:
+ * The CPLString class will be upgraded to support a variety of encoding conversions, including conversion between representations (e.g. UTF-8 to UCS-2/wchar_t).
+ * Character encodings will be identified by iconv() style strings.
+ * OFTString/OFTStringList feature attributes in OGR will be treated as being in UTF-8.
+This RFC specifically does not attempt to address issues of using non-ASCII filenames.  It also does not attempt to define the encoding of other strings used in GDAL/OGR (such as field names, metadata, etc).  These would presumably be addressed in a later RFC building on this one.
+== CPLString ==
+The CPLString class will now be assumed to be a potentially multi-byte
+encoded string, but with no effort within the CPLString to keep track of
+the encoding it is in.  This is left to the higher level code for now.
+However, the CPLString is extended with some convenient mechanisms for
+recoding, and for conversion of UTF-8 to/from "wchar_t" (aka UCS-2).
+It is stressed that normal initialization of a CPLString from "const char *"
+does not attempt to do any conversions to/from UTF-8.  This rule is kept,
+in part to minimize string processing costs for the common case.  When encoding
+is believed to be an issue, the calling code must keep track of it.
+{{{
+// Convert the internal string to a new encoding.
+char* CPLString::recode( const char *pszSrcEncoding, const char *pszDstEncoding );
+// Set value based on input encoded string with CPLString set to UTF-8.
+// This is equivalent to normal setting, and then a recode() with a destination
+// encoding of "UTF-8" and thus is just for convenience.
+CPLString &CPLString::SetAsUTF8( const char *pszInput, const char *pszEncoding = "" );
+// Set value based on a wide character input string, converting it from
+// pszEncoding to UTF-8.
+CPLString &CPLString::SetAsUTF8( const wchar_t *pszInput, const char *pszEncoding = "UCS-2" );
+// Construct UTF-8 string object from array of wchar_t elements.
+CPLString::CPLString( const wchar_t *pszInput, const char *pszEncoding = "UCS-2" );
+// Returns a wchar_t string which becomes the ownership responsibility of
+// the caller (free with CPLFree()).  It is assumed the CPLString is UTF-8.
+wchar_t *CPLString::GetAsWChar( const char *pszDstEncoding = "UCS-2" );
+}}}
+I have specifically avoided additional constructors or casting operators
+to avoid any possible overloading ambiguities or complication in maintaining
+extra state in the CPLString.  Such services can be added in the future based
+on the above methods if desired.
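To make the UTF-8 to/from UCS-2 conversion concrete, here is a minimal sketch of the byte-level mapping involved. It is not the proposed CPLString code: the function names `Ucs2ToUtf8` and `Utf8ToUcs2` are hypothetical, the input is held in `uint16_t` units rather than `wchar_t` (whose width varies by platform), and the decoder assumes well-formed input.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not part of CPLString): encode UCS-2 code units,
// i.e. Basic Multilingual Plane code points, as UTF-8.
std::string Ucs2ToUtf8(const std::vector<uint16_t> &input)
{
    std::string out;
    for (uint16_t c : input)
    {
        if (c < 0x80)
        {
            out += static_cast<char>(c);                  // 1 byte: ASCII
        }
        else if (c < 0x800)
        {
            out += static_cast<char>(0xC0 | (c >> 6));    // 2 bytes
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
        else if (c >= 0xD800 && c <= 0xDFFF)
        {
            out += '?';  // surrogates are not valid UCS-2 characters
        }
        else
        {
            out += static_cast<char>(0xE0 | (c >> 12));   // 3 bytes
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}

// Hypothetical helper: decode UTF-8 into UCS-2 code units.  Continuation
// bytes are not validated; any byte that cannot start a 1 to 3 byte
// sequence (including 4-byte sequences outside the BMP) becomes '?'.
std::vector<uint16_t> Utf8ToUcs2(const std::string &input)
{
    std::vector<uint16_t> out;
    std::size_t i = 0;
    while (i < input.size())
    {
        const unsigned char b0 = static_cast<unsigned char>(input[i]);
        if (b0 < 0x80)
        {
            out.push_back(b0);
            i += 1;
        }
        else if ((b0 & 0xE0) == 0xC0 && i + 1 < input.size())
        {
            const unsigned char b1 = static_cast<unsigned char>(input[i + 1]);
            out.push_back(static_cast<uint16_t>(((b0 & 0x1F) << 6) | (b1 & 0x3F)));
            i += 2;
        }
        else if ((b0 & 0xF0) == 0xE0 && i + 2 < input.size())
        {
            const unsigned char b1 = static_cast<unsigned char>(input[i + 1]);
            const unsigned char b2 = static_cast<unsigned char>(input[i + 2]);
            out.push_back(static_cast<uint16_t>(
                ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)));
            i += 3;
        }
        else
        {
            out.push_back('?');
            i += 1;
        }
    }
    return out;
}
```

For BMP text the two functions are inverses of each other, which is the property a wchar_t setter and getter pair would rely on.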
+== Encoding Names ==
+It is proposed that the encoding names will be the same sorts of names used
+by iconv(), for example "UTF-8", "LATIN5", "CP850" and "ISO_8859-1".  These
+encoding names do not appear to be a 1:1 match with C library
+locale names (like "en_CA.utf8" for instance), which may cause some issues.
+Some particular names of interest:
+ * "": The current locale.  Use this when converting from/to the user's locale.
+ * "UTF-8": Unicode in multi-byte encoding.  Most of the time this will be
+our internal lingua franca.
+ * "POSIX": I think this is roughly ASCII (perhaps with some extended characters?).
+ * "UCS-2": Two-byte Unicode.  This is a wide character format and only suitable for use with the wchar_t methods.
+On some systems you can use "iconv --list" to get a list of supported encodings.
+== iconv() ==
+It is proposed to implement the CPLString::recode() method using the iconv()
+and related functions when available.
+There is an excellent implementation of this API as GNU libiconv, and the
+GNU C library used on Linux provides iconv() itself.  Also some operating systems provide the
+iconv() API as part of the C library (all unix?); however, the system iconv()
+often has a restricted set of conversions supported so it may be desirable
+to use libiconv in preference to the system iconv() even when it is available.
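As an illustration of the kind of recoding described above, the following is a minimal sketch built directly on the POSIX iconv_open()/iconv()/iconv_close() calls. The free function `RecodeWithIconv` is a hypothetical name used only for this example and is not the proposed CPLString::recode() method. It assumes an iconv() whose input argument is declared `char **` (as in glibc and GNU libiconv); some older systems declare it `const char **`.

```cpp
#include <iconv.h>

#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert `input` from pszSrcEncoding to
// pszDstEncoding, using iconv() style encoding names.
std::string RecodeWithIconv(const std::string &input,
                            const char *pszSrcEncoding,
                            const char *pszDstEncoding)
{
    // Note the argument order of iconv_open(): destination first.
    iconv_t cd = iconv_open(pszDstEncoding, pszSrcEncoding);
    if (cd == (iconv_t)-1)
        throw std::runtime_error("conversion not supported by iconv");

    std::string out;
    char buf[256];
    char *inPtr = const_cast<char *>(input.data());
    std::size_t inLeft = input.size();

    while (inLeft > 0)
    {
        char *outPtr = buf;
        std::size_t outLeft = sizeof(buf);
        const std::size_t rc = iconv(cd, &inPtr, &inLeft, &outPtr, &outLeft);
        const int err = errno;  // capture before any other call
        out.append(buf, sizeof(buf) - outLeft);

        // E2BIG only means the output buffer filled up, so loop again.
        if (rc == (std::size_t)-1 && err != E2BIG)
        {
            iconv_close(cd);
            throw std::runtime_error("invalid or incomplete input sequence");
        }
    }

    // Flush any pending shift state (matters for stateful encodings).
    char *outPtr = buf;
    std::size_t outLeft = sizeof(buf);
    iconv(cd, nullptr, nullptr, &outPtr, &outLeft);
    out.append(buf, sizeof(buf) - outLeft);

    iconv_close(cd);
    return out;
}
```

A call such as `RecodeWithIconv(value, "ISO-8859-1", "UTF-8")` would turn Latin-1 text into UTF-8. When linking against a separate libiconv rather than the C library, `-liconv` is typically needed.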
+If iconv() is not available, a stub implementation of the CPLString services
+will be provided which:
+ * Implements UCS-2 / UTF-8 interconversion using either mbtowc/wctomb, or an implementation derived from http://www.cl.cam.ac.uk/~mgk25/unicode.html.
+ * Implements recoding from "" to and from "UTF-8" by doing nothing, but issuing a warning on the first use if the current locale does not appear to be the "C" locale.
+ * Implements recoding from "ASCII" to "UTF-8" as a null operation.
+ * Implements recoding from "UTF-8" to "ASCII" by turning all non-ASCII multi-byte characters to '?'.
+This hopefully gives us a weak operational status when built without iconv(), but full operation when it is available.
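The lossy "UTF-8" to "ASCII" fallback listed above could look roughly like the sketch below. The function name `Utf8ToAsciiLossy` is hypothetical, and the choice to emit exactly one '?' per multi-byte character (rather than one per byte) is an assumption about the intended behaviour.

```cpp
#include <string>

// Hypothetical stand-in for the "UTF-8" to "ASCII" stub conversion:
// ASCII bytes pass through unchanged and each multi-byte character is
// replaced by a single '?'.
std::string Utf8ToAsciiLossy(const std::string &utf8)
{
    std::string out;
    for (unsigned char b : utf8)
    {
        if (b < 0x80)
            out += static_cast<char>(b);  // plain ASCII
        else if ((b & 0xC0) != 0x80)
            out += '?';                   // lead byte of a multi-byte sequence
        // Continuation bytes (10xxxxxx) produce no output, so stray
        // continuation bytes in malformed input are silently dropped.
    }
    return out;
}
```

The reverse direction, "ASCII" to "UTF-8", needs no code at all because every ASCII string is already valid UTF-8.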
+The --with-iconv=<arg> option will be added to configure.  The argument can be
+the path to a libiconv installation or the special value 'system' indicating
+that the system lib should be used.  Alternatively, --without-iconv can be used
+to avoid using iconv.
+== OFTString/OFTStringList Fields ==
+It is declared that OGR string attribute values will be in UTF-8.  This means
+that OGR drivers are responsible for translating format specific representations
+to UTF-8 when reading, and back to the format specific representation when
+writing.  In many cases (of simple ASCII text) this requires no transformation.
+This implies that the arguments to methods like OGRFeature::SetField( int i, const char *) should be UTF-8, and that GetFieldAsString() will return UTF-8.
+The same issues apply to OFTStringList lists of strings.  Each string will be assumed to be UTF-8.
+== OGR Driver Updates ==
+The following OGR drivers could benefit immediately from recoding to UTF-8
+support in one way or another.
+ * ODBC (add support for wchar_t / NVARCHAR fields)
+ * Shapefile
+ * GML (I'm not sure how the XML encoding values all map to our concept of encoding)
+ * Postgres
+I'm sure a number of the other drivers, particularly the RDBMS drivers, could
+benefit from an update.
+== Implementation ==
+Frank Warmerdam will implement the core iconv() capabilities, the
+CPLString additions and update the ODBC driver.  Other OGR drivers would be
+updated by interested developers, as time and demand dictate, to conform to
+the definitions in this RFC.
+The core work will be completed for the GDAL/OGR 1.6.0 release.
+== References ==
+ * [http://unicode.org/versions/Unicode4.0.0/ch05.pdf The Unicode Standard, Version 4.0 - Implementation Guidelines] - Chapter 5 (PDF)
+ * FAQ on how to use Unicode in software: http://www.cl.cam.ac.uk/~mgk25/unicode.html
+ * FLTK implementation of string conversion functions: http://svn.easysw.com/public/fltk/fltk/trunk/src/utf.c
+ * http://www.easysw.com/~mike/fltk/doc-2.0/html/utf_8h.html
+ * Ticket #1494 : UTF-8 encoding for GML output.
+ * Libiconv: http://www.gnu.org/software/libiconv/