| 1 | = RFC 23: Unicode support in OGR = |
| 2 | |
| 3 | Authors: Frank Warmerdam[[BR]] |
| 4 | Contact: warmerdam@pobox.com[[BR]] |
| 5 | Status: Development |
| 6 | |
| 7 | == Summary == |
| 8 | |
| 9 | This document proposes preliminary steps towards GDAL/OGR handling strings internally in UTF-8, and supporting conversion between different encodings. |
| 10 | |
| 11 | == Main concepts == |
| 12 | |
| 13 | GDAL should be modified to support the following three main ideas: |
| 14 | |
| 15 | 1. The CPLString class will be upgraded to support a variety of encoding conversions, including conversion between representations (e.g. UTF-8 to UCS-2/wchar_t). |
| 16 | 2. Character encodings will be identified by iconv() style strings. |
| 17 | 3. OFTString/OFTStringList feature attributes in OGR will be treated as being in UTF-8. |
| 18 | |
| 19 | This RFC specifically does not attempt to address issues of using non-ASCII filenames. It also does not attempt to define the encoding of other strings used in GDAL/OGR (such as field names, metadata, etc.). These would presumably be addressed in a later RFC building on this one. |
| 20 | |
| 21 | == CPLString == |
| 22 | |
| 23 | The CPLString class will now be assumed to be a potentially multi-byte |
| 24 | encoded string, but with no effort within the CPLString to keep track of |
| 25 | the encoding it is in. This is left to the higher level code for now. |
| 26 | |
| 27 | However, the CPLString is extended with some convenient mechanisms for |
| 28 | recoding, and for conversion of UTF-8 to/from "wchar_t" (treated here as UCS-2). |
| 29 | |
| 30 | It is stressed that normal initialization of a CPLString from "const char *" |
| 31 | does not attempt to do any conversions to/from UTF-8. This rule is kept, |
| 32 | in part to minimize string processing costs for the common case. When encoding |
| 33 | is believed to be an issue, the calling code must keep track of it. |
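One consequence of treating the string as an untagged byte container is that its byte length and its character count can differ. The following is a minimal sketch in standard C++, not GDAL code: `CountUTF8Characters` is a hypothetical helper, and it assumes the input is already valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of CPLString): count the characters
// (code points) in a string that is assumed to hold valid UTF-8.
std::size_t CountUTF8Characters(const std::string &osUTF8)
{
    std::size_t nChars = 0;
    for (unsigned char ch : osUTF8)
    {
        // Continuation bytes have the bit pattern 10xxxxxx. Every other
        // byte starts a new character.
        if ((ch & 0xC0) != 0x80)
            ++nChars;
    }
    return nChars;
}
```

For the five-byte UTF-8 string `"caf\xC3\xA9"`, `size()` reports five while the helper reports four, which is why calling code that cares about characters has to know which encoding it holds.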
| 34 | |
| 35 | {{{ |
| 36 | |
| 37 | // Convert the internal string to a new encoding. |
| 38 | char* CPLString::recode( const char *pszSrcEncoding, const char *pszDstEncoding ); |
| 39 | |
| 40 | // Set value based on input encoded string with CPLString set to UTF-8. |
| 41 | // This is equivalent to normal setting, and then a recode() with a destination |
| 42 | // encoding of "UTF-8" and thus is just for convenience. |
| 43 | |
| 44 | CPLString &CPLString::SetAsUTF8( const char *pszInput, const char *pszEncoding = "" ); |
| 45 | |
| 46 | // Set value based on an input wide character string, with CPLString set to UTF-8. |
| 47 | // pszEncoding names the encoding of the wchar_t input, which is converted |
| 48 | // to UTF-8 as it is stored. |
| 49 | |
| 50 | CPLString &CPLString::SetAsUTF8( const wchar_t *pszInput, const char *pszEncoding = "UCS-2" ); |
| 51 | |
| 52 | // Construct UTF-8 string object from array of wchar_t elements. |
| 53 | CPLString::CPLString( const wchar_t *pszInput, const char *pszEncoding = "UCS-2" ); |
| 54 | |
| 55 | // Returns a wchar_t string which becomes the ownership responsibility of |
| 56 | // the caller (free with CPLFree()). It is assumed the CPLString is UTF-8. |
| 57 | wchar_t *CPLString::GetAsWChar( const char *pszDstEncoding = "UCS-2" ); |
| 58 | }}} |
| 59 | |
| 60 | I have deliberately not added further constructors or casting operators, |
| 61 | to avoid any possible overloading ambiguities or complication in maintaining |
| 62 | extra state in the CPLString. Such services can be added in the future based |
| 63 | on the above methods if desired. |
| 64 | |
| 65 | == Encoding Names == |
| 66 | |
| 67 | It is proposed that the encoding names will be the same sorts of names used |
| 68 | by iconv(), for example "UTF-8", "LATIN5", "CP850" and "ISO_8859-1". It |
| 69 | does not appear that these names for encodings are a 1:1 match with C library |
| 70 | locale names (like "en_CA.utf8" for instance) which may cause some issues. |
| 71 | |
| 72 | Some particular names of interest: |
| 73 | |
| 74 | * "": The current locale. Use this when converting from/to the user's locale. |
| 75 | * "UTF-8": Unicode in multi-byte encoding. Most of the time this will be |
| 76 | our internal lingua franca. |
| 77 | * "POSIX": I think this is roughly ASCII (perhaps with some extended characters?). |
| 78 | * "UCS-2": Two-byte Unicode. This is a wide character format and only suitable for use with the wchar_t methods. |
| 79 | |
| 80 | On some systems you can use "iconv --list" to get a list of supported encodings. |
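Because locale names and iconv() style encoding names do not match one to one, something has to map the "" name onto a concrete character set. One possible approach on POSIX systems is `nl_langinfo(CODESET)`, which reports the character set of the active locale. The sketch below assumes a POSIX platform; `ResolveEncodingName` is a hypothetical helper and not part of this proposal.

```cpp
#include <clocale>
#include <langinfo.h>
#include <string>

// Hypothetical helper: return a concrete encoding name, mapping the
// special "" name onto the character set of the user's locale.
std::string ResolveEncodingName(const char *pszEncoding)
{
    if (pszEncoding != nullptr && pszEncoding[0] != '\0')
        return pszEncoding;

    // Adopt the character classification locale from the environment.
    // If this fails, the process stays in the "C" locale.
    std::setlocale(LC_CTYPE, "");

    // POSIX: name of the character set used by the current locale.
    return nl_langinfo(CODESET);
}
```

The returned name depends entirely on the environment, so no particular value should be relied upon. Note also that the helper changes the process locale as a side effect, which real code may prefer to avoid.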
| 81 | |
| 82 | == iconv() == |
| 83 | |
| 84 | It is proposed to implement the CPLString::recode() method using the iconv() |
| 85 | and related functions when available. |
| 86 | |
| 87 | There is an excellent implementation of this API in GNU libiconv. The GNU C |
| 88 | library on Linux ships its own iconv(). Some other operating systems also provide the |
| 89 | iconv() API as part of the C library (all unix?); however, the system iconv() |
| 90 | often has a restricted set of conversions supported so it may be desirable |
| 91 | to use libiconv in preference to the system iconv() even when it is available. |
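To make the intended use of iconv() concrete, here is a minimal sketch of a recoding routine built on the POSIX `iconv_open()`, `iconv()` and `iconv_close()` calls. It is an illustration only, not the CPLString::recode() implementation: `RecodeWithIconv` is a hypothetical free function, it reports errors with exceptions for brevity, and it assumes the `char **` input signature used by glibc and GNU libiconv (some older systems declare `const char **`).

```cpp
#include <cerrno>
#include <cstddef>
#include <iconv.h>
#include <stdexcept>
#include <string>

// Hypothetical helper: recode osInput from pszSrcEncoding to
// pszDstEncoding using the iconv() API.
std::string RecodeWithIconv(const std::string &osInput,
                            const char *pszSrcEncoding,
                            const char *pszDstEncoding)
{
    // Note the argument order: destination first, then source.
    iconv_t hConv = iconv_open(pszDstEncoding, pszSrcEncoding);
    if (hConv == (iconv_t)-1)
        throw std::runtime_error("conversion not supported by iconv");

    std::string osInCopy(osInput);  // iconv() wants a non-const buffer
    char *pszIn = &osInCopy[0];
    std::size_t nInLeft = osInCopy.size();

    std::string osOutput;
    char achBuf[256];

    while (nInLeft > 0)
    {
        char *pszOut = achBuf;
        std::size_t nOutLeft = sizeof(achBuf);
        const std::size_t nRet =
            iconv(hConv, &pszIn, &nInLeft, &pszOut, &nOutLeft);
        const int nErr = errno;  // save before any other call

        osOutput.append(achBuf, sizeof(achBuf) - nOutLeft);

        // E2BIG only means the chunk buffer filled up, so loop again.
        if (nRet == (std::size_t)-1 && nErr != E2BIG)
        {
            iconv_close(hConv);
            throw std::runtime_error("invalid or incomplete input");
        }
    }

    // Stateful destination encodings would also need a final flush
    // call with a null input buffer; omitted in this sketch.
    iconv_close(hConv);
    return osOutput;
}
```

Under these assumptions, `RecodeWithIconv("caf\xE9", "ISO-8859-1", "UTF-8")` would yield the UTF-8 bytes `"caf\xC3\xA9"` on a system whose iconv() supports both encodings.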
| 92 | |
| 93 | If iconv() is not available, a stub implementation of the CPLString services |
| 94 | will be provided which: |
| 95 | |
| 96 | * Implements UCS-2 / UTF-8 interconversion using either mbtowc/wctomb, or an implementation derived from http://www.cl.cam.ac.uk/~mgk25/unicode.html. |
| 97 | * Implements recoding from "" to and from "UTF-8" by doing nothing, but issuing a warning on the first use if the current locale does not appear to be the "C" locale. |
| 98 | * Implements recoding from "ASCII" to "UTF-8" as a null operation. |
| 99 | * Implements recoding from "UTF-8" to "ASCII" by turning all non-ASCII multi-byte characters to '?'. |
| 100 | |
| 101 | This hopefully gives us a weak operational status when built without iconv(), but full operation when it is available. |
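Two of the stub behaviours listed above are small enough to sketch in standard C++. This is only an illustration of the idea, not the proposed implementation: both function names are hypothetical, `char16_t` stands in for a 16-bit wchar_t, the input to the first function is assumed to be valid UTF-8, and each non-ASCII character is replaced by a single '?'.

```cpp
#include <string>

// Hypothetical stub: "UTF-8" to "ASCII", replacing every non-ASCII
// character with '?'. Assumes valid UTF-8 input.
std::string UTF8ToASCIIStub(const std::string &osUTF8)
{
    std::string osASCII;
    for (unsigned char ch : osUTF8)
    {
        if (ch < 0x80)
            osASCII += static_cast<char>(ch);  // plain ASCII byte
        else if ((ch & 0xC0) != 0x80)
            osASCII += '?';                    // lead byte of a sequence
        // Continuation bytes (10xxxxxx) are dropped.
    }
    return osASCII;
}

// Hypothetical stub: UCS-2 to UTF-8. Each 16-bit unit is treated as one
// code point, so no surrogate pair handling is attempted.
std::string UCS2ToUTF8Stub(const std::u16string &osWide)
{
    std::string osUTF8;
    for (char16_t wc : osWide)
    {
        const unsigned int nCode = wc;
        if (nCode < 0x80)
        {
            osUTF8 += static_cast<char>(nCode);
        }
        else if (nCode < 0x800)
        {
            osUTF8 += static_cast<char>(0xC0 | (nCode >> 6));
            osUTF8 += static_cast<char>(0x80 | (nCode & 0x3F));
        }
        else
        {
            osUTF8 += static_cast<char>(0xE0 | (nCode >> 12));
            osUTF8 += static_cast<char>(0x80 | ((nCode >> 6) & 0x3F));
            osUTF8 += static_cast<char>(0x80 | (nCode & 0x3F));
        }
    }
    return osUTF8;
}
```

With these definitions, the UTF-8 string `"caf\xC3\xA9 ok"` degrades to `"caf? ok"`, and the UCS-2 character U+00E9 encodes to the two bytes `"\xC3\xA9"`.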
| 102 | |
| 103 | The --with-iconv=<arg> option will be added to configure. The argument can be |
| 104 | the path to a libiconv installation or the special value 'system' indicating |
| 105 | that the system lib should be used. Alternatively, --without-iconv can be used |
| 106 | to avoid using iconv. |
| 107 | |
| 108 | == OFTString/OFTStringList Fields == |
| 109 | |
| 110 | It is declared that OGR string attribute values will be in UTF-8. This means |
| 111 | that OGR drivers are responsible for translating format specific representations |
| 112 | to UTF-8 when reading, and back to the format specific representation when |
| 113 | writing. In many cases (of simple ASCII text) this requires no transformation. |
| 114 | |
| 115 | This implies that the arguments to methods like OGRFeature::SetField( int i, const char *) should be UTF-8, and that GetFieldAsString() will return UTF-8. |
| 116 | |
| 117 | The same issues apply to OFTStringList lists of strings. Each string will be assumed to be UTF-8. |
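As an illustration of what this means for a driver on the read side, the sketch below converts a value stored as ISO-8859-1 into UTF-8 before it would be handed to a call such as OGRFeature::SetField(). It is written in standard C++ without any GDAL dependency; `Latin1ToUTF8` is a hypothetical helper, and ISO-8859-1 is only an assumed example of a format specific encoding.

```cpp
#include <string>

// Hypothetical helper: convert ISO-8859-1 (Latin-1) bytes to UTF-8.
// Latin-1 byte values map directly onto code points U+0000..U+00FF.
std::string Latin1ToUTF8(const std::string &osLatin1)
{
    std::string osUTF8;
    osUTF8.reserve(osLatin1.size());
    for (unsigned char ch : osLatin1)
    {
        if (ch < 0x80)
        {
            // ASCII is unchanged, which is why plain ASCII attribute
            // values need no transformation at all.
            osUTF8 += static_cast<char>(ch);
        }
        else
        {
            // U+0080..U+00FF become a two-byte UTF-8 sequence.
            osUTF8 += static_cast<char>(0xC0 | (ch >> 6));
            osUTF8 += static_cast<char>(0x80 | (ch & 0x3F));
        }
    }
    return osUTF8;
}
```

A driver following this pattern would pass `Latin1ToUTF8(value)` to the feature when reading, and apply the inverse mapping when writing.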
| 118 | |
| 119 | == OGR Driver Updates == |
| 120 | |
| 121 | The following OGR drivers could benefit immediately from recoding to UTF-8 |
| 122 | support in one way or another. |
| 123 | |
| 124 | * ODBC (add support for wchar_t / NVARCHAR fields) |
| 125 | * Shapefile |
| 126 | * GML (I'm not sure how the XML encoding values all map to our concept of encoding) |
| 127 | * Postgres |
| 128 | |
| 129 | I'm sure a number of the other drivers, particularly the RDBMS drivers, could |
| 130 | benefit from an update. |
| 131 | |
| 132 | == Implementation == |
| 133 | |
| 134 | Frank Warmerdam will implement the core iconv() capabilities, the |
| 135 | CPLString additions and update the ODBC driver. Other OGR drivers would be |
| 136 | updated by interested developers, as time and demand dictate, to conform to |
| 137 | the definitions in this RFC. |
| 138 | |
| 139 | The core work will be completed for the GDAL/OGR 1.6.0 release. |
| 140 | |
| 141 | == References == |
| 142 | |
| 143 | * [http://unicode.org/versions/Unicode4.0.0/ch05.pdf The Unicode Standard, Version 4.0 - Implementation Guidelines] - Chapter 5 (PDF) |
| 144 | * FAQ on how to use Unicode in software: http://www.cl.cam.ac.uk/~mgk25/unicode.html |
| 145 | * FLTK implementation of string conversion functions: http://svn.easysw.com/public/fltk/fltk/trunk/src/utf.c |
| 146 | * http://www.easysw.com/~mike/fltk/doc-2.0/html/utf_8h.html |
| 147 | * Ticket #1494 : UTF-8 encoding for GML output. |
| 148 | * Libiconv: http://www.gnu.org/software/libiconv/ |