= RFC 5: Unicode support in GDAL =

Author: Andrey Kiselev[[BR]]
Contact: dron@ak4719.spb.edu[[BR]]
Status: Development

== Summary ==

This document proposes how to make the GDAL core locale-independent while
preserving support for native character sets.

== Main concepts ==

GDAL should be modified to support the following three main ideas:

1. Users work in localized environments using their native languages. That means we cannot assume the ASCII character set when working with string data passed to GDAL.
2. GDAL uses UTF-8 encoding internally when working with strings.
3. GDAL uses the Unicode versions of third-party APIs whenever possible.

So all strings used in GDAL are in UTF-8, not plain ASCII. That means we
should convert user input from the local encoding to UTF-8 during
interactive sessions, and do the opposite for GDAL output. For example, when
a user passes a filename as a command-line parameter to a GDAL utility, that
filename should be converted to UTF-8 immediately and only afterwards passed
to functions like GDALOpen() or OGROpen(). All functions that take character
strings as parameters assume UTF-8 (except for the few that perform the
conversion between different encodings, see Implementation). The same holds
for output functions. The output functions embedded in GDAL
(CPLError/CPLDebug) should convert all strings from UTF-8 to the local
encoding just before printing them. Custom error handlers should be aware of
the UTF-8 issue and properly convert the strings passed to them.

String encoding comes up again when GDAL needs to call a third-party API.
UTF-8 should be converted to an encoding suitable for that API. In
particular, we should convert UTF-8 to UTF-16 before calling the CreateFile()
function in the Windows implementation of VSIFOpenL(). Another example is the
PostgreSQL API. When a PostgreSQL database uses the UTF-8 encoding, strings
are stored in UTF-8 internally, so we should notify the server that the
passed string is already in UTF-8; it will then be stored as is, without any
conversion or loss.
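
As an illustration of the UTF-8 to UTF-16 step, here is a minimal, portable sketch. The helper name Utf8ToUtf16 is hypothetical and not part of GDAL, and the replacement policy for malformed input is an assumption. On Windows the MultiByteToWideChar() system call with CP_UTF8 would be a more likely choice than hand-written decoding.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper, not part of GDAL: decode a UTF-8 string into UTF-16
// code units. Each malformed byte is replaced with U+FFFD (an assumption).
std::u16string Utf8ToUtf16( const std::string &osUtf8 )
{
    // Smallest code point that legitimately needs 0, 1, 2 or 3
    // continuation bytes; smaller values are overlong encodings.
    static const std::uint32_t anMinValue[] = { 0x0, 0x80, 0x800, 0x10000 };
    const char16_t chReplacement = 0xFFFD;

    std::u16string osResult;
    const std::size_t nSize = osUtf8.size();
    std::size_t i = 0;
    while( i < nSize )
    {
        const unsigned char chLead = static_cast<unsigned char>( osUtf8[i] );
        std::uint32_t nCodePoint = 0;
        std::size_t nExtra = 0; // number of continuation bytes expected
        bool bValid = true;

        if( chLead < 0x80 )
            nCodePoint = chLead;
        else if( (chLead & 0xE0) == 0xC0 )
        {
            nCodePoint = chLead & 0x1F;
            nExtra = 1;
        }
        else if( (chLead & 0xF0) == 0xE0 )
        {
            nCodePoint = chLead & 0x0F;
            nExtra = 2;
        }
        else if( (chLead & 0xF8) == 0xF0 )
        {
            nCodePoint = chLead & 0x07;
            nExtra = 3;
        }
        else
            bValid = false; // stray continuation byte or invalid lead byte

        if( bValid && i + nExtra >= nSize )
            bValid = false; // sequence is truncated

        for( std::size_t k = 1; bValid && k <= nExtra; ++k )
        {
            const unsigned char ch =
                static_cast<unsigned char>( osUtf8[i + k] );
            if( (ch & 0xC0) != 0x80 )
                bValid = false;
            else
                nCodePoint = (nCodePoint << 6) | (ch & 0x3F);
        }

        // Reject overlong forms, surrogates and values beyond U+10FFFF.
        if( bValid && ( nCodePoint < anMinValue[nExtra]
                        || nCodePoint > 0x10FFFF
                        || ( nCodePoint >= 0xD800 && nCodePoint <= 0xDFFF ) ) )
            bValid = false;

        if( !bValid )
        {
            osResult.push_back( chReplacement );
            ++i;
            continue;
        }
        i += nExtra + 1;

        if( nCodePoint < 0x10000 )
            osResult.push_back( static_cast<char16_t>( nCodePoint ) );
        else
        {
            // Code points above the BMP become a surrogate pair.
            nCodePoint -= 0x10000;
            osResult.push_back(
                static_cast<char16_t>( 0xD800 + (nCodePoint >> 10) ) );
            osResult.push_back(
                static_cast<char16_t>( 0xDC00 + (nCodePoint & 0x3FF) ) );
        }
    }
    return osResult;
}
```

On Windows, where wchar_t holds UTF-16 code units, the result is the kind of string that the wide character file APIs expect.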

For file format drivers, the string representation should be worked out on a
per-driver basis. Not all file formats support non-ASCII characters. For
example, various .HDR labeled rasters are just 7-bit ASCII text files, and it
is not a good idea to write 8-bit strings into such files. When we need to
pass strings extracted from such a file outside the driver (e.g., in a
SetMetadata() call), we should convert them to UTF-8. If the extracted strings
are used only internally in the driver, no conversion is needed.
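
A driver for such a 7-bit format might guard its writes with a check like the one below. This is only a sketch: IsSevenBitAscii is a hypothetical helper name, not an existing GDAL function.

```cpp
#include <string>

// Hypothetical helper (not an existing GDAL function): returns true when
// every byte of the string is in the 7-bit ASCII range 0x00..0x7F.
bool IsSevenBitAscii( const std::string &osValue )
{
    for( unsigned char ch : osValue )
    {
        if( ch > 0x7F )
            return false;
    }
    return true;
}
```

Because UTF-8 encodes ASCII characters as themselves, a string that passes this check is already valid UTF-8 and needs no conversion before leaving the driver.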

In some cases the file encoding can differ from the local system encoding, and
we have no way to know the file encoding other than asking the user (for
example, imagine that someone added an 8-bit non-ASCII string field to the
plain text .HDR file mentioned above). That means we cannot convert from the
local encoding to UTF-8; we must convert from the file encoding to UTF-8. So
we need a way to obtain the file encoding on a per-datasource basis. The
natural solution is to introduce an optional open parameter "ENCODING" for
the GDALOpen/OGROpen functions. Unfortunately, those functions do not accept
options; that should be introduced in another RFC. Fortunately, there is no
need to add the encoding parameter immediately, because it is independent of
the general i18n process. We can add UTF-8 support as defined in this RFC now
and add support for forcing a per-datasource encoding later, when open
options are introduced.

== Implementation ==

* New character conversion functions will be introduced in the CPLString class. Objects of that class always contain a UTF-8 string internally.
{{{

// Get the string in the local encoding from the internal UTF-8 encoded string.
// Out-of-range characters are replaced with '?' in the output string.
// nEncoding A codename of the encoding. If 0, the local system
// encoding will be used.
char* CPLString::recode( int nEncoding = 0 );

// Construct a UTF-8 string object from a string in another encoding.
// nEncoding A codename of the encoding. If 0, the local system
// encoding will be used.
CPLString::CPLString( const char*, int nEncoding );

// Construct a UTF-8 string object from an array of wchar_t elements.
// The source encoding is system specific.
CPLString::CPLString( wchar_t* );

// Convert the string from UTF-8 into an array of wchar_t elements.
// The destination encoding is system specific.
operator wchar_t* (void) const;
}}}

* In order to use non-ASCII characters in user input, every application should call setlocale(LC_ALL, "") right after the entry point.

* Code example. Let's look at how the GDAL utilities and core code should be changed with regard to Unicode.

For input instead of

{{{
pszFilename = argv[i];
if( pszFilename )
    hDataset = GDALOpen( pszFilename, GA_ReadOnly );
}}}

we should do

{{{

CPLString oFilename(argv[i], 0); // <-- Conversion from local encoding to UTF-8
hDataset = GDALOpen( oFilename.c_str(), GA_ReadOnly );

}}}

For output instead of

{{{

printf( "Description = %s\n", GDALGetDescription(hBand) );

}}}

we should do

{{{

CPLString oDescription( GDALGetDescription(hBand) );
printf( "Description = %s\n", oDescription.recode( 0 ) ); // <-- Conversion
                                                           // from UTF-8 to local

}}}

The filename passed to GDALOpen() in UTF-8 encoding in the code snippet
above will be processed further in the GDAL core. On Windows, instead of

{{{

hFile = CreateFile( pszFilename, dwDesiredAccess,
                    FILE_SHARE_READ | FILE_SHARE_WRITE, NULL, dwCreationDisposition,
                    dwFlagsAndAttributes, NULL );

}}}

we do

{{{

CPLString oFilename( pszFilename );
// I prefer to call the wide character version explicitly
// rather than specify the _UNICODE switch.
hFile = CreateFileW( (wchar_t *)oFilename, dwDesiredAccess,
                     FILE_SHARE_READ | FILE_SHARE_WRITE, NULL,
                     dwCreationDisposition, dwFlagsAndAttributes, NULL );

}}}

* The actual implementation of the character conversion functions is not specified in this document yet; it needs additional discussion. The main problem is that we need not only local<->UTF-8 encoding conversions, but also '''arbitrary'''<->UTF-8 ones. That requires significant support on the software side.
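
One candidate for the arbitrary-to-UTF-8 direction is the POSIX iconv interface. The sketch below is an assumption offered for discussion, not the implementation chosen by this RFC. RecodeToUtf8 is a hypothetical name, the set of supported encoding names depends on the platform's iconv, and on some platforms the second argument of iconv() is declared as const char **.

```cpp
#include <cerrno>
#include <cstddef>
#include <iconv.h>
#include <stdexcept>
#include <string>

// Hypothetical helper, not part of GDAL: recode osInput from the named
// source encoding to UTF-8 using POSIX iconv.
std::string RecodeToUtf8( const std::string &osInput,
                          const char *pszFromEncoding )
{
    iconv_t hConv = iconv_open( "UTF-8", pszFromEncoding );
    if( hConv == (iconv_t)(-1) )
        throw std::runtime_error( "unsupported source encoding" );

    std::string osResult;
    char achBuffer[256];
    // iconv() does not modify the input bytes, but takes a non-const pointer.
    char *pszSrc = const_cast<char *>( osInput.data() );
    std::size_t nSrcLeft = osInput.size();

    while( nSrcLeft > 0 )
    {
        char *pszDst = achBuffer;
        std::size_t nDstLeft = sizeof(achBuffer);
        const std::size_t nRet =
            iconv( hConv, &pszSrc, &nSrcLeft, &pszDst, &nDstLeft );
        // Save errno before any other call can overwrite it.
        const int nError = ( nRet == (std::size_t)(-1) ) ? errno : 0;

        // Keep whatever was converted in this round.
        osResult.append( achBuffer, sizeof(achBuffer) - nDstLeft );

        // E2BIG only means the output buffer is full: loop again.
        if( nError != 0 && nError != E2BIG )
        {
            iconv_close( hConv );
            throw std::runtime_error( "invalid or incomplete input sequence" );
        }
    }

    // Stateful source encodings would need a final flush call here;
    // it is omitted to keep the sketch short.
    iconv_close( hConv );
    return osResult;
}
```

For example, RecodeToUtf8( "caf\xE9", "ISO-8859-1" ) turns the single Latin-1 byte 0xE9 into the two-byte UTF-8 sequence 0xC3 0xA9. The opposite direction would swap the two arguments of iconv_open().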

== Backward Compatibility ==

This new functionality will break GDAL/OGR backward compatibility in the way
8-bit characters are handled. Previously, users could rely on all 8-bit
character strings passing through GDAL/OGR unchanged and containing exactly
the same data all the way. Now this is only true for 7-bit ASCII and 8-bit
UTF-8 encoded strings. Note that if you used only the ASCII subset with GDAL,
you are not affected by these changes.

== References ==
* FAQ on how to use Unicode in software: http://www.cl.cam.ac.uk/~mgk25/unicode.html
* FLTK implementation of string conversion functions: http://svn.easysw.com/public/fltk/fltk/trunk/src/utf.c
* http://www.easysw.com/~mike/fltk/doc-2.0/html/utf_8h.html
* Ticket #1494 : UTF-8 encoding for GML output.