Changes between Initial Version and Version 1 of rfc5_unicode


Timestamp:
Apr 28, 2007, 12:12:17 PM (17 years ago)
Author:
warmerdam
Comment:

Ported from doxygen.

  • rfc5_unicode

    v1 v1  
     1= RFC 5: Unicode support in GDAL =
     2
     3Author: Andrey Kiselev[[BR]]
     4Contact: dron@ak4719.spb.edu[[BR]]
     5Status: Development
     6
     7== Summary ==
     8
     9This document contains a proposal on how to make the GDAL core locale-independent
     10while preserving support for native character sets.
     11
     12== Main concepts ==
     13
     14GDAL should be modified to support the following three main ideas:
     15
     16 1. Users work in localized environments using their native languages. That means we cannot assume the ASCII character set when working with string data passed to GDAL.
     17 2. GDAL uses UTF-8 encoding internally when working with strings.
     18 3. GDAL uses the Unicode versions of third-party APIs whenever possible.
     19
     20So all strings used in GDAL are in UTF-8, not in plain ASCII. That means we
     21should convert the user's input from the local encoding to UTF-8 during
     22interactive sessions. The opposite should be done for GDAL output. For
     23example, when a user passes a filename as a command-line parameter to the GDAL
     24utilities, that filename should be immediately converted to UTF-8 and only
     25afterwards passed to functions like GDALOpen() or OGROpen(). All functions
     26that take character strings as parameters assume UTF-8 (with the exception of
     27a few that convert between different encodings, see
     28Implementation). The same is valid for output functions. The output
     29functions embedded in GDAL (CPLError/CPLDebug) should convert all strings
     30from UTF-8 to the local encoding just before printing them. Custom error handlers
     31should be aware of the UTF-8 issue and properly convert the strings
     32passed to them.
     33
     34The string encoding question comes up again when GDAL needs to call a third-party API.
     35UTF-8 should be converted to an encoding suitable for that API. In particular,
     36that means we should convert UTF-8 to UTF-16 before calling the CreateFile()
     37function in the Windows implementation of VSIFOpenL(). Another example is the
     38PostgreSQL API. A PostgreSQL database can store strings in UTF-8 internally; in that case we
     39should notify the server that the passed string is already in UTF-8, so that it is
     40stored as is, without any conversion or loss.
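
The UTF-8 to UTF-16 step mentioned above can be sketched in portable C++. This is only an illustration under stated assumptions: the function name Utf8ToUtf16 is hypothetical and is not part of GDAL, and replacing malformed input with U+FFFD is a choice made here, not something this RFC specifies.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not a GDAL function): decode UTF-8 and re-encode it
// as UTF-16. Each malformed byte is replaced with U+FFFD (an assumption).
std::u16string Utf8ToUtf16(const std::string& utf8)
{
    std::u16string out;
    const std::size_t n = utf8.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::uint32_t cp = lead;   // code point being assembled
        std::uint32_t min_cp = 0;  // smallest value allowed for this length
        std::size_t extra = 0;     // continuation bytes expected
        bool ok = true;
        if (lead < 0x80)                { /* single-byte ASCII */ }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; min_cp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; min_cp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; min_cp = 0x10000; }
        else                            { ok = false; }  // invalid lead byte
        if (ok && extra >= n - i) ok = false;            // truncated sequence
        for (std::size_t j = 1; ok && j <= extra; ++j) {
            const unsigned char c = static_cast<unsigned char>(utf8[i + j]);
            if ((c & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, surrogates and values beyond U+10FFFF.
        if (ok && (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)))
            ok = false;
        if (!ok) { cp = 0xFFFD; extra = 0; }
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;  // split into a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += extra + 1;
    }
    return out;
}
```

On Windows, where wchar_t is 16 bits wide, a buffer produced this way would have the layout that the wide-character APIs expect. A real implementation might instead rely on a system facility for the conversion.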
     41
     42For file format drivers, the string representation should be worked out on a
     43per-driver basis. Not all file formats support non-ASCII characters. For
     44example, various .HDR labeled rasters are just 7-bit ASCII text files, and it
     45is not a good idea to write 8-bit strings to such files. When we need to
     46pass strings extracted from such a file outside the driver (e.g., in a
     47SetMetadata() call), we should convert them to UTF-8. If the extracted strings
     48are only used internally in the driver, no conversion is needed.
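
A driver for a 7-bit format could guard its writes with a check like the one below. This is a minimal sketch; the name IsPlainAscii is hypothetical and is not an existing GDAL function.

```cpp
#include <string>

// Hypothetical helper (not a GDAL function): returns true when every byte
// of the string is 7-bit ASCII, i.e. safe to write to a plain ASCII header.
bool IsPlainAscii(const std::string& s)
{
    for (unsigned char c : s) {
        if (c > 0x7F)
            return false;  // 8-bit byte found
    }
    return true;
}
```

Because ASCII is a subset of UTF-8, a string that passes this check needs no conversion in either direction.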
     49
     50In some cases the file encoding can differ from the local system encoding, and
     51we have no way to know the file encoding other than to ask the user (for
     52example, imagine that someone added an 8-bit non-ASCII string field to the
     53plain-text .HDR file mentioned above). That means we cannot convert from
     54the local encoding to UTF-8; we must convert from the file encoding to UTF-8. So we need a
     55way to obtain the file encoding on a per-datasource basis. The natural
     56solution is to introduce an optional open parameter, "ENCODING", to the
     57GDALOpen/OGROpen functions. Unfortunately, those functions do not accept
     58options. That should be introduced in another RFC. Fortunately, there is no
     59need to add the encoding parameter immediately, because it is independent of the
     60general i18n process. We can add UTF-8 support as it is defined in this RFC
     61and add support for forcing a per-datasource encoding later, when
     62open options are introduced.
     63
     64== Implementation ==
     65
     66 * New character conversion functions will be introduced in the CPLString class. Objects of that class always contain a UTF-8 string internally.
     67{{{
     68
     69// Get string in local encoding from the internal UTF-8 encoded string.
     70// Out-of-range characters are replaced with '?' in the output string.
     71// nEncoding: a code name of the encoding. If 0, the local system
     72// encoding will be used.
     73char* CPLString::recode( int nEncoding = 0 );
     74
     75// Construct a UTF-8 string object from a string in another encoding.
     76// nEncoding: a code name of the encoding. If 0, the local system
     77// encoding will be used.
     78CPLString::CPLString( const char*, int nEncoding );
     79
     80// Construct UTF-8 string object from array of wchar_t elements.
     81// Source encoding is system specific.
     82CPLString::CPLString( wchar_t* );
     83
     84// Get string from UTF-8 encoding into array of wchar_t elements.
     85// Destination encoding is system specific.
     86operator wchar_t* (void) const;
     87}}}
     88
     89 * In order to use non-ASCII characters in user input, every application should call the setlocale(LC_ALL, "") function right after the entry point.
     90
     91 * Code example. Let's look at how the GDAL utilities and core code should be changed with regard to Unicode.
     92
     93For input instead of
     94
     95{{{
     96pszFilename = argv[i];
     97if( pszFilename )
     98        hDataset = GDALOpen( pszFilename, GA_ReadOnly );
     99}}}
     100
     101we should do
     102
     103{{{
     104
     105CPLString oFilename(argv[i], 0); // <-- Conversion from local encoding to UTF-8
     106hDataset = GDALOpen( oFilename.c_str(), GA_ReadOnly );
     107
     108}}}
     109
     110For output instead of
     111
     112{{{
     113
     114printf( "Description = %s\n", GDALGetDescription(hBand) );
     115
     116}}}
     117
     118we should do
     119
     120{{{
     121
     122CPLString oDescription( GDALGetDescription(hBand) );
     123printf( "Description = %s\n", oDescription.recode( 0 ) ); // <-- Conversion
     124                                                        // from UTF-8 to local
     125
     126}}}
     127
     128The filename passed to GDALOpen() in UTF-8 encoding in the code snippet
     129above will be further processed in the GDAL core. On Windows instead of
     130
     131{{{
     132
     133hFile = CreateFile( pszFilename, dwDesiredAccess,
     134        FILE_SHARE_READ | FILE_SHARE_WRITE, NULL, dwCreationDisposition,
     135        dwFlagsAndAttributes, NULL );
     136
     137}}}
     138
     139we do
     140
     141{{{
     142
     143CPLString oFilename( pszFilename );
     144// I prefer to call the wide-character version explicitly
     145// rather than specify the _UNICODE switch.
     146hFile = CreateFileW( (wchar_t *)oFilename, dwDesiredAccess,
     147        FILE_SHARE_READ | FILE_SHARE_WRITE, NULL,
     148        dwCreationDisposition,  dwFlagsAndAttributes, NULL );
     149
     150}}}
     151
     152 * The actual implementation of the character conversion functions is not specified in this document yet. It needs additional discussion. The main problem is that we need not only local<->UTF-8 encoding conversions, but '''arbitrary'''<->UTF-8 ones. That requires significant support on the software side.
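
As one small instance of an arbitrary<->UTF-8 conversion, the sketch below handles ISO-8859-1 (Latin-1) only. That encoding is chosen here as an assumption because its byte values map directly to the code points U+0000..U+00FF; a general solution would need conversion tables or a library. The function names are hypothetical and are not part of GDAL. Replacing unrepresentable characters with '?' follows the behaviour described for recode() above.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: ISO-8859-1 bytes -> UTF-8.
std::string Latin1ToUtf8(const std::string& latin1)
{
    std::string out;
    for (unsigned char c : latin1) {
        if (c < 0x80) {
            out += static_cast<char>(c);                  // ASCII is unchanged
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));    // lead byte
            out += static_cast<char>(0x80 | (c & 0x3F));  // continuation byte
        }
    }
    return out;
}

// Hypothetical helper: UTF-8 -> ISO-8859-1. Characters above U+00FF and
// malformed sequences are replaced with '?'.
std::string Utf8ToLatin1(const std::string& utf8)
{
    std::string out;
    const std::size_t n = utf8.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(utf8[i]);
        const unsigned char next =
            (i + 1 < n) ? static_cast<unsigned char>(utf8[i + 1]) : 0;
        if (c < 0x80) {
            out += static_cast<char>(c);
            ++i;
        } else if ((c == 0xC2 || c == 0xC3) && (next & 0xC0) == 0x80) {
            // Two-byte sequences with lead 0xC2 or 0xC3 cover U+0080..U+00FF.
            out += static_cast<char>(((c & 0x03) << 6) | (next & 0x3F));
            i += 2;
        } else {
            out += '?';
            ++i;
            // Skip the continuation bytes that belong to the same sequence.
            while (i < n && (static_cast<unsigned char>(utf8[i]) & 0xC0) == 0x80)
                ++i;
        }
    }
    return out;
}
```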
     153
     154== Backward Compatibility ==
     155
     156The GDAL/OGR backward compatibility will be broken by this new functionality
     157in the way 8-bit characters are handled. Previously, users could rely on all
     1588-bit character strings being passed through GDAL/OGR without change and
     159containing exactly the same data all the way. Now that is only true for 7-bit
     160ASCII and 8-bit UTF-8 encoded strings. Note that if you used only the ASCII
     161subset with GDAL, you are not affected by these changes.
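
To judge whether existing data is affected, one could test whether a string is already well-formed UTF-8. The sketch below is one way to do that; the name IsValidUtf8 is hypothetical and is not an existing GDAL function.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not a GDAL function): returns true when the byte
// string is well-formed UTF-8. Pure ASCII strings always pass.
bool IsValidUtf8(const std::string& s)
{
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        if (lead < 0x80) { ++i; continue; }  // ASCII byte
        std::size_t extra;                   // continuation bytes expected
        std::uint32_t cp;                    // code point being assembled
        std::uint32_t min_cp;                // rejects overlong encodings
        if      ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; min_cp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; min_cp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; min_cp = 0x10000; }
        else return false;                   // invalid lead byte
        if (extra >= n - i) return false;    // sequence is truncated
        for (std::size_t j = 1; j <= extra; ++j) {
            const unsigned char c = static_cast<unsigned char>(s[i + j]);
            if ((c & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += extra + 1;
    }
    return true;
}
```

A string in a legacy 8-bit encoding that contains non-ASCII bytes will usually fail this check, which suggests it needs an explicit conversion before being passed to GDAL under this proposal.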
     162
     163== References ==
     164 * FAQ on how to use Unicode in software: http://www.cl.cam.ac.uk/~mgk25/unicode.html
     165 * FLTK implementation of string conversion functions: http://svn.easysw.com/public/fltk/fltk/trunk/src/utf.c
     166 * http://www.easysw.com/~mike/fltk/doc-2.0/html/utf_8h.html
     167 * Ticket #1494 : UTF-8 encoding for GML output.
     168