= RFC 30: Unicode Filenames =

Authors: Frank Warmerdam[[BR]]
Contact: warmerdam@pobox.com[[BR]]
Status: Development

== Summary ==

This document describes steps to generally handle filenames as UTF-8 strings in GDAL/OGR. In brief, it will be assumed that filenames passed into and returned by GDAL/OGR interfaces are UTF-8. On some operating systems, notably Windows, this will require use of "wide character" interfaces in the low-level VSI*L API.

== Key Interfaces ==

=== VSI*L API ===

It is likely that only the cpl_vsil_win32.cpp implementation of these functions needs to be made UTF-8 aware.

 * VSIFOpenL()
 * VSIFStatL()
 * VSIReadDir()
 * VSIMkdir()
 * VSIRmdir()
 * VSIUnlink()
 * VSIRename()

=== Old (small file) VSI API ===

Do we need to convert the old (real fopen()) based API to support UTF-8 filenames? On Windows this might be hard. Perhaps we could take this opportunity to reduce its use to a minimum?

 * VSIFOpen()
 * VSIStat()

=== Filename Parsing ===

Do we need to convert to wide characters (UCS-2/UTF-16) to do parsing, or can we safely assume that special characters like '/', '.', '\' and ':' never occur as part of UTF-8 multi-byte sequences?

 * CPLGetPath()
 * CPLGetDirname()
 * CPLGetFilename()
 * CPLGetBasename()
 * CPLGetExtension()
 * CPLResetExtension()

=== Other ===

 * CPLStat()
 * CPLGetCurrentDir()

== Windows ==

Currently the cpl_vsil_win32.cpp module on Windows uses CreateFile() with non-Unicode (narrow character) filenames. It needs to be converted to use CreateFileW() and the other wide character functions for stat, rename, mkdir, etc. A prototype implementation has already been developed (r20620).

== Linux / Unix / MacOS X ==

On modern Linux, Unix and MacOS X operating systems the fopen(), stat() and readdir() functions already support UTF-8 strings. It is not currently anticipated that any work will be needed on Linux/Unix/MacOS X, though there is some question about this.

Under this RFC it is considered permissible for old or substandard operating systems (WinCE?) to support only ASCII, not UTF-8, filenames.
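On the question raised under Filename Parsing: UTF-8 encodes every non-ASCII character using only bytes in the range 0x80 to 0xFF, so the bytes for '/', '.', '\' and ':' cannot appear inside a multi-byte sequence. The Python 3 sketch below illustrates this property on the sample filename used elsewhere in this RFC. The helpers `multibyte_bytes` and `split_extension` are hypothetical names written for illustration only; they are not part of GDAL and do not describe how CPLGetExtension() is implemented.

```python
# Illustrative check (Python 3): separator bytes never occur inside
# UTF-8 multi-byte sequences, so byte-wise filename parsing is safe.

SEPARATORS = b"/\\.:"  # the special characters named in this RFC


def multibyte_bytes(text):
    """Return all bytes that belong to multi-byte UTF-8 sequences in text."""
    out = bytearray()
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1:
            out.extend(encoded)
    return bytes(out)


def split_extension(utf8_name):
    """Hypothetical byte-wise parser: split a UTF-8 filename at its last '.'.

    Returns (basename, extension) as bytes; the extension is empty when the
    name contains no '.'.
    """
    base, dot, ext = utf8_name.rpartition(b".")
    if not dot:
        return utf8_name, b""
    return base, ext


name = "xx\u4E2D\u6587.\u4E2D\u6587"  # same sample name used in this RFC
raw = name.encode("utf-8")

# Every byte of a multi-byte sequence has its high bit set (>= 0x80),
# so none of them can be mistaken for an ASCII separator.
assert all(b >= 0x80 for b in multibyte_bytes(name))
assert not any(b in SEPARATORS for b in multibyte_bytes(name))

base, ext = split_extension(raw)
print(repr(base), repr(ext))
```

Because the split happens on raw bytes, the two halves are still valid UTF-8 and can be decoded independently.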
== Python Changes ==

I observe with Python 2.6 that functions like gdal.Open() do not accept unicode strings, but they do accept UTF-8 string objects. One possible solution is to update the bindings in selected places to identify unicode strings passed in and transform them to UTF-8 strings, e.g.

{{{
# Sample filename containing multibyte (Chinese) characters.
filename = u'xx\u4E2D\u6587.\u4E2D\u6587'

# If the caller passed a unicode object, encode it to a UTF-8 string.
if type(filename) == type(u'a'):
    filename = filename.encode('utf-8')
}}}

I'm not sure what the easiest way is to accomplish this in the bindings. The key entries are:

 * gdal.Open()
 * ogr.Open()
 * gdal.ReadDir()
 * gdal.PushFinderLocation()
 * gdal.FindFile()
 * gdal.Unlink()

Note that failure to address these would only be an inconvenience, requiring applications to do the UTF-8 conversion themselves.

In theory, functions that return filenames, such as gdal.ReadDir(), should return unicode strings, but from my perspective it seems adequate to always return UTF-8 strings and let the application translate if needed.

== Java / C# / Perl Changes ==

I am not familiar enough with these environments to know if any changes are desirable to support unicode filenames more smoothly.

== Utilities ==

== Commandline Issues ==

On Windows it may be very difficult to specify multi-byte filenames on the command line. In my experience with a file with a Chinese name, the command line treats multi-byte characters as '?'. I'm not sure if this is a locale-dependent issue.

== File Formats ==

The proposed implementation really only addresses file format drivers that use VSIFOpenL() and related functions. Some drivers that depend on external libraries (e.g. netCDF) do not have a way to hook the file I/O API and may not support UTF-8 filenames. It might be nice to be able to distinguish these. At the very least, any driver marked with GDAL_DCAP_VIRTUALIO as "YES" will support UTF-8. Perhaps this opportunity ought to be used to apply this driver metadata more uniformly.
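As a rough illustration of how the GDAL_DCAP_VIRTUALIO metadata could be used to tell drivers apart, the Python 3 sketch below collects the short names of drivers that report "YES" for that item. The helper `drivers_with_virtualio` is a hypothetical name, not part of GDAL. The guarded section at the bottom assumes the GDAL Python bindings (osgeo.gdal) are installed and that they expose the metadata key as `gdal.DCAP_VIRTUALIO`.

```python
# Sketch: list drivers that advertise virtual I/O (VSI*L) support.


def drivers_with_virtualio(drivers, key="DCAP_VIRTUALIO"):
    """Return the sorted short names of drivers whose `key` item is "YES".

    `drivers` is any iterable of objects offering GetMetadataItem() and a
    ShortName attribute, as GDAL driver objects do.
    """
    return sorted(
        drv.ShortName
        for drv in drivers
        if drv.GetMetadataItem(key) == "YES"
    )


if __name__ == "__main__":
    try:
        from osgeo import gdal  # assumption: GDAL Python bindings installed
    except ImportError:
        gdal = None  # bindings not available; nothing to report

    if gdal is not None:
        all_drivers = [gdal.GetDriver(i) for i in range(gdal.GetDriverCount())]
        print(drivers_with_virtualio(all_drivers, gdal.DCAP_VIRTUALIO))
```

Drivers missing from the resulting list are the ones whose UTF-8 filename support cannot be inferred from this metadata alone.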
As part of this effort the following drivers have been marked as supporting virtual I/O (VSI*L API) after testing, or upgraded to support it: BSB, PDS, ISIS2, ISIS3, DIMAP, AIG, XPM, CEOS, SAR_CEOS, SDTS, FIT, GRIB, EIR, LAN, INGR.

== Test Suite ==

We will need to introduce some test suite tests with multibyte UTF-8 filenames. In support of that, parts of the VSI*L API have been exposed in Python, particularly the rename, mkdir and rmdir functions and VSIFOpenL() itself.
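A test along these lines might look like the following sketch. It is written for Python 3, where str is Unicode, which goes beyond the Python 2.6 context of this RFC. The function name `utf8_filename_roundtrip` and the sample names are hypothetical. The `vsi` argument is expected to be a module offering Mkdir, VSIFOpenL, VSIFCloseL, ReadDir, Rename, Unlink and Rmdir, which is normally osgeo.gdal. The sketch also assumes that failures are reported through return codes rather than Python exceptions.

```python
# Sketch of a multibyte-filename round trip through the VSI*L calls
# exposed in the Python bindings.

UTF8_DIR = "\u4E2D\u6587_dir"              # sample multibyte names
UTF8_FILE = "xx\u4E2D\u6587.\u4E2D\u6587"
UTF8_RENAMED = "yy\u4E2D\u6587.\u4E2D\u6587"


def utf8_filename_roundtrip(vsi, base_dir):
    """Create, list, rename and remove files with multibyte names.

    Returns True when every step behaved as expected, False otherwise.
    """
    dirname = base_dir + "/" + UTF8_DIR
    first = dirname + "/" + UTF8_FILE
    second = dirname + "/" + UTF8_RENAMED

    if vsi.Mkdir(dirname, 0o755) != 0:
        return False

    current = None  # name of the file that currently exists, if any
    try:
        fp = vsi.VSIFOpenL(first, "wb")
        if fp is None:
            return False
        vsi.VSIFCloseL(fp)
        current = first

        # The new file should be listed under its original UTF-8 name.
        listed = UTF8_FILE in (vsi.ReadDir(dirname) or [])

        renamed = vsi.Rename(first, second) == 0
        if renamed:
            current = second

        # After the rename, the new UTF-8 name should be listed instead.
        relisted = UTF8_RENAMED in (vsi.ReadDir(dirname) or [])
        return listed and renamed and relisted
    finally:
        # Clean up whichever file exists, then the directory itself.
        if current is not None:
            vsi.Unlink(current)
        vsi.Rmdir(dirname)
```

With the bindings installed, this could be called as `utf8_filename_roundtrip(gdal, "tmp")` after `from osgeo import gdal`, where "tmp" stands for any existing writable directory.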