= RFC 30: Unicode Filenames =

Authors: Frank Warmerdam[[BR]]
Contact: warmerdam@pobox.com[[BR]]
Status: Development

== Summary ==

This document describes steps to generally handle filenames as UTF-8 strings in GDAL/OGR. In brief, it will be assumed that filenames passed into and returned by GDAL/OGR interfaces are UTF-8. On some operating systems, notably Windows, this will require use of "wide character" interfaces in the low-level VSI*L API.

== Key Interfaces ==

=== VSI*L API ===

Likely it is only the cpl_vsil_win32.cpp implementation of these functions that needs to be made UTF-8 aware.

 * VSIFOpenL()
 * VSIFStatL()
 * VSIReadDir()
 * VSIMkdir()
 * VSIRmdir()
 * VSIUnlink()
 * VSIRename()

=== Old (small file) VSI API ===

Do we need to convert the old (real fopen()) based API to support UTF-8 filenames? On Windows this might be hard. Perhaps we could take this opportunity to reduce its use to a minimum?

 * VSIFOpen()
 * VSIStat()

=== Filename Parsing ===

Do we need to convert to UTF-16 to do parsing, or can we safely assume that special characters like '/', '.', '\' and ':' never occur as part of UTF-8 multi-byte sequences?

 * CPLGetPath()
 * CPLGetDirname()
 * CPLGetFilename()
 * CPLGetBasename()
 * CPLGetExtension()
 * CPLResetExtension()

=== Other ===

 * CPLStat()
 * CPLGetCurrentDir()

== Windows ==

Currently the Windows cpl_vsil_win32.cpp module uses CreateFile() with ASCII filenames. It needs to be converted to use CreateFileW() and the other wide character functions for stat(), rename, mkdir, etc. A prototype implementation has already been developed.

== Linux / Unix / MacOS X ==

On modern Linux, Unix and MacOS operating systems the fopen(), stat() and readdir() functions already support UTF-8 strings. It is not currently anticipated that any work will be needed on Linux/Unix/MacOS X, though there is some question about this.

It is considered permissible under the definition of this RFC for old and substandard operating systems (WinCE?) to support only ASCII, not UTF-8, filenames.
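On the question raised under Filename Parsing: UTF-8 is defined so that every byte of a multi-byte sequence has its high bit set (values 0x80 and above), so the ASCII bytes for '/', '.', '\' and ':' cannot occur inside one. The Python sketch below checks that property over all encodable code points and then shows byte-level parsing on a UTF-8 path. The helper names (`utf8_multibyte_is_ascii_free`, `basename_bytes`, `extension_bytes`) are hypothetical and only illustrate the idea; they are not the CPL functions listed above.

```python
def utf8_multibyte_is_ascii_free() -> bool:
    """Return True if no non-ASCII code point encodes to a UTF-8
    sequence containing a byte below 0x80 (i.e. an ASCII byte)."""
    for cp in range(0x80, 0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogate code points cannot be encoded as UTF-8
        if min(chr(cp).encode("utf-8")) < 0x80:
            return False
    return True


def basename_bytes(path_utf8: bytes) -> bytes:
    """Hypothetical helper: return the part after the last '/' or '\\',
    working purely on bytes without decoding the path."""
    cut = max(path_utf8.rfind(b"/"), path_utf8.rfind(b"\\"))
    return path_utf8[cut + 1:]


def extension_bytes(path_utf8: bytes) -> bytes:
    """Hypothetical helper: return the text after the last '.' in the
    basename, or an empty byte string if there is none."""
    name = basename_bytes(path_utf8)
    dot = name.rfind(b".")
    return name[dot + 1:] if dot >= 0 else b""


if __name__ == "__main__":
    path = "data/\u4e2d\u6587/\u5f71\u50cf.tif".encode("utf-8")
    print(utf8_multibyte_is_ascii_free())
    print(basename_bytes(path).decode("utf-8"))
    print(extension_bytes(path))
```

Under that property, scanning a UTF-8 filename byte by byte for separators should give the same result as scanning it character by character, which suggests the parsing functions may not need a wide character conversion.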
== Python Changes ==

Review whether some GDAL/OGR Python functions (e.g. gdal.ReadDir()) should return a unicode string instead of a byte string, and whether gdal.Open() and related functions should handle unicode strings as input intelligently if they don't already.

== Utilities ==

== Commandline Issues ==

On Windows it may be very difficult to specify multi-byte filenames at the commandline. In my experience with a Chinese-named file, the commandline treats multi-byte characters as '?'. I'm not sure if this is a locale-dependent issue.

== File Formats ==

The proposed implementation really only addresses file format drivers that use VSIFOpenL() and related functions. Some drivers dependent on external libraries (e.g. netcdf) do not have a way to hook the file IO API and may not support UTF-8 filenames. It might be nice to be able to distinguish these. At the very least, any driver marked with GDAL_DCAP_VIRTUALIO as "YES" will support UTF-8. Perhaps this opportunity ought to be used to apply this driver metadata more uniformly.

== Test Suite ==

We will need to introduce some test suite tests with multibyte UTF-8 filenames.
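As a rough illustration of the shape such a test could take, the sketch below creates a file whose name contains multi-byte UTF-8 characters and checks that it can be listed, stat'ed and reopened through the same byte string. It deliberately uses only the Python standard library rather than assuming any particular behaviour of the GDAL Python bindings; a real test would exercise gdal.Open() or the VSI*L functions instead. The function name `roundtrip_utf8_filename` is hypothetical.

```python
import os
import tempfile


def roundtrip_utf8_filename(name: str) -> bool:
    """Create a file named `name` (encoded as UTF-8 bytes) in a temporary
    directory, then confirm it is listed, has the expected size and can be
    read back using that same UTF-8 byte-string path."""
    name_utf8 = name.encode("utf-8")
    with tempfile.TemporaryDirectory() as tmpdir:
        # Byte-string paths are passed through unchanged on POSIX; on
        # Windows, Python 3.6+ interprets them as UTF-8.
        dir_bytes = os.fsencode(tmpdir)
        path_utf8 = os.path.join(dir_bytes, name_utf8)

        with open(path_utf8, "wb") as f:
            f.write(b"test")

        listed = name_utf8 in os.listdir(dir_bytes)
        size_ok = os.stat(path_utf8).st_size == 4
        with open(path_utf8, "rb") as f:
            content_ok = f.read() == b"test"

    return listed and size_ok and content_ok


if __name__ == "__main__":
    print(roundtrip_utf8_filename("\u4e2d\u6587.tif"))
```

Working with byte-string paths keeps the check close to the premise of this RFC, that filenames crossing the API boundary are UTF-8 byte sequences. Filesystems that normalize names could report a different byte sequence for some characters, so the choice of test filename matters.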