| 1 | = RFC 30: Unicode Filenames = |
| 2 | |
| 3 | Authors: Frank Warmerdam[[BR]] |
| 4 | Contact: warmerdam@pobox.com[[BR]] |
| 5 | Status: Development |
| 6 | |
| 7 | == Summary == |
| 8 | |
| 9 | This document describes the steps needed to handle filenames as UTF-8 strings throughout GDAL/OGR. In brief, filenames passed into and returned by GDAL/OGR interfaces are assumed to be UTF-8. On some operating systems, notably Windows, this requires using the "wide character" interfaces in the low-level VSI*L API. |
| 10 | |
| 11 | == Key Interfaces == |
| 12 | |
| 13 | === VSI*L API === |
| 14 | |
| 15 | It is likely that only the cpl_vsil_win32.cpp implementation of these functions needs to be made UTF-8 aware. |
| 16 | |
| 17 | * VSIFOpenL() |
| 18 | * VSIFStatL() |
| 19 | * VSIReadDir() |
| 20 | * VSIMkdir() |
| 21 | * VSIRmdir() |
| 22 | * VSIUnlink() |
| 23 | * VSIRename() |
| 24 | |
| 25 | === Old (small file) VSI API === |
| 26 | |
| 27 | Do we need to convert the old API, which is based on the real fopen(), to support UTF-8 filenames? On Windows this might be hard. Perhaps we could take this opportunity to reduce its use to a minimum? |
| 28 | |
| 29 | * VSIFOpen() |
| 30 | * VSIStat() |
| 31 | |
| 32 | === Filename Parsing === |
| 33 | |
| 34 | Do we need to convert to UTF-16 to do parsing, or can we safely assume that special characters like '/', '.', '\' and ':' never occur as part of UTF-8 multi-byte sequences? |
| 35 | |
| 36 | * CPLGetPath() |
| 37 | * CPLGetDirname() |
| 38 | * CPLGetFilename() |
| 39 | * CPLGetBasename() |
| 40 | * CPLGetExtension() |
| 41 | |
| 42 | * CPLResetExtension() |
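
The parsing question above can be checked against the UTF-8 encoding itself: every byte of a multi-byte UTF-8 sequence has its high bit set (0x80 or above), so the ASCII bytes for '/', '.', '\' and ':' only ever stand for themselves. The following is a minimal Python sketch of byte-level parsing under that property. The helper names `split_utf8_path` and `multibyte_bytes_are_non_ascii` are hypothetical and are not part of GDAL; the sketch only illustrates why char-based functions such as CPLGetFilename() should not need a conversion to wide characters.

```python
# Hypothetical helpers for illustration only; not part of the GDAL/CPL API.
SEPARATORS = (0x2F, 0x5C)  # byte values of '/' and '\'


def split_utf8_path(path: bytes):
    """Split a UTF-8 encoded path into (directory, filename).

    Scans raw bytes backwards for the last separator, the way a
    char*-based parser would, without decoding to wide characters.
    """
    for i in range(len(path) - 1, -1, -1):
        if path[i] in SEPARATORS:
            return path[:i], path[i + 1:]
    return b"", path


def multibyte_bytes_are_non_ascii(text: str) -> bool:
    """Return True if no multi-byte UTF-8 sequence in text contains an ASCII byte."""
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1 and any(b < 0x80 for b in encoded):
            return False
    return True


path = "data/中文/文件.tif"
directory, filename = split_utf8_path(path.encode("utf-8"))
print(directory.decode("utf-8"), filename.decode("utf-8"))
```

Because the split happens on separator bytes that cannot appear inside a multi-byte sequence, both halves remain valid UTF-8 and can be decoded independently.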
| 43 | |
| 44 | === Other === |
| 45 | * CPLStat() |
| 46 | * CPLGetCurrentDir() |
| 47 | |
| 48 | == Windows == |
| 49 | |
| 50 | Currently the Windows cpl_vsil_win32.cpp module uses CreateFile() with ASCII filenames. It needs to be converted to use CreateFileW() and the other wide-character functions for stat(), rename(), mkdir(), etc. A prototype implementation has already been developed. |
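
The wide-character functions take UTF-16 strings, so each UTF-8 filename has to be converted before the call. This RFC does not say how the prototype performs the conversion; in C on Windows it is commonly done with MultiByteToWideChar() using CP_UTF8. The Python sketch below only illustrates the transformation itself, and the helper name `utf8_to_wide` is hypothetical.

```python
# Hypothetical helper for illustration only; not taken from cpl_vsil_win32.cpp.
def utf8_to_wide(name_utf8: bytes) -> bytes:
    """Convert a UTF-8 filename into a NUL-terminated UTF-16LE buffer.

    Raises UnicodeDecodeError if the input is not valid UTF-8, which is
    the point at which a caller could fall back or report an error.
    """
    text = name_utf8.decode("utf-8")
    return text.encode("utf-16-le") + b"\x00\x00"


wide = utf8_to_wide("文件.tif".encode("utf-8"))
print(len(wide))
```

Characters outside the Basic Multilingual Plane become surrogate pairs in UTF-16, so the wide buffer length cannot be derived from the UTF-8 byte count alone.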
| 51 | |
| 52 | == Linux / Unix / MacOS X == |
| 53 | |
| 54 | On modern Linux, Unix and MacOS X operating systems, the fopen(), stat() and readdir() functions already support UTF-8 strings. No work is currently anticipated on Linux/Unix/MacOS X, though there is some question about this. Under this RFC it is considered permissible for old or substandard operating systems (WinCE?) to support only ASCII, not UTF-8, filenames. |
| 55 | |
| 56 | == Python Changes == |
| 57 | |
| 58 | Review whether some GDAL/OGR Python functions (e.g. gdal.ReadDir()) should return a unicode string instead of a byte string, and check whether gdal.Open() and related functions need changes to handle unicode input strings intelligently. |
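
One possible shape for the Python side is sketched below, assuming the bindings normalize every filename to UTF-8 bytes on the way in and decode directory listings on the way out. Both helpers, `to_utf8_filename` and `decode_listing`, are hypothetical and do not exist in the GDAL Python bindings; whether the bindings should behave this way is exactly what this section proposes to review.

```python
# Hypothetical helpers for illustration only; not part of the GDAL Python bindings.
def to_utf8_filename(name):
    """Accept a unicode or byte string and return UTF-8 encoded bytes.

    Byte strings are validated as UTF-8 (UnicodeDecodeError on failure)
    and passed through unchanged; unicode strings are encoded.
    """
    if isinstance(name, bytes):
        name.decode("utf-8")  # validation only
        return name
    return name.encode("utf-8")


def decode_listing(entries):
    """Decode UTF-8 byte strings, as a directory listing might return them, to unicode."""
    return [entry.decode("utf-8") for entry in entries]


encoded = to_utf8_filename("文件.tif")
names = decode_listing([encoded, b"plain.tif"])
print(names)
```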
| 59 | |
| 60 | == Utilities == |
| 61 | |
| 62 | == Commandline Issues == |
| 63 | |
| 64 | On Windows it may be very difficult to specify multi-byte filenames at the command line. In my experience with a file that has a Chinese name, the command line treats multi-byte characters as '?'. I'm not sure whether this is a locale-dependent issue. |
| 65 | |
| 66 | == File Formats == |
| 67 | |
| 68 | The proposed implementation really only addresses file format drivers that use VSIFOpenL() and related functions. Some drivers that depend on external libraries (e.g. netCDF) have no way to hook the file I/O API and may not support UTF-8 filenames. It might be nice to be able to distinguish these. |
| 69 | At the very least, any driver marked with GDAL_DCAP_VIRTUALIO as "YES" will support UTF-8. Perhaps this opportunity ought to be used to apply this driver metadata more uniformly. |
| 70 | |
| 71 | == Test Suite == |
| 72 | |
| 73 | We will need to add tests with multi-byte UTF-8 filenames to the test suite. |