Version 5 (modified by warmerdam, 14 years ago)

RFC 30: Unicode Filenames

Authors: Frank Warmerdam
Contact: warmerdam@…
Status: Development

Summary

This document describes steps to generally handle filenames as UTF-8 strings in GDAL/OGR. In brief it will be assumed that filenames passed into and returned by GDAL/OGR interfaces are UTF-8. On some operating systems, notably Windows, this will require use of "wide character" interfaces in the low level VSI*L API.

Key Interfaces

VSI*L API

Likely it is only the cpl_vsil_win32.cpp implementation of these functions that needs to be made UTF-8 aware.

  • VSIFOpenL()
  • VSIFStatL()
  • VSIReadDir()
  • VSIMkdir()
  • VSIRmdir()
  • VSIUnlink()
  • VSIRename()

Old (small file) VSI API

Do we need to convert the old API (based on real fopen()) to support UTF-8 filenames? On Windows this might be hard. Perhaps we could take this opportunity to reduce its use to a minimum?

  • VSIFOpen()
  • VSIStat()

Filename Parsing

Do we need to convert to UTF-16 (wide characters) to do parsing, or can we safely assume that special characters like '/', '.', '\' and ':' never occur as part of UTF-8 multi-byte sequences?

  • CPLGetPath()
  • CPLGetDirname()
  • CPLGetFilename()
  • CPLGetBasename()
  • CPLGetExtension()

  • CPLResetExtension()
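
On the parsing question: UTF-8 encodes every non-ASCII character using only bytes in the range 0x80 to 0xFF, so a byte equal to '/', '\', '.' or ':' always stands for that ASCII character. The sketch below checks this for one sample name and then splits a UTF-8 path bytewise. Both function names are hypothetical helpers for illustration, not CPL functions, and the splitting logic is a simplification of what CPLGetPath() and friends would do.

```python
def has_ascii_inside_multibyte(text):
    """Return True if any non-ASCII character encodes to a byte below 0x80."""
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1 and any(b < 0x80 for b in encoded):
            return True
    return False


def split_utf8_path(path_utf8):
    """Split UTF-8 bytes into (directory, basename, extension) bytewise."""
    # Look for the last path separator without decoding the string.
    slash = max(path_utf8.rfind(b"/"), path_utf8.rfind(b"\\"))
    directory = path_utf8[:slash] if slash >= 0 else b""
    filename = path_utf8[slash + 1:]
    dot = filename.rfind(b".")
    if dot < 0:
        return directory, filename, b""
    return directory, filename[:dot], filename[dot + 1:]


sample = "dir/\u4e2d\u6587.\u4e2d\u6587"
parts = split_utf8_path(sample.encode("utf-8"))
# Each piece is still valid UTF-8 and decodes cleanly.
decoded = tuple(p.decode("utf-8") for p in parts)
```

If this holds for all input, the parsing functions can keep treating filenames as plain byte strings.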

Other

  • CPLStat()
  • CPLGetCurrentDir()

Windows

Currently the Windows cpl_vsil_win32.cpp module uses CreateFile() with ASCII filenames. It needs to be converted to use CreateFileW() and the other wide character functions for stat(), rename(), mkdir(), etc. A prototype implementation has already been developed (r20620).
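
The string conversion this implies can be sketched as follows: the UTF-8 filename is re-encoded as NUL-terminated UTF-16, the form the wide character Windows functions take. This is illustrative Python, not the prototype's C++ code, and `utf8_to_wide` is a hypothetical name. The real implementation would presumably do the equivalent conversion in C before calling CreateFileW().

```python
def utf8_to_wide(name_utf8):
    """Re-encode UTF-8 bytes as NUL-terminated UTF-16LE bytes."""
    # Decoding raises UnicodeDecodeError if the input is not valid UTF-8.
    text = name_utf8.decode("utf-8")
    return text.encode("utf-16-le") + b"\x00\x00"


wide = utf8_to_wide("\u4e2d\u6587.tif".encode("utf-8"))
# Number of 16-bit code units, excluding the terminator.
code_units = len(wide) // 2 - 1
```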

Linux / Unix / MacOS X

On modern Linux, Unix and MacOS X operating systems the fopen(), stat() and readdir() functions already support UTF-8 strings. No work is currently anticipated on Linux/Unix/MacOS X, though there is some question about this. Under this RFC it is considered permissible for old or substandard operating systems (WinCE?) to support only ASCII filenames, not UTF-8.

Python Changes

I observe with Python 2.6 that functions like gdal.Open() do not accept unicode strings, but they do accept UTF-8 string objects. One possible solution is to update the bindings in selected places to detect unicode strings passed in and transform them to UTF-8 strings.

e.g.

filename = u'xx\u4E2D\u6587.\u4E2D\u6587'
# Python 2: encode unicode objects to UTF-8 byte strings before passing them on.
if isinstance(filename, unicode):
    filename = filename.encode('utf-8')

I'm not sure what the easiest way is to accomplish this in the bindings. The key entries are:

Note that failure to address these would only be an inconvenience, requiring applications to do the UTF-8 conversion themselves.

In theory, functions that return filenames, such as gdal.ReadDir(), should return unicode strings, but from my perspective it seems adequate to always return UTF-8 strings and let the application translate if needed.
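
A minimal sketch of both directions, written for Python 3 where text is `str` and encoded data is `bytes` (in Python 2.6 the corresponding types are `unicode` and `str`). `to_utf8` and `from_utf8` are hypothetical helper names, not part of the GDAL bindings. They only show what the bindings or an application would need to do.

```python
def to_utf8(filename):
    """Encode text filenames to UTF-8 bytes; pass bytes through unchanged."""
    if isinstance(filename, str):
        return filename.encode("utf-8")
    return filename


def from_utf8(names):
    """Decode a list of UTF-8 byte filenames into text strings."""
    return [name.decode("utf-8") for name in names]


# Text going in is encoded once; already-encoded names are left alone.
encoded = to_utf8("xx\u4e2d\u6587.\u4e2d\u6587")
# Names coming back (for example from a directory listing) are decoded.
round_trip = from_utf8([encoded])
```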

Java / C# / Perl Changes

I am not familiar enough with these environments to know if any changes are desirable to support unicode filenames more smoothly.

Utilities

Commandline Issues

On Windows it may be very difficult to specify multi-byte filenames at the command line. In my experience with a Chinese-named file, the command line treats multi-byte characters as '?'. I'm not sure if this is a locale-dependent issue.

File Formats

The proposed implementation really only addresses file format drivers that use VSIFOpenL() and related functions. Some drivers that depend on external libraries (e.g. netcdf) have no way to hook the file IO API and may not support UTF-8 filenames. It might be nice to be able to distinguish these. At the very least, any driver marked with GDAL_DCAP_VIRTUALIO as "YES" will support UTF-8. Perhaps this opportunity ought to be used to apply this driver metadata more uniformly.
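
One way to make that distinction is to filter on the metadata item. The sketch below works on plain dictionaries so that it runs without GDAL installed. The two driver entries are illustrative sample data based on the drivers named in this RFC, and the key is written as "DCAP_VIRTUALIO", the string value of the GDAL_DCAP_VIRTUALIO macro. `supports_virtual_io` is a hypothetical helper name.

```python
VIRTUALIO_KEY = "DCAP_VIRTUALIO"


def supports_virtual_io(metadata):
    """Return True if driver metadata marks virtual IO support as YES."""
    return metadata.get(VIRTUALIO_KEY) == "YES"


# Illustrative metadata only; real values come from the driver itself.
sample_drivers = {
    "BSB": {"DCAP_VIRTUALIO": "YES"},
    "netCDF": {},
}

utf8_capable = sorted(
    name for name, md in sample_drivers.items() if supports_virtual_io(md)
)
```

With the Python bindings, the dictionary for a real driver would come from the driver object's GetMetadata() call.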

As part of this effort the following drivers have been marked as supporting virtualio (VSI*L API) after testing, or upgraded to support it: BSB, PDS, ISIS2, ISIS3, DIMAP, AIG, XPM, CEOS, SAR_CEOS, SDTS, FIT, GRIB, EIR, LAN, INGR.

Test Suite

We will need to introduce some test suite tests with multibyte UTF-8 filenames. In support of that, aspects of the VSI*L API (particularly the rename, mkdir and rmdir functions, and VSIFOpenL itself) have been exposed in Python.
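
The shape of such a test might look like the sketch below. It uses Python's own os module as a stand-in for the exposed VSI*L functions, so it shows the sequence of operations rather than the actual test suite code. `run_utf8_filename_checks` is a hypothetical name, and paths are passed as UTF-8 bytes to match the convention this RFC assumes.

```python
import os
import shutil
import tempfile


def run_utf8_filename_checks():
    """Exercise mkdir, open, readdir, rename, unlink and rmdir with a multibyte name."""
    root = os.fsencode(tempfile.mkdtemp())
    name = "\u4e2d\u6587".encode("utf-8")
    results = {}
    try:
        subdir = os.path.join(root, name)
        os.mkdir(subdir)

        first = os.path.join(subdir, name + b".txt")
        with open(first, "wb") as f:
            f.write(b"test")
        # A bytes path makes listdir return bytes names.
        results["listed"] = os.listdir(subdir) == [name + b".txt"]

        second = os.path.join(subdir, name + b"2.txt")
        os.rename(first, second)
        results["renamed"] = os.path.exists(second) and not os.path.exists(first)
        results["size"] = os.stat(second).st_size

        os.unlink(second)
        os.rmdir(subdir)
        results["removed"] = not os.path.exists(subdir)
    finally:
        # Clean up even if one of the steps above failed.
        shutil.rmtree(root, ignore_errors=True)
    return results


results = run_utf8_filename_checks()
```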
