Changes between Version 5 and Version 6 of rfc30_utf8_filenames


Timestamp:
Sep 21, 2010, 3:09:21 PM (14 years ago)
Author:
warmerdam
Comment:

various updates based on feedback.

  • rfc30_utf8_filenames

    v5 v6  
    3232=== Filename Parsing ===
    3333
    34 Do we need to convert to UCS-16 to do parsing or can we safely assume that special characters like '/', '.', '\' and ':' never occur as part of UTF-8 multi-byte sequences? 
     34Because the path/extension delimiter characters '.', '\', '/' and ':' are ASCII, and every byte of a multi-byte UTF-8 sequence has its high bit set, these delimiters will never appear in the non-ASCII portion of UTF-8 strings.  We can therefore safely leave the existing path parsing functions working as they do now.  They do not need to be aware of the real character boundaries for exotic characters in UTF-8 paths.  The following will be left unchanged.
    3535
    3636 * CPLGetPath()
     
    3939 * CPLGetBasename()
    4040 * CPLGetExtension()
    41  
    4241 * CPLResetExtension()
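
The byte-level argument above can be checked with a small sketch. This is not GDAL code: the helper name `multibyte_bytes_are_non_ascii` and the sample path are hypothetical, and the slicing only imitates what byte-oriented functions such as CPLGetBasename() and CPLGetExtension() are described as doing.

```python
def multibyte_bytes_are_non_ascii(text):
    """Return True if every byte of every multi-byte UTF-8 sequence is >= 0x80."""
    for ch in text:
        seq = ch.encode("utf-8")
        if len(seq) > 1 and any(b < 0x80 for b in seq):
            return False
    return True


# Hypothetical Windows-style path with accented and CJK characters.
sample = "C:\\donn\u00e9es\\\u6587\u4ef6.tif"
encoded = sample.encode("utf-8")

# Byte-oriented parsing: search for the delimiter bytes directly,
# without decoding or tracking character boundaries.
last_dot = encoded.rfind(b".")
last_sep = encoded.rfind(b"\\")
extension = encoded[last_dot + 1:].decode("utf-8")
basename = encoded[last_sep + 1:last_dot].decode("utf-8")
```

Because no delimiter byte can occur inside a multi-byte sequence, the slices always fall on character boundaries and decode cleanly.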
    4342
    4443=== Other ===
     44
    4545 * CPLStat()
    4646 * CPLGetCurrentDir()
     47 * GDALDataset::GetFileList()
     48
     49These will all also need to treat filenames as utf-8.
    4750
    4851== Windows ==
     
    5356
    5457On modern Linux, Unix and MacOS X operating systems the fopen(), stat() and readdir() functions already support UTF-8 strings.  It is not currently anticipated that any work will be needed on Linux/Unix/MacOS X, though there is some question about this.  Under this RFC it is considered permissible for old or substandard operating systems (WinCE?) to support only ASCII, not UTF-8, filenames.
     58
     59== Metadata ==
     60
     61There are a variety of places where general text may contain filenames.  One obvious case is the subdataset filenames returned from the SUBDATASET domain.  Previously these were just exposed as plain text and interpretation of the character set was undefined.  As part of this RFC we state that such filenames should be considered to be in utf-8 format. 
    5562
    5663== Python Changes ==
     
    7582 * gdal.Unlink()
    7683
    77 Note that failure to address these would only be an inconvenience requiring the applications to do the utf-8 conversion themselves.
     84Similarly all interfaces (e.g. gdal.ReadDir()) that return filenames will hereafter return unicode objects rather than string objects.
    7885
    79 In theory functions that return filenames, such as gdal.ReadDir() should return unicode strings for filenames, but from my perspective it seems adequate to always return utf-8 strings and let the application translate if needed.
     86Also note that in Python 3.x strings are always unicode.
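
As a rough illustration of the conversion involved, the sketch below uses Python 3 and two hypothetical helpers, `filename_from_c` and `filename_to_c`, which are not part of the GDAL bindings. It assumes the C layer hands back UTF-8 encoded bytes.

```python
def filename_from_c(raw):
    """Decode UTF-8 bytes, as a C API would return them, into a text string."""
    return raw.decode("utf-8")


def filename_to_c(name):
    """Encode a text string as the UTF-8 bytes a C API would expect."""
    return name.encode("utf-8")


# "caf\u00e9.tif" as UTF-8 bytes: the accented character takes two bytes.
raw = b"caf\xc3\xa9.tif"
name = filename_from_c(raw)
round_trip = filename_to_c(name)
```

The text string has one element per character while the byte string has one per byte, which is why the application sees a different length before and after decoding.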
    8087
    81 == Java / C# / perl Changes ==
     88== C# Changes ==
    8289
    83 I am not familiar enough with these environments to know if any changes are desirable to support unicode filenames more smoothly.
     90Tamas notes that in C# we normally convert the unicode C# strings into C strings with the PtrToStringAnsi marshaller.  Presumably we will need to use a utf-8 converter for all interface strings considered to be filenames.  I would note this should also apply to OGR string attribute values, which are also intended to be treated as utf-8.
    8491
    85 == Utilities ==
     92(It is unclear who will take care of this aspect since the primary author (FrankW) is not C#-binding-competent.)
     93
     94== perl Changes ==
     95
     96Still discussing with Ari whether any special conversion is needed for the perl bindings or not.
     97
     98== Java Changes ==
     99
     100No changes are needed for Java.  Java strings are unicode, and they are already converted to utf-8 in the java swig bindings.
     101That is, the java bindings already assumed passing and receiving utf-8 strings to/from GDAL/OGR.
    86102
    87103== Commandline Issues ==
    88104
    89 On Windows it may be very difficult to specify multi-byte filenames at the commandline.  In my experience with a Chinese named file, the commandline treats multi-byte characters as '?'.  I'm not sure if this is a locale dependent issue.
     105On Windows argv[] as passed into main() will not generally be able to represent exotic filenames that can't be represented in the locale charset.  It is possible to fetch the commandline and parse it as wide characters using GetCommandLineW() and CommandLineToArgvW() to capture UTF-16 filenames (easily converted to utf-8); however, this interferes with the use of setargv.obj to expand wildcards on Windows.
     106
     107I have not been able to come up with a good solution, so for now I am not intending to make any changes to the GDAL/OGR commandline utilities to allow passing exotic filenames.  So this RFC is mainly aimed at ensuring that other applications using GDAL/OGR can utilize exotic filenames.
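
For reference, the wide-character to utf-8 conversion mentioned above is mechanical. The sketch below is not Windows API code: `utf16le_to_utf8` is a hypothetical helper, and it assumes the wide string arrives as little-endian UTF-16 bytes, which is how Windows stores wide characters.

```python
def utf16le_to_utf8(wide_bytes):
    """Re-encode little-endian UTF-16 bytes as UTF-8 bytes."""
    return wide_bytes.decode("utf-16-le").encode("utf-8")


# Stand-in for a filename captured as wide characters.
wide = "\u6587\u4ef6.tif".encode("utf-16-le")
utf8 = utf16le_to_utf8(wide)
```

The conversion itself is not the obstacle; the wildcard expansion conflict described above is.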
    90108
    91109== File Formats ==
    92110
    93111The proposed implementation really only addresses file format drivers that use VSIFOpenL() and related functions.  Some drivers dependent on external libraries (e.g. netcdf) do not have a way to hook the file IO API and may not support utf-8 filenames.  It might be nice to be able to distinguish these.
    94 At the very least any driver marked with GDAL_DCAP_VIRTUALIO as "YES" will support UTF-8.  Perhaps this opporunity ought to be used to more uniformly apply this driver metadata.
     112
     113At the very least any driver marked with GDAL_DCAP_VIRTUALIO as "YES" will support UTF-8.  Perhaps this opportunity ought to be used to more uniformly apply this driver metadata.
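
An application could use this metadata item as a conservative test, sketched below. The helper `supports_virtual_io` is hypothetical and works on a plain dictionary of driver metadata so that it runs without GDAL installed. The key "DCAP_VIRTUALIO" is the string value of the GDAL_DCAP_VIRTUALIO macro, and the sample metadata is illustrative only, not a record of what any driver actually reports.

```python
def supports_virtual_io(driver_metadata):
    """Return True if the metadata marks the driver as using the VSI*L API."""
    return driver_metadata.get("DCAP_VIRTUALIO") == "YES"


# Illustrative metadata only; real values would come from each driver.
drivers = {
    "BSB": {"DCAP_VIRTUALIO": "YES"},
    "netCDF": {},
}

utf8_capable = sorted(
    name for name, metadata in drivers.items() if supports_virtual_io(metadata)
)
```

A driver without the item is not necessarily unable to handle utf-8 filenames; the item only identifies drivers for which support can be assumed.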
    95114
    96115As part of this effort the following drivers have been marked as supporting virtualio (VSI*L API) after testing, or upgraded to support it: BSB, PDS, ISIS2, ISIS3, DIMAP, AIG, XPM, CEOS, SAR_CEOS, SDTS, FIT, GRIB, EIR, LAN, INGR.