Opened 10 years ago

Closed 13 months ago

#3403 closed enhancement (wontfix)

Add character set information to data sources that have that information

Reported by: gaige Owned by: warmerdam
Priority: normal Milestone: closed_because_of_github_migration
Component: OGR_SF Version: svn-trunk
Severity: normal Keywords:
Cc:

Description

Some data sources have existing mechanisms (or strict definitions) for the character sets used in the data they contain. This enhancement adds optional dataset operations to get and set the character set for a dataset. By default, Set returns OGRERR_UNSUPPORTED_OPERATION and Get returns NULL.

Included in this package are implementations for:

  • KML
  • MITAB
  • GML
  • Tiger

-Gaige
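For readers without the attachment, here is a minimal sketch of the default behaviour the description mentions. It is not taken from cs-diff: the class and method names (`DataSourceSketch`, `GetCharacterSet`, `SetCharacterSet`) are hypothetical, and the error codes are local stand-ins assumed to mirror the values in GDAL's ogr_core.h.

```cpp
#include <cassert>
#include <string>

// Stand-ins so the sketch compiles without GDAL. In GDAL the type and the
// constants come from ogr_core.h; the numeric values here are assumed to match.
typedef int OGRErr;
const OGRErr OGRERR_NONE = 0;
const OGRErr OGRERR_UNSUPPORTED_OPERATION = 4;

// Hypothetical base class. Method names are illustrative, not from cs-diff.
class DataSourceSketch {
public:
    virtual ~DataSourceSketch() {}
    // Default Get: the format carries no character set information.
    virtual const char *GetCharacterSet() const { return nullptr; }
    // Default Set: the format cannot change its character set.
    virtual OGRErr SetCharacterSet(const char * /*name*/) {
        return OGRERR_UNSUPPORTED_OPERATION;
    }
};

// Hypothetical driver for a format that declares its encoding.
class DeclaredCharsetSourceSketch : public DataSourceSketch {
    std::string charset_;
public:
    explicit DeclaredCharsetSourceSketch(const std::string &cs) : charset_(cs) {}
    const char *GetCharacterSet() const override { return charset_.c_str(); }
    OGRErr SetCharacterSet(const char *name) override {
        assert(name != nullptr);  // a null name is a caller bug in this sketch
        charset_ = name;
        return OGRERR_NONE;
    }
};
```

With this shape, a caller can probe any data source and treat a null result from Get as "no declared character set", while only drivers that override the two methods report or accept one.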

Attachments (1)

cs-diff (5.6 KB) - added by gaige 10 years ago.

Change History (8)

Changed 10 years ago by gaige

Attachment: cs-diff added

comment:1 Changed 10 years ago by Even Rouault

I think this is a complex issue that should be discussed in an RFC, or by extending an existing RFC such as http://trac.osgeo.org/gdal/wiki/rfc23_ogr_unicode. The shapefile driver is also affected by encoding/charset issues. One possibility suggested by RFC23 would be to use UTF-8 as the universal pivot to convert from/to particular encodings.

I've a patch in my local tree to add iconv support to GDAL in addition to the basic re-encoding capabilities offered by port/cpl_recode_stub.cpp. I've also looked a bit at the shapefile driver but I haven't found enough test data in various encodings to be confident with the approach.
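To make the pivot idea concrete, here is a small sketch assuming ISO-8859-1 as the file encoding. The helper names `latin1_to_utf8` and `utf8_to_latin1` are hypothetical; this is not the code in port/cpl_recode_stub.cpp, only an illustration of converting to UTF-8 on read and back on write.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// ISO-8859-1 maps each byte directly to the code points U+0000..U+00FF,
// so every byte >= 0x80 becomes a two-byte UTF-8 sequence.
std::string latin1_to_utf8(const std::string &in) {
    std::string out;
    for (unsigned char c : in) {
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}

// Converts back from the pivot. Returns false when the text contains a code
// point above U+00FF (or a malformed sequence), i.e. something that
// ISO-8859-1 cannot represent.
bool utf8_to_latin1(const std::string &in, std::string &out) {
    out.clear();
    for (std::size_t i = 0; i < in.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(in[i]);
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else if ((c == 0xC2 || c == 0xC3) && i + 1 < in.size() &&
                   (static_cast<unsigned char>(in[i + 1]) & 0xC0) == 0x80) {
            const unsigned char c2 = static_cast<unsigned char>(in[i + 1]);
            out += static_cast<char>(((c & 0x03) << 6) | (c2 & 0x3F));
            ++i;
        } else {
            return false;
        }
    }
    return true;
}

// Usage: "cafe" with an acute accent survives the trip through the pivot.
void pivot_example() {
    const std::string latin1 = "caf\xE9";
    const std::string utf8 = latin1_to_utf8(latin1);
    std::string back;
    assert(utf8 == "caf\xC3\xA9");
    assert(utf8_to_latin1(utf8, back) && back == latin1);
}
```

The failure return on the way back is where the pivot becomes lossy: text that arrived from a richer encoding cannot always be written to a narrower one.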

comment:2 Changed 10 years ago by gaige

There are a couple of problems with the approach of using a central pivot encoding. Probably the most significant is that it hides the issue of graceful degradation from software that is interested in handling it. Especially with formats that have fixed-length fields, it's hard to tell how many characters will fit in the field when you're done. The other is that the declarations in the file might be wrong (sadly, I've run across this more often than I care to remember).

I can understand not wanting to implement this patch until this is philosophically dealt with.

I'd love to participate in some further discussion about these particular (and possibly peculiar) issues. The mechanism I chose was based on experience with many file formats providing inaccurate character set information that a translator could silently misinterpret (overlap between one-byte character sets is guaranteed), so I'd like to find some way to address that issue as well.

Thanks, -Gaige
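The overlap between one-byte character sets mentioned above can be illustrated with a short sketch. The `Charset` enum and `decode_byte` helper are hypothetical, and only ISO-8859-1 and the letter range of ISO-8859-5 are covered.

```cpp
#include <cassert>

// Two single-byte character sets that share the same byte values.
enum class Charset { Latin1, Iso8859_5 };

// Returns the Unicode code point for one byte under the given charset.
char32_t decode_byte(unsigned char b, Charset cs) {
    if (b < 0x80) return b;                // ASCII range is common to both
    if (cs == Charset::Latin1) return b;   // ISO-8859-1 maps bytes 1:1
    // ISO-8859-5: bytes 0xB0..0xEF are the Cyrillic letters U+0410..U+044F.
    if (b >= 0xB0 && b <= 0xEF) return 0x0410 + (b - 0xB0);
    return 0xFFFD;                         // remaining bytes omitted in this sketch
}

// Usage: the same byte is a valid letter under both declarations.
void overlap_example() {
    assert(decode_byte(0xE9, Charset::Latin1) == 0x00E9);     // e with acute
    assert(decode_byte(0xE9, Charset::Iso8859_5) == 0x0449);  // Cyrillic shcha
}
```

Because both decodings succeed, a converter has no way to detect a wrong declaration from the bytes alone, which is why an explicit override is useful.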

comment:3 Changed 10 years ago by warmerdam

I concur with Even that this seems to violate the approach adopted in RFC 23, and is too significant a change to make without broader discussion.

I do understand that some files improperly declare their character set, making it difficult to convert them to UTF-8. I wouldn't be averse to some mechanism to provide an override once the problem is understood clearly enough.

I don't really follow the issue with fixed-width fields and not knowing how long a string will be. There are mechanisms to turn UTF-8 into a form that indicates the exact number of characters if needed.
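One such mechanism, sketched here under the assumption of well-formed UTF-8 input, is to count code points by skipping continuation bytes. The helper name `utf8_length` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts code points in well-formed UTF-8: every byte that is not a
// continuation byte (bit pattern 10xxxxxx) starts a new character.
std::size_t utf8_length(const std::string &s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Usage: 4 characters, but 5 bytes of storage.
void length_example() {
    const std::string utf8 = "caf\xC3\xA9";
    assert(utf8_length(utf8) == 4);
    assert(utf8.size() == 5);
}
```

The character count and the byte count diverge as soon as non-ASCII text appears, and a fixed-width field is limited by bytes, not characters.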

comment:4 Changed 10 years ago by gaige

I'm happy to extend RFC23 via discussion for potential implementation in a future version in trunk.

The concerns that I think we need to address with any solution are:

  • Minimize implementation effort for GDAL library users (which the proposed pivot approach does)
  • Provide a standard mechanism to determine whether the file format in use supports multiple character sets for output and make that functionality available to the GDAL library user
  • Determine how these changes affect the OGRFieldDefn management of fixed-width fields (in particular, when writing localized character sets, it is possible to put more characters into a field than when using UTF-8 to represent those same characters)
  • Provide a way to gracefully override ambiguous or incorrectly set character sets (this might require some mechanism to ask the open call to "reinterpret" the data as another character set when it opens the file next, for example).
  • Consider some mechanism to turn off the pivot for applications that prefer to handle the problem themselves
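The fixed-width concern in the list above can be sketched as follows. The helper `fit_utf8_to_field` is hypothetical and assumes the field width is measured in bytes and the input is well-formed UTF-8; it shows one possible form of graceful degradation, truncating on a character boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Truncates UTF-8 text to at most `width` bytes without splitting a
// multi-byte character.
std::string fit_utf8_to_field(const std::string &s, std::size_t width) {
    if (s.size() <= width) return s;
    std::size_t end = width;
    // If the cut would land on a continuation byte (10xxxxxx), back up to
    // the lead byte of that character and drop the whole character.
    while (end > 0 && (static_cast<unsigned char>(s[end]) & 0xC0) == 0x80) {
        --end;
    }
    return s.substr(0, end);
}

// Usage: three accented letters need 6 bytes in UTF-8 but only 3 in
// ISO-8859-1, so a 5-byte field keeps two of them in UTF-8.
void field_example() {
    const std::string utf8 = "\xC3\xA9\xC3\xA9\xC3\xA9";
    assert(fit_utf8_to_field(utf8, 5) == "\xC3\xA9\xC3\xA9");
}
```

An application that writes a localized one-byte character set could store all three letters in the same field, which is the capacity difference the third bullet describes.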

Do we want to shut down this ticket, or change it to some other state and then move this discussion to the wiki?

Thanks, -Gaige

comment:5 Changed 5 years ago by Jukka Rahkonen

This ticket contains a patch for KML - MITAB - GML - Tiger. If the conclusion was that some other kind of implementation would be better and patches will not be applied, wouldn't it be better to close this ticket and start a new discussion?

comment:6 Changed 5 years ago by Even Rouault

Milestone: 1.8.1

Removing obsolete milestone

comment:7 Changed 13 months ago by Even Rouault

Milestone: closed_because_of_github_migration
Resolution: wontfix
Status: new → closed

This ticket has been automatically closed because Trac is no longer used for GDAL bug tracking, since the project has migrated to GitHub. If you believe this ticket is still valid, you may file it at https://github.com/OSGeo/gdal/issues if it is not already reported there.
