Opened 5 years ago

Closed 3 years ago

#4971 closed defect (fixed)

C# Swig bindings not handling UTF-8 to UTF-16 conversion

Reported by: rburhum
Owned by: tamas
Priority: normal
Milestone:
Component: CSharpBindings
Version: 1.9.2
Severity: normal
Keywords: unicode, csharp, utf
Cc: tamas, warmerdam

Description

While fixing a bug with the GDAL/OGR plugin for ArcGIS, I noticed that strings returned by the C# Swig bindings did not seem correct. The strings that come out of the GDAL API are UTF-8, but the strings that come out of the C# bindings should be converted to UTF-16 equivalents. This is not done properly and we get odd results.

Analysis:

My underlying datastore is a spatialite file (attached). It has strings in UTF8. To verify this, I did:

spatialite> select gid, test_txt from yolo;
705507|propriétaire
704391|abc
spatialite>

Notice 705507 has the correct character displayed. To verify the hex utf values, I did:

spatialite> select gid, hex(test_txt) from yolo;
705507|70726F707269C3A97461697265
704391|616263

Upon inspection of the UTF8 table at http://www.utf8-chartable.de/ I could figure out that the string contains the following:

70 p
72 r
6F o
70 p
72 r
69 i
C3       <---- C3A9 is the multibyte UTF8 char for é
A9 é
74 t
61 a
69 i
72 r
65 e
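
The same check can be done in code instead of by table lookup. This is a minimal sketch that decodes the hex dump from spatialite as UTF-8; HexCheck and FromHex are hypothetical names used only for illustration:

```csharp
using System;
using System.Text;

static class HexCheck
{
    // Hypothetical helper: turns a hex string such as "C3A9" into bytes.
    public static byte[] FromHex(string hex)
    {
        var bytes = new byte[hex.Length / 2];
        for (int i = 0; i < bytes.Length; i++)
            bytes[i] = Convert.ToByte(hex.Substring(2 * i, 2), 16);
        return bytes;
    }

    public static void Main()
    {
        // Hex value reported by spatialite for gid 705507.
        byte[] raw = FromHex("70726F707269C3A97461697265");
        string text = Encoding.UTF8.GetString(raw);

        Console.WriteLine(raw.Length);            // 13 bytes of UTF-8
        Console.WriteLine(text.Length);           // 12 UTF-16 code units
        Console.WriteLine((int)text[6] == 0xE9);  // True: C3 A9 decodes to U+00E9
    }
}
```

Thirteen bytes decode to twelve characters because the pair C3 A9 collapses into the single character é.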

On the C# side, the string I get back from feature.GetFieldAsString contains the wrong value:

feature.GetFieldAsString(ogrIndex)
"propriÃ©taire"

C# strings are supposed to be UTF16 strings according to the .NET documentation. Inspecting the string returned as bytes I find:

byte[] utf16Bytes = Encoding.Unicode.GetBytes(feature.GetFieldAsString(ogrIndex));
{byte[26]}
    [0]: 112
    [1]: 0
    [2]: 114
    [3]: 0
    [4]: 111
    [5]: 0
    [6]: 112
    [7]: 0
    [8]: 114
    [9]: 0
    [10]: 105
    [11]: 0
    [12]: 195
    [13]: 0
    [14]: 169
    [15]: 0
    [16]: 116
    [17]: 0
    [18]: 97
    [19]: 0
    [20]: 105
    [21]: 0
    [22]: 114
    [23]: 0
    [24]: 101
    [25]: 0

By looking at the UTF16 table at http://www.fileformat.info/info/charset/UTF-16/list.htm

or better yet, as hex values, I can see that the string characters look like this:

-		utf16Bytes	{byte[0x0000001a]}	byte[]
		[0x00000000]	0x70	byte      p
		[0x00000001]	0x00	byte
		[0x00000002]	0x72	byte      r
		[0x00000003]	0x00	byte
		[0x00000004]	0x6f	byte      o
		[0x00000005]	0x00	byte
		[0x00000006]	0x70	byte      p
		[0x00000007]	0x00	byte
		[0x00000008]	0x72	byte      r
		[0x00000009]	0x00	byte
		[0x0000000a]	0x69	byte      i
		[0x0000000b]	0x00	byte
		[0x0000000c]	0xc3	byte      Ã    <-- 00c3 is Ã
		[0x0000000d]	0x00	byte
		[0x0000000e]	0xa9	byte      ©    <-- 00A9 is ©
		[0x0000000f]	0x00	byte
		[0x00000010]	0x74	byte      t
		[0x00000011]	0x00	byte
		[0x00000012]	0x61	byte      a
		[0x00000013]	0x00	byte
		[0x00000014]	0x69	byte      i
		[0x00000015]	0x00	byte
		[0x00000016]	0x72	byte      r
		[0x00000017]	0x00	byte
		[0x00000018]	0x65	byte      e
		[0x00000019]	0x00	byte

That explains why I am seeing these characters. The multibyte UTF-8 sequence C3A9 (é) should have been converted to the single UTF-16 char 00E9 (é); instead it was split into *two* UTF-16 chars, 00C3 (Ã) and 00A9 (©).
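
The split can be reproduced in isolation. The sketch below assumes that each UTF-8 byte is widened to one UTF-16 code unit of the same value. That assumption matches the dump above, but it is not a confirmed description of what happens inside the SWIG-generated code. MisdecodeDemo and WidenBytes are hypothetical names:

```csharp
using System;
using System.Text;

static class MisdecodeDemo
{
    // Assumed faulty behaviour: every byte becomes one char with the same value.
    public static string WidenBytes(byte[] raw)
    {
        var chars = new char[raw.Length];
        for (int i = 0; i < raw.Length; i++)
            chars[i] = (char)raw[i];
        return new string(chars);
    }

    public static void Main()
    {
        byte[] utf8 = { 0xC3, 0xA9 };                  // UTF-8 encoding of é

        string wrong = WidenBytes(utf8);               // two chars: U+00C3, U+00A9
        string right = Encoding.UTF8.GetString(utf8);  // one char:  U+00E9

        Console.WriteLine(wrong.Length);                    // 2
        Console.WriteLine(right.Length);                    // 1
        Console.WriteLine(((int)right[0]).ToString("X4"));  // 00E9
    }
}
```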

The fix should be a one-liner in the SWIG bindings, but I am not smart enough to figure out where this happens in the SWIG pipeline.

Change History (8)

comment:1 Changed 5 years ago by rburhum

Cannot attach the file, but you can download it from my Dropbox https://dl.dropbox.com/u/4779803/gdal-ogrplugin/bugs/ogrtest_utf.sqlite

comment:2 Changed 5 years ago by rburhum

Cc: tamas added

comment:3 Changed 5 years ago by warmerdam

Cc: warmerdam added

comment:4 Changed 5 years ago by rburhum

For anybody interested in a workaround that will make it work correctly until the fix can be written in the swig bindings themselves, you can look at my "ghetto string workaround" :)

https://github.com/RBURHUM/arcgis-ogr/commit/341a701b0c1a096f4da304d7965403b7618d55e9
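
For readers who cannot follow the link, here is a minimal sketch of one possible caller-side workaround. It is not necessarily what the linked commit does. It assumes each UTF-8 byte arrived as a char of the same value (U+0000 to U+00FF). If the binding instead decoded through the system ANSI code page, that encoding would have to be used in place of ISO-8859-1. Utf8Repair and FixMisdecodedUtf8 are hypothetical names, not part of the GDAL/OGR bindings:

```csharp
using System;
using System.Text;

static class Utf8Repair
{
    // Hypothetical helper: undoes a byte-per-char mis-decoding of UTF-8 data.
    public static string FixMisdecodedUtf8(string s)
    {
        // ISO-8859-1 maps U+0000..U+00FF one-to-one onto bytes 0x00..0xFF,
        // so this recovers the original UTF-8 byte sequence.
        byte[] raw = Encoding.GetEncoding("ISO-8859-1").GetBytes(s);
        return Encoding.UTF8.GetString(raw);
    }

    public static void Main()
    {
        string wrong = "propri\u00C3\u00A9taire";  // what the binding returned
        string repaired = FixMisdecodedUtf8(wrong);

        Console.WriteLine(repaired == "propri\u00E9taire");  // True
        Console.WriteLine(repaired.Length);                  // 12
    }
}
```

In the scenario from this ticket the call would wrap the binding result, for example FixMisdecodedUtf8(feature.GetFieldAsString(ogrIndex)). It should only be applied to strings known to be mis-decoded: a string that is already correct and contains non-ASCII characters would be damaged by the round trip.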

comment:5 Changed 4 years ago by tamas

Fixed in trunk (r25982)

comment:6 Changed 4 years ago by rburhum

Resolution: fixed
Status: changed from new to closed

Sweet. Thank you Tamas!

comment:7 Changed 4 years ago by tamas

Resolution: fixed
Status: changed from closed to reopened

I am thinking of backporting the fix to the latest stable branch as well.

comment:8 Changed 3 years ago by Jukka Rahkonen

Resolution: fixed
Status: changed from reopened to closed

What was trunk then is the latest stable branch now, so the passage of time has taken care of this part of the issue.
