id summary reporter owner description type status priority milestone component version severity resolution keywords cc
4971 C# Swig bindings not handling UTF-8 to UTF-16 conversion rburhum tamas "While fixing a bug with the GDAL/OGR plugin for ArcGIS, I noticed that strings returned by the C# Swig bindings did not seem correct. The strings that come out of the GDAL API are UTF-8, but the strings that come out of the C# bindings should be converted to their UTF-16 equivalents. This is not done properly and we get odd results.

Analysis:

My underlying datastore is a spatialite file (attached). It stores strings as UTF-8. To verify this, I did:
{{{
spatialite> select gid, test_txt from yolo;
705507|propriétaire
704391|abc
spatialite>
}}}

Notice 705507 has the correct character displayed. To verify the hex UTF-8 values, I did:
{{{
spatialite> select gid, hex(test_txt) from yolo;
705507|70726F707269C3A97461697265
704391|616263
}}}

Upon inspection of the UTF-8 table at http://www.utf8-chartable.de/ I could figure out that the string contains the following:
{{{
70 p
72 r
6F o
70 p
72 r
69 i
C3 <---- C3A9 is the multibyte UTF8 char for é
A9 é
74 t
61 a
69 i
72 r
65 e
}}}

On the C# side, the string I get from feature.GetFieldAsString contains the wrong value:
{{{
feature.GetFieldAsString(ogrIndex)
""propriétaire""
}}}

C# strings are supposed to be UTF-16 strings according to the .NET documentation.
Inspecting the returned string as bytes, I find:
{{{
byte[] utf16Bytes = Encoding.Unicode.GetBytes(feature.GetFieldAsString(ogrIndex));
{byte[26]}
[0]: 112
[1]: 0
[2]: 114
[3]: 0
[4]: 111
[5]: 0
[6]: 112
[7]: 0
[8]: 114
[9]: 0
[10]: 105
[11]: 0
[12]: 195
[13]: 0
[14]: 169
[15]: 0
[16]: 116
[17]: 0
[18]: 97
[19]: 0
[20]: 105
[21]: 0
[22]: 114
[23]: 0
[24]: 101
[25]: 0
}}}

By looking at the UTF-16 table at http://www.fileformat.info/info/charset/UTF-16/list.htm or, better yet, at the hex values, I can see that the string characters look like this:
{{{
- utf16Bytes {byte[0x0000001a]} byte[]
[0x00000000] 0x70 byte p
[0x00000001] 0x00 byte
[0x00000002] 0x72 byte r
[0x00000003] 0x00 byte
[0x00000004] 0x6f byte o
[0x00000005] 0x00 byte
[0x00000006] 0x70 byte p
[0x00000007] 0x00 byte
[0x00000008] 0x72 byte r
[0x00000009] 0x00 byte
[0x0000000a] 0x69 byte i
[0x0000000b] 0x00 byte
[0x0000000c] 0xc3 byte Ã <-- 00C3 is Ã
[0x0000000d] 0x00 byte
[0x0000000e] 0xa9 byte © <-- 00A9 is ©
[0x0000000f] 0x00 byte
[0x00000010] 0x74 byte t
[0x00000011] 0x00 byte
[0x00000012] 0x61 byte a
[0x00000013] 0x00 byte
[0x00000014] 0x69 byte i
[0x00000015] 0x00 byte
[0x00000016] 0x72 byte r
[0x00000017] 0x00 byte
[0x00000018] 0x65 byte e
[0x00000019] 0x00 byte
}}}

That explains why I am seeing these characters. The problem should be obvious: the multibyte UTF-8 sequence C3A9 (é) should have been converted to the single UTF-16 char 00E9 (é), and instead it was split into *two* UTF-16 chars, 00C3 and 00A9, one per input byte.

The fix should be a one-liner in the SWIG bindings, but I am not smart enough to figure out where this happens in the SWIG pipeline." defect closed normal CSharpBindings 1.9.2 normal fixed unicode, csharp, utf tamas warmerdam
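
The symptom described in the ticket can be reproduced without GDAL. The sketch below is only an illustration under stated assumptions: `Utf8Marshal` and `PtrToStringUtf8` are hypothetical names, not part of the GDAL/OGR bindings, and the ticket does not say how the bindings were actually changed. The program decodes the same 13 bytes from the spatialite dump twice: once byte-by-byte as ISO-8859-1 (assumed here to stand in for the single-byte decoding that produces `Ã©`), and once as UTF-8 from a NUL-terminated native buffer.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

// Hypothetical helper for illustration; NOT part of the GDAL/OGR bindings.
public static class Utf8Marshal
{
    // Decode a NUL-terminated UTF-8 buffer owned by native code into a .NET (UTF-16) string.
    public static string PtrToStringUtf8(IntPtr ptr)
    {
        if (ptr == IntPtr.Zero) return null;

        // Find the terminating NUL to learn the byte length.
        int len = 0;
        while (Marshal.ReadByte(ptr, len) != 0) len++;

        // Copy the bytes into managed memory and decode them as UTF-8.
        byte[] buffer = new byte[len];
        Marshal.Copy(ptr, buffer, 0, len);
        return Encoding.UTF8.GetString(buffer);
    }

    public static void Main()
    {
        // "propriétaire" as UTF-8 (bytes from the ticket), plus a trailing NUL.
        byte[] utf8 = { 0x70, 0x72, 0x6F, 0x70, 0x72, 0x69, 0xC3, 0xA9,
                        0x74, 0x61, 0x69, 0x72, 0x65, 0x00 };

        IntPtr native = Marshal.AllocHGlobal(utf8.Length);
        try
        {
            Marshal.Copy(utf8, 0, native, utf8.Length);

            // One char per byte: C3 and A9 become two separate UTF-16 chars.
            string wrong = Encoding.GetEncoding("ISO-8859-1").GetString(utf8, 0, utf8.Length - 1);
            // Proper UTF-8 decoding: C3 A9 becomes the single char U+00E9.
            string right = PtrToStringUtf8(native);

            Console.WriteLine(wrong.Length);              // 13
            Console.WriteLine(right.Length);              // 12
            Console.WriteLine((int)right[6] == 0xE9);     // True
        }
        finally
        {
            Marshal.FreeHGlobal(native);
        }
    }
}
```

The 13-versus-12 length difference matches the `byte[26]` dump in the description: 13 UTF-16 code units where 12 were expected.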