Opened 11 years ago
Closed 9 years ago
#4971 closed defect (fixed)
C# Swig bindings not handling UTF-8 to UTF-16 conversion
Reported by: | rburhum | Owned by: | tamas |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | CSharpBindings | Version: | 1.9.2 |
Severity: | normal | Keywords: | unicode, csharp, utf |
Cc: | tamas, warmerdam |
Description
While fixing a bug with the GDAL/OGR plugin for ArcGIS, I noticed that strings returned by the C# Swig bindings did not seem correct. The strings that come out of the GDAL API are UTF-8, but the strings that come out of the C# bindings should be converted to UTF-16 equivalents. This is not done properly and we get odd results.
Analysis:
My underlying datastore is a spatialite file (attached). It has strings in UTF8. To verify this, I did:
    spatialite> select gid, test_txt from yolo;
    705507|propriétaire
    704391|abc
    spatialite>
Notice 705507 has the correct character displayed. To verify the hex utf values, I did:
    spatialite> select gid, hex(test_txt) from yolo;
    705507|70726F707269C3A97461697265
    704391|616263
Upon inspection of the UTF8 table at http://www.utf8-chartable.de/ I could figure out that the string contains the following:
    70 p
    72 r
    6F o
    70 p
    72 r
    69 i
    C3   <---- C3A9 is the multibyte UTF8 char for é
    A9 é
    74 t
    61 a
    69 i
    72 r
    65 e
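The manual table lookup above can also be checked mechanically. This is a minimal Python sketch, not part of the original report: it takes the hex string printed by spatialite and decodes it strictly as UTF-8. The variable names are only illustrative.

```python
# Hex string copied from the spatialite hex(test_txt) output above.
raw = bytes.fromhex("70726F707269C3A97461697265")

# A strict decode raises UnicodeDecodeError if the bytes are not valid UTF-8.
text = raw.decode("utf-8")

print(len(raw), len(text))   # number of bytes vs. number of characters
print(ascii(text))           # ascii() keeps the output printable on any console
print(hex(ord(text[6])))     # code point of the accented character
```

The byte count is one larger than the character count because C3 A9 is a two-byte sequence for a single character.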
On the C# side, the string I get back from feature.GetFieldAsString contains the wrong value:
    feature.GetFieldAsString(ogrIndex)
    "propriÃ©taire"
C# strings are supposed to be UTF16 strings according to the .NET documentation. Inspecting the string returned as bytes I find:
    byte[] utf16Bytes = Encoding.Unicode.GetBytes(feature.GetFieldAsString(ogrIndex));
    {byte[26]}
        [0]: 112
        [1]: 0
        [2]: 114
        [3]: 0
        [4]: 111
        [5]: 0
        [6]: 112
        [7]: 0
        [8]: 114
        [9]: 0
        [10]: 105
        [11]: 0
        [12]: 195
        [13]: 0
        [14]: 169
        [15]: 0
        [16]: 116
        [17]: 0
        [18]: 97
        [19]: 0
        [20]: 105
        [21]: 0
        [22]: 114
        [23]: 0
        [24]: 101
        [25]: 0
Looking at the UTF16 table at http://www.fileformat.info/info/charset/UTF-16/list.htm, or better yet at the same bytes as hex values, I can see that the string characters look like this:
    - utf16Bytes {byte[0x0000001a]} byte[]
        [0x00000000] 0x70 byte  p
        [0x00000001] 0x00 byte
        [0x00000002] 0x72 byte  r
        [0x00000003] 0x00 byte
        [0x00000004] 0x6f byte  o
        [0x00000005] 0x00 byte
        [0x00000006] 0x70 byte  p
        [0x00000007] 0x00 byte
        [0x00000008] 0x72 byte  r
        [0x00000009] 0x00 byte
        [0x0000000a] 0x69 byte  i
        [0x0000000b] 0x00 byte
        [0x0000000c] 0xc3 byte  Ã   <-- 00c3 is Ã
        [0x0000000d] 0x00 byte
        [0x0000000e] 0xa9 byte  ©   <-- 00A9 is ©
        [0x0000000f] 0x00 byte
        [0x00000010] 0x74 byte  t
        [0x00000011] 0x00 byte
        [0x00000012] 0x61 byte  a
        [0x00000013] 0x00 byte
        [0x00000014] 0x69 byte  i
        [0x00000015] 0x00 byte
        [0x00000016] 0x72 byte  r
        [0x00000017] 0x00 byte
        [0x00000018] 0x65 byte  e
        [0x00000019] 0x00 byte
That explains why I am seeing these characters. The problem should be obvious: the multibyte UTF-8 sequence C3A9 (é) should have been converted to the single UTF-16 char 00E9 (é), and instead it was split into *two* UTF-16 chars, 00C3 (Ã) and 00A9 (©).
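The byte dump can be reproduced outside the bindings. The sketch below is a Python illustration rather than the bindings' actual code path. It assumes the marshalling layer maps each UTF-8 byte to one character, which is what a Latin-1 decode does. The real bindings may go through the system ANSI code page instead, which gives the same result for these two bytes. `utf-16-le` stands in for .NET's `Encoding.Unicode`.

```python
# UTF-8 bytes as handed back by the GDAL/OGR C API for this field.
utf8_bytes = "propri\u00e9taire".encode("utf-8")

# Assumption: the faulty path treats every byte as one character (Latin-1).
wrong = utf8_bytes.decode("latin-1")
# The intended path decodes the bytes as UTF-8.
right = utf8_bytes.decode("utf-8")

# Same view as Encoding.Unicode.GetBytes(...) in the report.
wrong_utf16 = wrong.encode("utf-16-le")
right_utf16 = right.encode("utf-16-le")

print(len(wrong_utf16), list(wrong_utf16[12:16]))  # 26 [195, 0, 169, 0]
print(len(right_utf16), list(right_utf16[12:14]))  # 24 [233, 0]
```

With the per-byte decode the result is 26 bytes with 195, 0, 169, 0 at offsets 12 to 15, matching the dump in this report. A correct conversion yields 24 bytes, with é as the single code unit E9 00.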
The fix should be a one-liner in the SWIG bindings, but I am not smart enough to figure out where this happens in the SWIG pipeline.
Change History (8)
comment:1 by , 11 years ago
comment:2 by , 11 years ago
Cc: | added |
---|
comment:3 by , 11 years ago
Cc: | added |
---|
comment:4 by , 11 years ago
For anybody interested in a workaround that will make it work correctly until the fix can be written in the swig bindings themselves, you can look at my "ghetto string workaround" :)
https://github.com/RBURHUM/arcgis-ogr/commit/341a701b0c1a096f4da304d7965403b7618d55e9
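For readers who cannot follow the link, the general idea behind this kind of workaround can be sketched as follows. This is not the code from the commit above. It is a Python illustration of the round trip, and `repair_mojibake` is a hypothetical helper name. It assumes the damaged string was produced by a byte-per-character (Latin-1) decode. If the bindings used Windows-1252 instead, bytes in the 0x80 to 0x9F range would need that code page for the re-encode step.

```python
def repair_mojibake(s: str) -> str:
    """Undo a byte-per-character decode of UTF-8 data (hypothetical helper)."""
    try:
        # Turn each character back into the single byte it came from,
        # then decode the resulting byte string properly as UTF-8.
        return s.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The string is not mojibake of this kind: return it unchanged.
        return s


print(ascii(repair_mojibake("propri\u00c3\u00a9taire")))
print(repair_mojibake("abc"))
```

Strings that are already correct fall through the `except` branch, because their Latin-1 bytes are usually not valid UTF-8. A string whose characters happen to form a valid UTF-8 byte sequence would be altered, so this is a stopgap, not a substitute for fixing the marshalling in the bindings.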
comment:7 by , 11 years ago
Resolution: | fixed |
---|---|
Status: | closed → reopened |
I am thinking of backporting the fix to the latest stable branch as well.
comment:8 by , 9 years ago
Resolution: | → fixed |
---|---|
Status: | reopened → closed |
What was trunk then is the latest stable branch now, so the passage of time has taken care of this part of the issue.
Cannot attach the file, but you can download it from my Dropbox https://dl.dropbox.com/u/4779803/gdal-ogrplugin/bugs/ogrtest_utf.sqlite