Opened 17 years ago
Closed 15 years ago
#1526 closed defect (fixed)
Problem with widecharacter ISO8211 parsing
Reported by: | warmerdam | Owned by: | warmerdam |
---|---|---|---|
Priority: | normal | Milestone: | 1.5.4 |
Component: | OGR_SF | Version: | 1.5.0 |
Severity: | normal | Keywords: | s57 iso8211 unicode |
Cc: | Even Rouault |
Description (last modified by )
Dear Warmerdam,

I am a programmer in China, and I am using your ISO8211 library to develop an S57 parser for a customer. I found a Unicode string decoding problem, as follows.

Your code uses "chFormatDelimeter" as the splitting character when the user calls "ExtractStringData(pszData, iBytesLeft, &iBytesConsumed)". I also found that "chFormatDelimeter = DDF_UNIT_TERMINATOR", and that DDF_UNIT_TERMINATOR is defined as 31. Normally this causes no problem, but if a string contains a Unicode character (2 bytes) in which one byte is 31, the string is truncated.

I met this problem with the attached S57 000 file. In this file, the feature with "RCID = 1394, OBJL = 109" has a "NOBJNM" string attribute, and this string contains a Chinese Unicode character in which one byte is equal to 31.

Because I do not have the detailed ISO8211 documentation, I do not know how to change your code. I hope you can help, and I look forward to your response.

Best regards, Yanli
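The truncation described in this report can be reproduced with a few lines of C. This is only a sketch, not the library's code: `naive_extent` is a hypothetical stand-in for a byte-wise delimiter scan, and the sample bytes are the ATVL payload from the JP44NVQ4.000 dump quoted in comment:5 (not the reporter's file), assumed here to be two-byte little-endian text.

```c
#include <assert.h>
#include <stddef.h>

#define DDF_UNIT_TERMINATOR 31 /* 0x1F, the value quoted in the report */

/* Hypothetical helper: scan byte by byte and return the number of bytes
 * that precede the first delimiter byte (or len if none is found). */
size_t naive_extent(const unsigned char *data, size_t len,
                    unsigned char delim)
{
    assert(data != NULL);
    size_t i = 0;
    while (i < len && data[i] != delim)
        i++;
    return i;
}

/* FF FE, then three two-byte characters (1F 75, 30 75, 3B 9F), then the
 * two-byte unit terminator 1F 00. The first character has 0x1F as its
 * low byte. */
const unsigned char sample_atvl[] = {
    0xFF, 0xFE, 0x1F, 0x75, 0x30, 0x75, 0x3B, 0x9F, 0x1F, 0x00
};
```

Calling `naive_extent(sample_atvl, sizeof sample_atvl, DDF_UNIT_TERMINATOR)` stops at index 2, so only `FF FE` of the 8-byte payload survives. That matches the two-byte ATVL value shown in the old output for that file.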
Attachments (2)
Change History (11)
comment:2 by , 17 years ago
Description: | modified (diff) |
---|---|
Milestone: | → 1.4.1 |
Priority: | highest → normal |
Severity: | blocker → normal |
comment:3 by , 17 years ago
Milestone: | 1.4.1 → 1.4.2 |
---|
I do not foresee having time to deal with this for 1.4.1. Pushing to 1.4.2.
comment:4 by , 17 years ago
Milestone: | 1.4.2 → 1.4.3 |
---|
I'm afraid I'm not going to get to this for 1.4.2 either.
comment:5 by , 17 years ago
Here's a tentative patch to address the problem. Below is an extract of the comment included in the patch to explain its rationale:
"In the case of S57, the subfield ATVL of the NATF field can be encoded in lexical level 2 (see S57 specification, Edition 3.1, paragraph 2.4 and 2.5). In that case the Unit Terminator and Field Terminator are followed by the NULL character. A better fix would be to read the NALL tag in the DSSI to check that the lexical level is 2 instead of relying on the value of the first byte as we are doing. "
The test to detect multibyte strings was also missing a cast to unsigned char. Apart from this, the patch should not change the behaviour for ASCII content.
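The rationale above can be sketched as follows. This is an illustration under stated assumptions, not the attached patch: the function names are hypothetical, the printable-ASCII range (32 to 126) is an assumed reading of "the value of the first byte", and two-byte text is assumed to be aligned on 16-bit boundaries from the start of the subfield.

```c
#include <assert.h>
#include <stddef.h>

#define DDF_FIELD_TERMINATOR 30 /* 0x1E */
#define DDF_UNIT_TERMINATOR  31 /* 0x1F */

/* First-byte heuristic: a byte outside printable ASCII suggests two-byte
 * text. The cast matters where plain char is signed, because 0xFF would
 * otherwise compare as -1. */
int first_byte_suggests_two_byte(const char *data, size_t len)
{
    assert(data != NULL);
    if (len == 0)
        return 0;
    unsigned char first = (unsigned char)data[0];
    return first < 32 || first > 126;
}

/* Return the number of payload bytes before the terminator. */
size_t subfield_extent(const char *data, size_t len)
{
    assert(data != NULL);
    size_t i = 0;
    if (first_byte_suggests_two_byte(data, len)) {
        /* Step one 16-bit character at a time; a terminator is the unit
         * or field terminator followed by a NUL byte. */
        while (i + 1 < len) {
            unsigned char lo = (unsigned char)data[i];
            if ((lo == DDF_UNIT_TERMINATOR || lo == DDF_FIELD_TERMINATOR)
                && data[i + 1] == 0)
                return i;
            i += 2;
        }
        return len;
    }
    /* Single-byte text: stop at the first terminator byte. */
    while (i < len
           && (unsigned char)data[i] != DDF_UNIT_TERMINATOR
           && (unsigned char)data[i] != DDF_FIELD_TERMINATOR)
        i++;
    return i;
}

/* 12 bytes of two-byte text with terminators, and 6 bytes of ASCII. */
const char sample_two_byte[] =
    "\xFF\xFE\x1F\x75\x30\x75\x3B\x9F\x1F\x00\x1E\x00";
const char sample_ascii[] = "DEPTH\x1F";
```

With these samples, `subfield_extent(sample_two_byte, 12)` yields 8 (the full payload) and `subfield_extent(sample_ascii, 6)` yields 5. Stepping two bytes at a time avoids matching a 0x1F high byte of one character followed by a 0x00 low byte of the next.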
I think it also improves the situation somewhat on other datasets. I've tested it on JP34NC94.000 as well. Running 'ogrinfo -ro -al ../ENC_ROOT/JP34NC94.000' no longer produces the warning: "Warning 1: Illegal feature attribute id (NATF:ATTL[1]) of 11264 on feature FIDN=34960548, FIDS=1512. Skipping attribute, no more warnings will be issued." The output for this feature changes slightly:
@@ -2235,7 +2235,7 @@
   VERACC (Real) = (null)
   VERLEN (Real) = (null)
   INFORM (String) = (null)
-  NINFOM (String) = (null)
+  NINFOM (String) = ���n0�eW[�0��yk0��V��IQh�:yY0�0
   NTXTDS (String) = (null)
   PICREP (String) = (null)
   SCAMAX (Integer) = (null)
Diff of the output of iso8211 :
@@ -154334,8 +154334,8 @@
  Data = `-\01\FF\FE-Nq\4lS\90,{\13\FF\F7Sopnm\19j\1F\00,\01\FF\FE\13\FFn0\87eW[\920...'
  Subfield `ATTL' = 301
  Subfield `ATVL' = `��-Nq\4lS�,{��Sopnmj'
- Subfield `ATTL' = 11264
- Subfield `ATVL' = `���n0�eW[�0��yk0��V��IQh�:yY0�0'
+ Subfield `ATTL' = 300
+ Subfield `ATVL' = `���n0�eW[�0��yk0��V��IQh�:yY0�0'
  DDFField:
  Tag = `FFPT'
  DataSize = 21
Similar effect on JP44NVQ4.000 :
@@ -209089,9 +209089,7 @@
  DataSize = 14
  Data = `-\01\FF\FE\1Fu0u;\9F\1F\00\1E\00'
  Subfield `ATTL' = 301
- Subfield `ATVL' = `��'
- Subfield `ATTL' = 12405
- Subfield `ATVL' = `u;�'
+ Subfield `ATVL' = `��u0u;�'
  DDFField:
  Tag = `FSPT'
  DataSize = 9
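The bogus ATTL values in the old output (11264 and 12405) follow from reading a two-byte integer at the wrong offset. The sketch below assumes ATTL is stored as an unsigned 16-bit little-endian integer, which is consistent with `-\01` decoding to 301 in the dumps above. `read_u16_le` and the array names are hypothetical; the byte values are copied from the two Data lines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode an unsigned 16-bit little-endian integer. */
uint16_t read_u16_le(const unsigned char *p)
{
    assert(p != NULL);
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* JP34NC94.000: the bytes `\1F\00,\01` at the boundary between the first
 * ATVL and the second ATTL (',' is 0x2C). */
const unsigned char jp34_boundary[] = { 0x1F, 0x00, 0x2C, 0x01 };

/* JP44NVQ4.000: the bytes `u0` that follow the 0x1F inside the text
 * ('u' is 0x75, '0' is 0x30). */
const unsigned char jp44_after_split[] = { 0x75, 0x30 };
```

If only the single byte 0x1F is consumed as the terminator, the next ATTL is read from `jp34_boundary + 1` and gives 0x2C00 = 11264. Consuming the full two-byte terminator reads from `jp34_boundary + 2` and gives 0x012C = 300. In the same way, `read_u16_le(jp44_after_split)` gives 0x3075 = 12405, the spurious ATTL in the JP44NVQ4.000 dump.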
by , 17 years ago
Attachment: | gdal_svn_bug1526.patch added |
---|
Fix bug 1526 (unicode strings in S57 files)
comment:6 by , 17 years ago
Component: | default → OGR_SF |
---|---|
Keywords: | s57 iso8211 unicode added |
Resolution: | → fixed |
Status: | assigned → closed |
comment:7 by , 17 years ago
I've no idea if JP44NVQ4.000 is in the public domain; I just downloaded it from http://www1.kaiho.mlit.go.jp/KOKAI/ENC/English/sample.html. Maybe you should ask the "Hydrographic and Oceanographic Department, JCG", which seems to be the producer.
comment:8 by , 15 years ago
Cc: | added |
---|---|
Milestone: | 1.4.3 → 1.5.4 |
Resolution: | fixed |
Status: | closed → reopened |
Version: | 1.4.0 → 1.5.0 |
The patch approach depends on testing whether the first character is valid in the ASCII character set. However, this is quite a weak test, and I've run into another file (UA4T3402.000) that has a valid ASCII character in the first position but is double byte. It is therefore unreadable.
I think we should instead test whether the last two bytes are 0x1e 0x00 to identify a field as double byte. Preparing an alternate patch.
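A minimal sketch of the proposed check, assuming the whole field (including its terminator) is available as a byte buffer. The function name and the second and third sample arrays are hypothetical; the first sample is the 14-byte field from the JP44NVQ4.000 dump in comment:5. This is not the alternate patch itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: treat a field as two-byte text when it ends with
 * the field terminator 0x1E followed by a NUL byte. */
int field_ends_two_byte(const unsigned char *field, size_t len)
{
    assert(field != NULL);
    return len >= 2 && field[len - 2] == 0x1E && field[len - 1] == 0x00;
}

/* Field from the JP44NVQ4.000 dump: ATTL, two-byte text, 1F 00, 1E 00. */
const unsigned char jp44_field[] = {
    0x2D, 0x01, 0xFF, 0xFE, 0x1F, 0x75, 0x30, 0x75,
    0x3B, 0x9F, 0x1F, 0x00, 0x1E, 0x00
};

/* Made-up two-byte field whose text starts with an ASCII byte ('A' 00),
 * the case that defeats a first-byte test. */
const unsigned char ascii_led_two_byte_field[] = {
    0x2D, 0x01, 0x41, 0x00, 0x42, 0x00, 0x1F, 0x00, 0x1E, 0x00
};

/* Made-up single-byte field: ATTL, "DEP", 1F, 1E. */
const unsigned char single_byte_field[] = {
    0x2D, 0x01, 0x44, 0x45, 0x50, 0x1F, 0x1E
};
```

The check returns 1 for both two-byte samples and 0 for the single-byte one, because a single-byte field ends in 0x1E rather than 0x00. It depends only on the terminator, not on the text content.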
comment:9 by , 15 years ago
Resolution: | → fixed |
---|---|
Status: | reopened → closed |