Opened 13 years ago

Closed 11 years ago

#1526 closed defect (fixed)

Problem with widecharacter ISO8211 parsing

Reported by: warmerdam Owned by: warmerdam
Priority: normal Milestone: 1.5.4
Component: OGR_SF Version: 1.5.0
Severity: normal Keywords: s57 iso8211 unicode
Cc: Even Rouault

Description (last modified by warmerdam)

Dear Warmerdam,

I am a programmer in China, and are using your ISO8211 lib to develop
a S57 parser program for a customer. I found a Unicode string decoding problem
as following:

 In your program you are using "chFormatDelimeter" as the splitting character to let user call "ExtractStringData(pszData, iBytesLeft, &iBytesConsumed)", I also found "chFormatDelimeter = DDF_UNIT_TERMINATOR", you have defined DDF_UNIT_TERMINATOR
as 31, normally there is no problem, but if there is a Unicode character (2 bytes)
which contains 31, then these string will be truncated.
I met this problem with the attached S57 000 file!! In this file, the "RCID = 1394,
OBJL = 109" feature contains a  "NOBJNM" string attribute, this string contains a chinese
unicode character, one byte is equal to '31'.

Because I don't have ISO8211 detailed document, I don't know how to change your codes.
Hope you can help me!

Hope to received your response!

Best regards,
Yanli

Attachments (2)

bug.000 (1.8 KB) - added by warmerdam 13 years ago.
Subset of S-57 file demonstrating bug
gdal_svn_bug1526.patch (3.5 KB) - added by Even Rouault 12 years ago.
Fix bug 1526 (unicode strings in S57 files)

Download all attachments as: .zip

Change History (11)

comment:1 Changed 13 years ago by warmerdam

I have reproduced this problem with the GDAL 1.4.0 version of ISO8211lib. 

I have reduced the problem file to just the header record and the broken 
record, and have attached this data problem looks like this:

The 8211dump output for the 
  DDFField:
      Tag = `NATF'
      DataSize = 40
      Data = `-\01\C7\86\E3S\F3\81\E0swm\D8\9A\1F\90\A2[9\82\A8cP\83*\82\BF~\08\FF\17S\BF~\09\FF\1F\00\1E\00'
      Subfield `ATTL' = 301
      Subfield `ATVL' = `dž��swmؚ'
      Subfield `ATTL' = 41616
      Subfield `ATVL' = `[9��cP�*���S�~ �'

There should be no 41616 ATTL value - this is due to the improper break
of the first subfield due to the terminator interpretation.

I'm not sure when I will get to fixing this, but I'd like to try and
address it before the GDAL 1.4.1 release as it adversely affects the
OGR S-57 driver.

Changed 13 years ago by warmerdam

Attachment: bug.000 added

Subset of S-57 file demonstrating bug

comment:2 Changed 13 years ago by warmerdam

Description: modified (diff)
Milestone: 1.4.1
Priority: highestnormal
Severity: blockernormal

comment:3 Changed 13 years ago by warmerdam

Milestone: 1.4.11.4.2

I do not forsee having time to deal with this for 1.4.1. Pushing to 1.4.2.

comment:4 Changed 12 years ago by warmerdam

Milestone: 1.4.21.4.3

I'm afraid I'm not going to get to this for 1.4.2 either.

comment:5 Changed 12 years ago by Even Rouault

Here's a temptative patch to address the problem. Here's an extract of the comment put in the patch to explain its rationale :

"In the case of S57, the subfield ATVL of the NATF field can be encoded in lexical level 2 (see S57 specification, Edition 3.1, paragraph 2.4 and 2.5). In that case the Unit Terminator and Field Terminator are followed by the NULL character. A better fix would be to read the NALL tag in the DSSI to check that the lexical level is 2 instead of relying on the value of the first byte as we are doing. "

The test to detect multibyte strings was also missing a cast to unsigned char. Apart from this, the patch should have no consequences on the behaviour on ASCII content.

I think it improves a bit the situation on other datasets. I've also tested it on JP34NC94.000. When doing 'ogrinfo -ro -al ../ENC_ROOT/JP34NC94.000', it removes the warning : "Warning 1: Illegal feature attribute id (NATF:ATTL[1]) of 11264 on feature FIDN=34960548, FIDS=1512. Skipping attribute, no more warnings will be issued." The output of this feature is a bit modified :

@@ -2235,7 +2235,7 @@
   VERACC (Real) = (null)
   VERLEN (Real) = (null)
   INFORM (String) = (null)
-  NINFOM (String) = (null)
+  NINFOM (String) = ���n0�eW[�0��yk0��V��IQh�:yY0�0
   NTXTDS (String) = (null)
   PICREP (String) = (null)
   SCAMAX (Integer) = (null)

Diff of the output of iso8211 :

@@ -154334,8 +154334,8 @@
       Data = `-\01\FF\FE-Nq\4lS\90,{\13\FF\F7Sopnm\19j\1F\00,\01\FF\FE\13\FFn0\87eW[\920...'
       Subfield `ATTL' = 301
       Subfield `ATVL' = `��-Nq\4lS�,{��Sopnmj'
-      Subfield `ATTL' = 11264
-      Subfield `ATVL' = `���n0�eW[�0��yk0��V��IQh�:yY0�0'
+      Subfield `ATTL' = 300
+      Subfield `ATVL' = `���n0�eW[�0��yk0��V��IQh�:yY0�0'
   DDFField:
       Tag = `FFPT'
       DataSize = 21

Similar effect on JP44NVQ4.000 :

@@ -209089,9 +209089,7 @@
       DataSize = 14
       Data = `-\01\FF\FE\1Fu0u;\9F\1F\00\1E\00'
       Subfield `ATTL' = 301
-      Subfield `ATVL' = `��'
-      Subfield `ATTL' = 12405
-      Subfield `ATVL' = `u;�'
+      Subfield `ATVL' = `��u0u;�'
   DDFField:
       Tag = `FSPT'
       DataSize = 9

Changed 12 years ago by Even Rouault

Attachment: gdal_svn_bug1526.patch added

Fix bug 1526 (unicode strings in S57 files)

comment:6 Changed 12 years ago by warmerdam

Component: defaultOGR_SF
Keywords: s57 iso8211 unicode added
Resolution: fixed
Status: assignedclosed

I concur with your analysis, and the fix. Applied in trunk (r12002) and 1.4 (r12003).

By any chance is JP44NVQ4.000 in the public domain so it could be used in the test suite?

Even, you rock.

comment:7 Changed 12 years ago by Even Rouault

I've no idea if JP44NVQ4.000 is in the public domain; I just downloaded it from http://www1.kaiho.mlit.go.jp/KOKAI/ENC/English/sample.html. Maybe you should ask to the "Hydrographic and Oceanographic Department,JCG" that seems to be the producer.

comment:8 Changed 11 years ago by warmerdam

Cc: Even Rouault added
Milestone: 1.4.31.5.4
Resolution: fixed
Status: closedreopened
Version: 1.4.01.5.0

The patch approach depends on testing the first character to see if it is valid in the ASCII character set. However, this is quite a weak test, and I've run into another file (UA4T3402.000) that has a valid ascii character in the first position but that is double byte. It is therefor unreadable.

I think instead we should test if that last two characters are 0x1e 0x00 to identify a field as double character. Preparing an alternate patch.

comment:9 Changed 11 years ago by warmerdam

Resolution: fixed
Status: reopenedclosed

I have updated to identify double byte values based on the field ending with 0x1e 0x00. Changed in trunk (r15632) and 1.5 branch (r15634). Also added an autotest test of sorts (r15633). I should note that the NINFOM value is still not properly converted to UTF-8 as it ought to be.

Note: See TracTickets for help on using tickets.