Ticket #1526 (closed defect: fixed)

Opened 6 years ago

Last modified 5 years ago

Problem with widecharacter ISO8211 parsing

Reported by: warmerdam Owned by: warmerdam
Priority: normal Milestone: 1.5.4
Component: OGR_SF Version: 1.5.0
Severity: normal Keywords: s57 iso8211 unicode
Cc: rouault

Description (last modified by warmerdam) (diff)

Dear Warmerdam,

I am a programmer in China, and are using your ISO8211 lib to develop
a S57 parser program for a customer. I found a Unicode string decoding problem
as following:

 In your program you are using "chFormatDelimeter" as the splitting character to let user call "ExtractStringData(pszData, iBytesLeft, &iBytesConsumed)", I also found "chFormatDelimeter = DDF_UNIT_TERMINATOR", you have defined DDF_UNIT_TERMINATOR
as 31, normally there is no problem, but if there is a Unicode character (2 bytes)
which contains 31, then these string will be truncated.
I met this problem with the attached S57 000 file!! In this file, the "RCID = 1394,
OBJL = 109" feature contains a  "NOBJNM" string attribute, this string contains a chinese
unicode character, one byte is equal to '31'.

Because I don't have ISO8211 detailed document, I don't know how to change your codes.
Hope you can help me!

Hope to received your response!

Best regards,
Yanli

Attachments

bug.000 Download (1.8 KB) - added by warmerdam 6 years ago.
Subset of S-57 file demonstrating bug
gdal_svn_bug1526.patch Download (3.5 KB) - added by rouault 6 years ago.
Fix bug 1526 (unicode strings in S57 files)

Change History

Changed 6 years ago by warmerdam

I have reproduced this problem with the GDAL 1.4.0 version of ISO8211lib. 

I have reduced the problem file to just the header record and the broken 
record, and have attached this data problem looks like this:

The 8211dump output for the 
  DDFField:
      Tag = `NATF'
      DataSize = 40
      Data = `-\01\C7\86\E3S\F3\81\E0swm\D8\9A\1F\90\A2[9\82\A8cP\83*\82\BF~\08\FF\17S\BF~\09\FF\1F\00\1E\00'
      Subfield `ATTL' = 301
      Subfield `ATVL' = `dž��swmؚ'
      Subfield `ATTL' = 41616
      Subfield `ATVL' = `[9��cP�*���S�~ �'

There should be no 41616 ATTL value - this is due to the improper break
of the first subfield due to the terminator interpretation.

I'm not sure when I will get to fixing this, but I'd like to try and
address it before the GDAL 1.4.1 release as it adversely affects the
OGR S-57 driver.

Changed 6 years ago by warmerdam

Subset of S-57 file demonstrating bug

Changed 6 years ago by warmerdam

  • priority changed from highest to normal
  • severity changed from blocker to normal
  • description modified (diff)
  • milestone set to 1.4.1

Changed 6 years ago by warmerdam

  • milestone changed from 1.4.1 to 1.4.2

I do not forsee having time to deal with this for 1.4.1. Pushing to 1.4.2.

Changed 6 years ago by warmerdam

  • milestone changed from 1.4.2 to 1.4.3

I'm afraid I'm not going to get to this for 1.4.2 either.

Changed 6 years ago by rouault

Here's a temptative patch to address the problem. Here's an extract of the comment put in the patch to explain its rationale :

"In the case of S57, the subfield ATVL of the NATF field can be encoded in lexical level 2 (see S57 specification, Edition 3.1, paragraph 2.4 and 2.5). In that case the Unit Terminator and Field Terminator are followed by the NULL character. A better fix would be to read the NALL tag in the DSSI to check that the lexical level is 2 instead of relying on the value of the first byte as we are doing. "

The test to detect multibyte strings was also missing a cast to unsigned char. Apart from this, the patch should have no consequences on the behaviour on ASCII content.

I think it improves a bit the situation on other datasets. I've also tested it on JP34NC94.000. When doing 'ogrinfo -ro -al ../ENC_ROOT/JP34NC94.000', it removes the warning : "Warning 1: Illegal feature attribute id (NATF:ATTL[1]) of 11264 on feature FIDN=34960548, FIDS=1512. Skipping attribute, no more warnings will be issued." The output of this feature is a bit modified :

@@ -2235,7 +2235,7 @@
   VERACC (Real) = (null)
   VERLEN (Real) = (null)
   INFORM (String) = (null)
-  NINFOM (String) = (null)
+  NINFOM (String) = ���n0�eW[�0��yk0��V��IQh�:yY0�0
   NTXTDS (String) = (null)
   PICREP (String) = (null)
   SCAMAX (Integer) = (null)

Diff of the output of iso8211 :

@@ -154334,8 +154334,8 @@
       Data = `-\01\FF\FE-Nq\4lS\90,{\13\FF\F7Sopnm\19j\1F\00,\01\FF\FE\13\FFn0\87eW[\920...'
       Subfield `ATTL' = 301
       Subfield `ATVL' = `��-Nq\4lS�,{��Sopnmj'
-      Subfield `ATTL' = 11264
-      Subfield `ATVL' = `���n0�eW[�0��yk0��V��IQh�:yY0�0'
+      Subfield `ATTL' = 300
+      Subfield `ATVL' = `���n0�eW[�0��yk0��V��IQh�:yY0�0'
   DDFField:
       Tag = `FFPT'
       DataSize = 21

Similar effect on JP44NVQ4.000 :

@@ -209089,9 +209089,7 @@
       DataSize = 14
       Data = `-\01\FF\FE\1Fu0u;\9F\1F\00\1E\00'
       Subfield `ATTL' = 301
-      Subfield `ATVL' = `��'
-      Subfield `ATTL' = 12405
-      Subfield `ATVL' = `u;�'
+      Subfield `ATVL' = `��u0u;�'
   DDFField:
       Tag = `FSPT'
       DataSize = 9

Changed 6 years ago by rouault

Fix bug 1526 (unicode strings in S57 files)

Changed 6 years ago by warmerdam

  • keywords s57 iso8211 unicode added
  • resolution set to fixed
  • status changed from assigned to closed
  • component changed from default to OGR_SF

I concur with your analysis, and the fix. Applied in trunk (r12002) and 1.4 (r12003).

By any chance is JP44NVQ4.000 in the public domain so it could be used in the test suite?

Even, you rock.

Changed 6 years ago by rouault

I've no idea if JP44NVQ4.000 is in the public domain; I just downloaded it from  http://www1.kaiho.mlit.go.jp/KOKAI/ENC/English/sample.html. Maybe you should ask to the "Hydrographic and Oceanographic Department,JCG" that seems to be the producer.

Changed 5 years ago by warmerdam

  • cc rouault added
  • status changed from closed to reopened
  • version changed from 1.4.0 to 1.5.0
  • resolution fixed deleted
  • milestone changed from 1.4.3 to 1.5.4

The patch approach depends on testing the first character to see if it is valid in the ASCII character set. However, this is quite a weak test, and I've run into another file (UA4T3402.000) that has a valid ascii character in the first position but that is double byte. It is therefor unreadable.

I think instead we should test if that last two characters are 0x1e 0x00 to identify a field as double character. Preparing an alternate patch.

Changed 5 years ago by warmerdam

  • status changed from reopened to closed
  • resolution set to fixed

I have updated to identify double byte values based on the field ending with 0x1e 0x00. Changed in trunk (r15632) and 1.5 branch (r15634). Also added an autotest test of sorts (r15633). I should note that the NINFOM value is still not properly converted to UTF-8 as it ought to be.

Note: See TracTickets for help on using tickets.