Ticket #1534 (closed defect: fixed)

Opened 1 year ago

Last modified 1 year ago

OGR GML reader fails if file has UTF-8 BOM prefix

Reported by: rogerjames99 Assigned to: mloskot
Priority: normal Milestone: 1.4.1
Component: OGR_SF Version: unspecified
Severity: normal Keywords: UTF BOM GML
Cc: warmerdam

Description (Last modified by warmerdam)

The function OGRGMLDataSource::Open in ogrgmldatasource.cpp fails if the GML file has a UTF-8 encoded Unicode BOM (byte order mark) at the start of the file. A leading BOM is permitted in UTF-8 (see RFC 3629, section 6) and should be accepted. Xerces handles this sequence properly. The code below is a modification of this function that allows for it.

It may be better to remove the "Test Open" functionality altogether and just let Xerces worry about whether the XML is well formed.

int OGRGMLDataSource::Open( const char * pszNewName, int bTestOpen )

{
    FILE        *fp;
    char        szHeader[1000];

/* -------------------------------------------------------------------- */
/*      Open the source file.                                           */
/* -------------------------------------------------------------------- */
    fp = VSIFOpen( pszNewName, "r" );
    if( fp == NULL )
    {
        if( !bTestOpen )
            CPLError( CE_Failure, CPLE_OpenFailed, 
                      "Failed to open GML file `%s'.", 
                      pszNewName );

        return FALSE;
    }

/* -------------------------------------------------------------------- */
/*      If we aren't sure it is GML, load a header chunk and check      */
/*      for signs it is GML                                             */
/* -------------------------------------------------------------------- */
    if( bTestOpen )
    {
        char *szPtr = szHeader;
        /* Terminate at the bytes actually read, so a short file does not */
        /* leave uninitialised data in front of the terminator.           */
        size_t nRead = VSIFRead( szHeader, 1, sizeof(szHeader) - 1, fp );
        szHeader[nRead] = '\0';
/* -------------------------------------------------------------------- */
/*      Check for a UTF-8 BOM and skip if found                         */
/* -------------------------------------------------------------------- */
        if( (unsigned char)szPtr[0] == 0xEF
            && (unsigned char)szPtr[1] == 0xBB
            && (unsigned char)szPtr[2] == 0xBF )
            szPtr += 3;

        if( szPtr[0] != '<' 
            || strstr(szPtr,"opengis.net/gml") == NULL )
        {
            VSIFClose( fp );
            return FALSE;
        }
    }

Attachments

ogrgmldatasource.cpp (32.5 kB) - added by rogerjames99 on 03/28/07 09:41:03.
smallsample.gml (3.7 kB) - added by rogerjames99 on 03/28/07 13:36:26.

Change History

03/28/07 09:41:03 changed by rogerjames99

  • attachment ogrgmldatasource.cpp added.

03/28/07 09:42:44 changed by rogerjames99

  • component changed from default to OGR_SF.

Sorry about the typos and munged source. I have attached the modded file from the 1.4.0 source.

03/28/07 12:39:48 changed by warmerdam

  • severity changed from minor to normal.
  • cc set to warmerdam.
  • priority changed from low to normal.
  • milestone set to 1.4.1.
  • owner changed from warmerdam to mloskot.
  • description changed.

Roger,

Could you attach a smallish file with this marker in it?

Mateusz,

Could you fix this in trunk and the 1.4 branch? We *do* want to keep the pre-checks, but they need to be safer. We don't want to start up Xerces for every file passed to OGROpen().

Please add a test for this in the test suite.
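
A minimal sketch of what a safer pre-check might look like, written as standalone C++ rather than against the GDAL API. The name LooksLikeGML and its buffer-plus-length signature are hypothetical and not part of OGR. The idea is only that the check looks at the bytes actually read, skips a UTF-8 BOM, and never needs Xerces.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of OGR): decide from a header chunk whether
// a file is plausibly GML, without starting an XML parser.
// `data` need not be NUL-terminated; only the first `len` bytes are examined.
static bool LooksLikeGML(const char *data, std::size_t len)
{
    // Skip a UTF-8 BOM (EF BB BF) if the chunk starts with one.
    if (len >= 3
        && static_cast<unsigned char>(data[0]) == 0xEF
        && static_cast<unsigned char>(data[1]) == 0xBB
        && static_cast<unsigned char>(data[2]) == 0xBF)
    {
        data += 3;
        len -= 3;
    }

    // The document must start with '<' ...
    if (len == 0 || data[0] != '<')
        return false;

    // ... and mention the GML namespace somewhere in the chunk.
    const std::string header(data, len);
    return header.find("opengis.net/gml") != std::string::npos;
}
```

Passing the length explicitly means a file shorter than the header buffer cannot cause the search to run into stale data.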

03/28/07 13:36:26 changed by rogerjames99

  • attachment smallsample.gml added.

03/28/07 13:38:10 changed by rogerjames99

Frank,

smallsample.gml attached.

Roger

04/03/07 17:58:41 changed by mloskot

  • status changed from new to assigned.

The fix proposed in the attached patch supports the BOM only for UTF-8 encoding. The BOM is a variable-length mark whose bytes depend on the actual encoding of the XML file. In the future, it would be reasonable to add support for other XML encodings.
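
To illustrate why the BOM length depends on the encoding, here is a rough standalone sketch that recognises the standard Unicode BOMs. The helper name BOMLength is hypothetical and is not what r11191 implements; the applied fix handles UTF-8 only.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper (not part of OGR): return the length in bytes of a
// Unicode BOM at the start of `data`, or 0 if none is recognised.
static std::size_t BOMLength(const unsigned char *data, std::size_t len)
{
    struct Mark { const char *bytes; std::size_t size; };
    // UTF-32 LE is tested before UTF-16 LE because FF FE 00 00 begins
    // with the UTF-16 LE mark FF FE.
    static const Mark marks[] = {
        { "\x00\x00\xFE\xFF", 4 },  // UTF-32, big-endian
        { "\xFF\xFE\x00\x00", 4 },  // UTF-32, little-endian
        { "\xEF\xBB\xBF",     3 },  // UTF-8
        { "\xFE\xFF",         2 },  // UTF-16, big-endian
        { "\xFF\xFE",         2 },  // UTF-16, little-endian
    };
    for (const Mark &m : marks)
        if (len >= m.size && std::memcmp(data, m.bytes, m.size) == 0)
            return m.size;
    return 0;
}
```

Skipping the mark is only part of the job for UTF-16 and UTF-32: in those encodings '<' and the namespace string occupy more than one byte per character, so a byte-wise header check would also have to change.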

04/03/07 18:01:35 changed by mloskot

  • status changed from assigned to closed.
  • resolution set to fixed.

The fix has been applied (r11191).

Also, the ogr_gml_read.py test suite has been updated with a new case that reads a GML file with a BOM (bom.gml).

04/03/07 18:30:59 changed by mloskot

The fix has also been submitted to the stable 1.4 branch (r11193).