#1534 closed defect (fixed)
OGR GML reader fails if file has UTF-8 BOM prefix
Reported by: | rogerjames99 | Owned by: | Mateusz Łoskot |
---|---|---|---|
Priority: | normal | Milestone: | 1.4.1 |
Component: | OGR_SF | Version: | unspecified |
Severity: | normal | Keywords: | UTF BOM GML |
Cc: | warmerdam |
Description
The function OGRGMLDataSource::Open in ogrgmldatasource.cpp fails if the GML file has a UTF-8 encoded Unicode BOM (byte order mark) at the start of the file. This is valid UTF-8 encoding (see RFC 3629 section 6) and should be allowed. Xerces properly handles this sequence. The code below is a modification to this function to allow for this.
It may be better to remove the "Test Open" functionality altogether and just let Xerces worry about whether the XML is well formed.
    int OGRGMLDataSource::Open( const char * pszNewName, int bTestOpen )

    {
        FILE        *fp;
        char        szHeader[1000];

    /* -------------------------------------------------------------------- */
    /*      Open the source file.                                           */
    /* -------------------------------------------------------------------- */
        fp = VSIFOpen( pszNewName, "r" );
        if( fp == NULL )
        {
            if( !bTestOpen )
                CPLError( CE_Failure, CPLE_OpenFailed,
                          "Failed to open GML file `%s'.",
                          pszNewName );

            return FALSE;
        }

    /* -------------------------------------------------------------------- */
    /*      If we aren't sure it is GML, load a header chunk and check      */
    /*      for signs it is GML                                             */
    /* -------------------------------------------------------------------- */
        if( bTestOpen )
        {
            char *szPtr = szHeader;
            VSIFRead( szHeader, 1, sizeof(szHeader), fp );
            szHeader[sizeof(szHeader)-1] = '\0';

    /* -------------------------------------------------------------------- */
    /*      Check for a UTF-8 BOM and skip if found                         */
    /* -------------------------------------------------------------------- */
            if( ((unsigned char)szPtr[0] == 0xEF)
                && ((unsigned char)szPtr[1] == 0xBB)
                && ((unsigned char)szPtr[2] == 0xBF) )
                szPtr += 3;

            if( szPtr[0] != '<'
                || strstr(szPtr,"opengis.net/gml") == NULL )
            {
                VSIFClose( fp );
                return FALSE;
            }
        }
Attachments (2)
Change History (8)
by , 17 years ago
Attachment: | ogrgmldatasource.cpp added |
---|
comment:2 by , 17 years ago
Component: | default → OGR_SF |
---|
comment:3 by , 17 years ago
Cc: | added |
---|---|
Description: | modified (diff) |
Milestone: | → 1.4.1 |
Owner: | changed from | to
Priority: | low → normal |
Severity: | minor → normal |
Roger,
Could you attach a smallish file with this marker in it?
Mateusz,
Could you fix this trunk and 1.4 branch? We *do* want to keep the pre-checks but they need to be safer. We don't want to start up xerces for every file passed to OGROpen().
Please add a test for this in the test suite.
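One way the pre-check could be made safer is to bound every comparison by the number of bytes actually read, rather than relying on a terminator at the end of a fixed-size buffer. The following is only a sketch under that assumption: `LooksLikeGML` is a hypothetical helper, not an existing OGR function, and it is not the fix that was committed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of OGR): decides whether a header chunk
// looks like GML. nBytes is the number of bytes actually read from the file.
bool LooksLikeGML(const char *pabyHeader, std::size_t nBytes)
{
    assert(pabyHeader != nullptr || nBytes == 0);

    // Skip a UTF-8 BOM (EF BB BF) only if all three bytes are present.
    std::size_t nOffset = 0;
    if (nBytes >= 3 &&
        static_cast<unsigned char>(pabyHeader[0]) == 0xEF &&
        static_cast<unsigned char>(pabyHeader[1]) == 0xBB &&
        static_cast<unsigned char>(pabyHeader[2]) == 0xBF)
    {
        nOffset = 3;
    }

    // The first character after the optional BOM must open an XML element.
    if (nOffset >= nBytes || pabyHeader[nOffset] != '<')
        return false;

    // Copy only the valid bytes, so the search never reads past what was read.
    const std::string osHeader(pabyHeader + nOffset, nBytes - nOffset);
    return osHeader.find("opengis.net/gml") != std::string::npos;
}
```

A caller would pass the return value of the read call as `nBytes`, so a short file cannot cause uninitialized buffer contents to be inspected.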
by , 17 years ago
Attachment: | smallsample.gml added |
---|
comment:5 by , 17 years ago
Status: | new → assigned |
---|
The fix proposed in the attached patch supports the BOM indicator only for the UTF-8 encoding. A BOM is a variable-length mark whose bytes depend on the actual encoding of the XML file. In the future, it would be reasonable to add support for other XML encodings.
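To illustrate why the mark is variable-length, here is a rough sketch of how BOMs for the other Unicode encoding forms could be recognized. The byte sequences are the standard ones; `BomInfo` and `DetectBOM` are hypothetical names used for illustration only and do not exist in OGR.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result type: encoding name and BOM length in bytes.
// pszEncoding is nullptr and nLength is 0 when no BOM is recognized.
struct BomInfo
{
    const char *pszEncoding;
    std::size_t nLength;
};

// Hypothetical helper: inspects the first n bytes of a file for a BOM.
BomInfo DetectBOM(const unsigned char *p, std::size_t n)
{
    assert(p != nullptr || n == 0);

    // UTF-32 is tested before UTF-16 because the UTF-32LE BOM (FF FE 00 00)
    // begins with the UTF-16LE BOM (FF FE). This means a UTF-16LE file whose
    // first character is U+0000 would be reported as UTF-32LE.
    if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF)
        return {"UTF-32BE", 4};
    if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00)
        return {"UTF-32LE", 4};
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return {"UTF-8", 3};
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return {"UTF-16BE", 2};
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return {"UTF-16LE", 2};

    return {nullptr, 0};
}
```

Skipping the BOM is only part of the work for UTF-16 and UTF-32: the bytes that follow are not ASCII-compatible, so a byte-wise search for `<` or `opengis.net/gml` would also need to account for the encoding.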
comment:6 by , 17 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
The fix has been applied (r11191).
Also, the ogr_gml_read.py test suite has been updated with a new case that reads a GML file with a BOM indicator (bom.gml).
Sorry about the typos and munged source. I have attached the modded file from the 1.4.0 source.