Opened 12 years ago

Closed 12 years ago

Last modified 12 years ago

#4389 closed defect (fixed)

OGR/GML: bug in expat-based parsing

Reported by: esseffe Owned by: warmerdam
Priority: normal Milestone:
Component: default Version: svn-trunk
Severity: critical Keywords:
Cc: a.furieri@…

Description

I've identified a flaw in the expat-based GML parser. This issue causes a "File probably corrupted (million laugh pattern)" fatal error; but the GML being parsed is absolutely neat and clean. Please note: there is not Geometry at all in this GML; but it's a fully legal condition.

That's not all: the same identical GML file is successfully parsed on Linux, but miserably fails on Windows.

After further investigation, I've discovered that this issue is usually related with very short tag-names: when tag-names are quite long (=verbose, as usual in GML), this puzzling issue will disappear at all.

I've created two sample GML files [attached] useful to reproduce this issue:

ogr2ogr -f "ESRI Shapefile" bad_shp bad.gml [*** failure ***]
ogr2ogr -f "ESRI Shapefile" good_shp good.gml [*** success ***]

these two GML samples are strictly similar; the unique difference is in tag-names length: very short for bad.gml (actually, a single letter), and rather verbose for good.gml

Exploring the code, I've then discovered that the real cause surely is a bug in the GMLExpatHandler::dataHandlerCbk() implementation:

pThis->m_nDataHandlerCounter ++; 
if (pThis->m_nDataHandlerCounter >= BUFSIZ)
{
    *** ERROR ***
}

Please note well: dataHandlerCbk() is a callback function invoked by expat each time some XML TEXT string has just been retrieved by the parser; so m_nDataHandlerCounter is increased "for each string".

comparing this string-counter to BUFSIZ (the block size internally used by expat to perform an efficient filestream access) seems like comparing apples and oranges; and surely this arbitrary comparison is the actual cause explaining any puzzling issue.
Please note: on Linux BUFSIZ=8192, but on Windows BUFSIZ=512 (and this fully explains why the same GML file fails on one system and succeeds on the other one)

I've not been able to fully understand the intended logic for m_nDataHandlerCounter: I suppose that this variable was originally intended as a precaution against possible buffer overflows in the case of badly formatted XML. Anyway the current implementation is completely bogus; and may cause false-negative issues when parsing "clean" GML

Just as a possible suggestion: on my test case simply removing the complete

if (pThis->m_nDataHandlerCounter >= BUFSIZ) 
{ 
  ... 
}

code block successfully resolves any "crazy" issue. But I strongly doubt that this one is a too much brutal and simplistic solution.

bye, Sandro Furieri

Attachments (1)

gml-samples.zip (12.5 KB ) - added by esseffe 12 years ago.
GML samples showing the issue

Download all attachments as: .zip

Change History (3)

by esseffe, 12 years ago

Attachment: gml-samples.zip added

GML samples showing the issue

comment:1 by Even Rouault, 12 years ago

Resolution: fixed
Status: newclosed
Version: unspecifiedsvn-trunk

I've added more comments in the code, so I hope the intent is better understood.

This is a trunk regression because in trunk the size of the buffer has been expanded, but not the value used for the attack detection. Both values must be strongly correlated. Should be more robust now.

r23566 /trunk/gdal/ogr/ogrsf_frmts/gml/ (gmlhandler.cpp gmlreader.cpp gmlreaderp.h): GML: fix abusive detection of billion laugh attack (trunk only, #4389)

comment:2 by esseffe, 12 years ago

Hi Even,

thanks a lot; after reading your clear and detailed explanation I'm now able to fully understand the underlying logic.

sorry, I was completely unaware of such "XML bomb" attacks: so I was completely unable to imagine a valid reason to compare GML TEXT items count and the BUFSIZ constant.
all right, now it's all absolutely clear.

thanks again, Sandro

Note: See TracTickets for help on using tickets.