Ticket #4380 (closed enhancement: fixed)
Support gml:xlink resolving for huge GML files
|Reported by:||rouault||Owned by:||warmerdam|
Hi All, I'm here to submit to your attention an alternative gml:xlink resolving method (GML driver) I've implemented for Tuscany Region. Tuscany Region context and requirements: ======================================== Their brand new official vector cartography is fully topology-based, and GML is the main format adopted as primary source / long term storage. Any possible alternative source (SHP, Spatial DBMS such as PostGIS and SpatiaLite) will be later derived from such primary GML files. And ogr2ogr is the main tool currently used to transfer data between different sources. the whole regional cartography consist in several dozens of huge GML files (> 2GB each one), and each single GML file contains about an hundredth layers and many tenths of ancillary data tables (some million features per single GML file). the complete GML schema is formally defined by an accurate but really complex XSD; actually a very ramified one, because it's widely based on <xs:include> elements. Critical issues found in current ogr2org: ========================================= enabling GML_SKIP_RESOLVE_ELEMS NONE effectively support xlink:href resolution. a) anyway, as the official doc honestly recognizes "the resolution algorithm is not optimised for large files. For files with more than a couple of thousand xlink:href tags, the process can go beyond a few minutes". For real-world GML files (some million xlink:href occurrences) Tuscany Region actually measured discouraging timings (well in excess of 24 hours per single GML file). b) and a physical size limitation exists: the current implementation for GML_SKIP_RESOLVE_ELEMS NONE builds a memory-based GML node-tree, used for xlink:href resolution. This practically means that no GML file bigger than 1GB (approx.) could never be successfully parsed at all, simply because the address space available for 32-bit sw forbids to allocate so much RAM as required by this intensively memory-based method. The proposed GML_SKIP_RESOLVE_ELEMS HUGE alternative: ===================================================== I've carefully evaluated an alternative strategy allowing to skip all the above critical issues, but still preserving absolutely untouched the main core of the current GML implementation. The unique bottleneck I identified simply was using a memory-based GML node-tree. So I started developing an alternative method uniquely focused on xlink:href resolution, not at all requiring to use huge memory allocations, and possibly offering enhanced efficiency and better performances when parsing huge GML files. Please note well: - GML_SKIP_RESOLVE_ELEMS NONE still supports the previous xlink:href resolution method exactely as before - GML_SKIP_RESOLVE_ELEMS HUGE activates the new alternative resolution method - any other facet of the GML driver is left completely untouched; the difference between NONE and HUGE methods simply is in xlink:href resolution and in xxx.resolved.gml output generation. Main design: - using a temporary DB (SQLite) for xlink:href resolution - then outputting an xxxx.resolved.gml file, exactly as the old NONE method does Advantages: - a DBMS is the tool of excellence for massive relational pairs resolution (JOIN): any DBMS surely is carefully optimized for this specific task, there is no need to reinvent the wheel yet another time. - this method is completely file-system-based; a very low memory footprint is required, even when processing impressively huge GML files - SQLite is absolutely simple and light-weighted (and doesn't requires any installation/configuration at all). - an auxiliary SQLite DB-file can be used in the simplest way exactly as it was a trivial temporary file of a more conventional type. - SQLite already is included between GDAL/OGR dependencies, no further complexity and/or dependencies are introduced. Testing and measured timings: ============================= - the current GML_SKIP_RESOLVE_ELEMS NONE method was unable to open any GML file bigger than approx. 1GB (insufficient memory) - the alternative GML_SKIP_RESOLVE_ELEMS HUGE method is able to successfully parse GML files as big as approx. 3GB (there is no theoretically imposed size limit; simply we had no sample bigger than this to test) - the old method required 24+ hours to resolved a medium sized GML file. - the new SQLite-based method requires about 1 hour to resolve a huge 3GB GML file (containing million+ xlink:href pairs). Implementation details: ======================= some further configuration options are available when GML_SKIP_RESOLVE_ELEMS HUGE is set: - if GML_HUGE_TEMPFILE YES is set, the SQLite DB-file used to resolve xlink:href / gml:id relations will be immediately removed as no longer required (i.e. once xxxx.resolved.gml output has been generated). usually this auxiliary DB-file is a huge file too, and is completely useless once the resolved file has already been generated. removing it from the file-system strongly helps to save a lot of precious disk space when processing many really huge GML files one after the other. - if GML_HUGE_SQLITE_UNSAFE YES is set, the auxiliary SQLite DB-file will be configured so to enable any possible performance boost. This includes disabling at all the support for transactions: an intrinsically unsafe option (unable to recover the DB after a fatal system crash), but absolutely harmless in this specific case (this simply is a temporary file, we'll recreate it again from scratch next time). And finally a GML_GFS_TEMPLATE path_to_template.gfs option was introduced supporting GML_SKIP_RESOLVE_ELEMS HUGE Rationale for GML_GFS_TEMPLATE: ------------------------------- - the current OGR GML implementation attempts to 'sniff' the data layout; this practically means that subsequent runs of ogr2ogr will easily guess different layouts for the same class/layer (some attributes are optional, many alphanumeric keys will be wrongly recognized as integers ...). and this one is a painful issue when you have to import many dozens distinct GML files into the same DBMS [-append] - the current XSD support is insufficient, because it doesn't allows to download XSD via URL, and because it doesn't supports <xs:include> directives. - the current GFS support is fairly adequate, once you've carefully hand-written a GFS file exactly corresponding to your complex XSD schema. allowing to unconditionally select the same GFS file for many subsequent import runs surely makes your life easiest, and ensures a strongly consistent data layout to be used everywhere. Proposed patch: =============== not too much complex, after all ... please see the attached svn diff and source: a) -/ogr/ogrsf_frmts/gml/hugefileresolver.cpp full implementation of GML_SKIP_RESOLVE_ELEMS HUGE and related options (mainly SQLite handling stuff) b) -/ogr/ogrsf_frmts/gml/gmlreader.h -/ogr/ogrsf_frmts/gml/gmlreaderp.h -/ogr/ogrsf_frmts/gml/ogrgmldatasource.cpp changing few C++ class definitions and implementing GML_SKIP_RESOLVE_ELEMS HUGE activation c) -/ogr/ogrsf_frmts/gml/GNUmakefile building hugefileresolver.cpp d) -/ogr/ogrsf_frmts/gml/drv_gml.html updated documentation e) -/ogr/ogrsf_frmts/nas/nasreaderp.h -/ogr/ogrsf_frmts/nas/nasreader.cpp simply few formal changes for C++ class definitions so to maintain full compatibility with GML base module. thanks for your attention and consideration, Sandro Furieri
Hi Even, we have many points to be discussed; so for the sake of clarity I'll split my reply in two distinct parts: - code changes (the patch itself) - interesting but hard to be implemented suggestions and other minor detail facets. this one is my reply about the patch main core. > * #include "sqlite3.h" in ogr/ogrsf_frmts/gml/gmlreaderp.h > is not appropriate in that file, and should be only found > in hugefileresolver.cpp. SQLite is a very common dependency > in most GDAL/OGR builds, but not a compulsory one. > So some adjustments will be needed in the GNUmakefile/makefile.vc > to detect if sqlite dependency is really available before > trying to build hugefileresolver.cpp and calling it from > ogrgmldatasource.cpp > all right: I've implemented your suggestion. now you can built OGR even when libsqlite isn't available at all; and in this case a "fake" implementation for few methods will simply report an error (missing SQLite support) I hope to have fixed both GNUmakefile and makefile.vc; anyway a careful check from you surely is welcome. > * GML_GFS_TEMPLATE config option is general purpose, and could > be made also available even when xlink:href resolving is not asked. > nice: the HUGE resolver internally managed the TEMPLATE related stuff. now I've added two further classes [GFSTemplateList and GFSTempalteItem] and a new method [PrescanForTemplate] allowing to make TEMPLATE usage available even when the HUGE resolver isn't actually invoked. > * I feel you could completely remove the GML_HUGE_SQLITE_UNSAFE > config option, and decide it to be TRUE. I don't see any reason > why one would want to have recovery guarantees for a temporary file. > If the process is interrupted, I suppose that the full resolving > must be starting again and the temp file will be recreated : correct ? > absolutely correct: I've completely removed GML_HUGE_SQLITE_UNSAFE; now this one is unconditionally the default implementation. > * Same for GML_HUGE_TEMPFILE. Why would one want to keep the > temporary sqlite DB ? Or perhaps keep it undocumented for just > debugging purposes, and make it default to YES. > now GML_HUGE_TEMPFILE simply is an undocumented debug option; and the default setting is YES > * A bit correlated with the above : the use of int* pbOutIsTempFile > in hugeFileResolver() is confusing, because (contrary to the > resolvexlinks.cpp case) it is not used as an output parameter, but > as an input one > all right, I've corrected this. > * In hugefileresolver.cpp, I think that the hand-parsing of XML is > unnecessary in gmlHugeFileCheckXrefs(). The only caller of > gmlHugeFileCheckXrefs() does a gmlText = CPLSerializeXMLTree((CPLXMLNode *)pszNode) > just before. Why not just exploring the nodes through the CPLXMLNode > objects ? This would be much cleaner and robust (in case the XML > serialization changes a bit), and likely a bit faster too. > accepted: I've completely refactored my code attempting to use as far as possible CPLXMLNode and its methods. and effectively now the code is strongly simplified and better arranged (and I hope too, much more robust and stable). As an interesting side effect, this deep code refactoring caused many further simplifications and optimizations to be applied at SQL level; and the 'final' implementation is about 50% faster than the previous one. > * There might be security issues due to the use of char XXX. > I see quite a few strcpy() with no size checking. I'd suggest using > CPLStrlcpy(). > even better than this: I've carefully removed any fixed-size buffer, using CPLString objects instead. > * A few naming inconsistencies. The z in papszGeomList = > poFeature->GetGeometryList() should be removed because no > strings are used. Same for pszNode = papszGeomList[i]. > really sorry for this :-D just a short consideration: I'm not at all a fan of the so-called Hungarian naming convention (sincerely, I hate it). AFAIK this is widely used on Microsoft own code, and GDAL/OGR is the first open source project I've ever seen extensively using Hungarian. Please, be patient and tolerant about this ... I've surely put many 'wrong' names here and there, simply because I'm completely not accustomed to used such coding style. ;-) > * At line 1793 of hugefileresolver.cpp, VSIFPrintfL ( fp, > " <%s>\%s</%s>\n", --> the \ before %s looks wrong. > sorry for this typo: many thanks to you for noticing it > And I see no use of XML escaping for the value of the attribute > (use CPLEscapeString()) > nice suggestion: accepted
Note: See TracTickets for help on using tickets.