Ticket #1943 (closed defect: duplicate)

Opened 9 months ago

Last modified 8 months ago

GML -> SHP Conversion does not return

Reported by: Alex Assigned to: mloskot
Priority: normal Milestone:
Component: OGR_SF Version: 1.4.1
Severity: normal Keywords: gml win32 CPLReadLine AVC
Cc: warmerdam

Description

ogr2ogr starts reading a large GML file at about 4 MB/sec, and the rate then keeps dropping. RAM consumption starts at 30 MB and now (after 5 minutes) is already at 85 MB and still growing. The last working version I've found is 0.9.6. It takes around 7 MB of RAM and reads the file at 10-15 MB/sec. After only 4 minutes it starts to write the Shapefile (1.4.1 needs much more time), and after 7 minutes the conversion has finished with a 190 MB Shapefile.

The command I use is:

ogr2ogr -skipfailures -f "ESRI Shapefile" DGKLW.shp DGKLW.gml

I do not know whether the newer ogr2ogr versions ever return. Yesterday I started a conversion, but this morning my computer was unresponsive. Maybe ogr2ogr had meanwhile used up all of the 2 GB of RAM and 4 GB of swap ...

I will upload the GML and GFS file, so you have a chance to debug.

Attachments

thuer_l_f-longest-line.png (23.3 kB) - added by mloskot on 11/16/07 10:17:32.
Longest line (10th) of 20000/thuer_l_f.gml counted in Vim
gdal-avc-cplreadline-timing.png (11.5 kB) - added by mloskot on 11/16/07 10:18:11.
Simple timing of CPLReadLine() calls from _AVCE00ReadScanE00() under Windows XP

Change History

10/26/07 05:06:25 changed by Alex

  • version changed from unspecified to 1.4.1.

10/26/07 09:31:44 changed by Alex

Where can I upload my testcase? I have a ZIP with 19MB ...

10/26/07 09:54:56 changed by mloskot

Alex,

Please, use my FTP server:

Hostname: ftp.loskot.net Username: gdal@loskot.net Password: gdal2007

10/26/07 10:08:52 changed by mloskot

One additional note about browsing the FTP using web browser. Here is the URL to use:

ftp://gdal@loskot.net@ftp.loskot.net

and you will be asked for password.

10/26/07 10:51:19 changed by Alex

I've uploaded a zip file, ogr2ogr.zip. It contains 2 directories: 10000 and 20000. Converting the GML in 10000 (with 10000 geometries) takes 2 minutes; converting the one in 20000 (with 20000 geometries) takes 9 minutes.

11/11/07 15:14:26 changed by rouault

I cannot reproduce the problem with 10000 or 20000, neither with GDAL 1.4.1 nor with GDAL 1.5dev. Translation to shape of 20000 takes around 10 seconds on my PC (AMD Athlon(tm) 64 Processor 3200+ / 512 MB RAM)

11/15/07 07:48:20 changed by Alex

Have you tried it on Windows or on Linux? My PC should be fast enough: AMD Athlon 64 X2 Dual Core 4200+ with 2 GB RAM running Windows XP (32-bit) Prof. SP 2. I've installed FWTools 2.0.0, but as far as I can see it includes the same GDAL version (1.5dev) as in 1.4.1, so the runtime is the same. Do you have any idea what I should test to get the problem fixed?

11/15/07 13:39:07 changed by warmerdam

  • owner changed from warmerdam to mloskot.
  • cc set to warmerdam.

Mateusz,

Could you try to reproduce this with a normal Windows build, and if that doesn't show the problem, then with FWTools? If the problem only occurs with FWTools, then bounce it back to me. It may be that I need a newer Xerces or something.

11/16/07 00:01:33 changed by mloskot

  • status changed from new to assigned.

Just for the record, I tested both datasets, 10000 and 20000, on Mac OS X 10.4 without problems. It takes 4-6 seconds to translate the GML files to ESRI Shapefile, and all features are included in the output Shapefile.

11/16/07 01:39:07 changed by mloskot

I confirm the problem exists on Windows. It takes about 15 minutes to translate the GML from 20000 to a Shapefile (24 MB output file) on a virtual machine with 512 MB RAM hosted on a Mac Pro with 2x2.66 GHz Intel Xeon and 5 GB RAM.

11/16/07 10:16:54 changed by mloskot

Two problems cause the bad performance:

  1. The first issue is inefficient OGR datasource format detection in OGRSFDriverRegistrar::Open()
  2. The second issue is a big difference in the performance of the CPLReadLine() function between Unix and Windows

  • Ad. 1.

The function iterates through the available drivers and tries to open the datasource with each one. The problem is that the test for the AVC driver takes ~10 minutes. The AVC driver does not filter the passed datasource name by file extension (i.e. to fail immediately for potentially big but unsupported files such as .gml, .xml, .kml, etc.), nor does it test the first bytes of the file content (i.e. to check whether there is an XML prolog).

However, this issue is nothing unusual and should be easy to fix.
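The two checks described above (by file extension and by the first bytes of the file) could look roughly like the following. This is only a minimal sketch: `HasXmlLikeExtension` and `LooksLikeXml` are hypothetical helper names, not existing GDAL/OGR functions, and the extension list is just the one mentioned in this comment.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of GDAL): true if `path` ends with one of the
// XML-based extensions mentioned above, compared case-insensitively.
bool HasXmlLikeExtension(const std::string& path) {
    static const char* const kExts[] = {".gml", ".xml", ".kml"};
    std::string lower(path);
    std::transform(lower.begin(), lower.end(), lower.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    for (const char* ext : kExts) {
        const std::string e(ext);
        if (lower.size() >= e.size() &&
            lower.compare(lower.size() - e.size(), e.size(), e) == 0) {
            return true;
        }
    }
    return false;
}

// Hypothetical helper (not part of GDAL): true if the first bytes of a file,
// passed in `header`, start with an XML prolog. An optional UTF-8 BOM and
// leading whitespace are skipped first.
bool LooksLikeXml(const std::string& header) {
    std::size_t i = 0;
    if (header.compare(0, 3, "\xEF\xBB\xBF") == 0) {
        i = 3;  // skip UTF-8 byte order mark
    }
    while (i < header.size() &&
           std::isspace(static_cast<unsigned char>(header[i]))) {
        ++i;
    }
    return header.compare(i, 5, "<?xml") == 0;
}
```

A driver that cannot handle XML could call both helpers at the top of its open routine and return failure right away, before scanning any lines of the file.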

  • Ad. 2.

This problem is directly related to the first one. The AVC driver, in the private function _AVCE00ReadScanE00(), iterates through the lines of a file with successive calls to CPLReadLine(). The test file 20000/thuer_l_f.gml contains only 10 lines, but the last line is 44628396 characters long (see attached thuer_l_f-longest-line.png).

The CPLReadLine() call for this long line behaves very differently on Windows than on Unix. On Unix (Mac OS X) a complete read of the 44628396 characters takes ~3 seconds, but on Windows it takes ~900 seconds (see gdal-avc-cplreadline-timing.png), regardless of whether GDAL is built in optimized or debug mode.

A quick look suggests that there are two bottlenecks: a large number of memory allocations and a large number of I/O operations.
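To illustrate why the number of allocations could matter for a line this long, the sketch below counts how often a buffer has to grow to reach a target size under two growth policies. It is an assumption-laden illustration only: it does not reflect how CPLReadLine() actually manages its buffer (that has not been profiled here), and the function names and any step size passed in are hypothetical.

```cpp
#include <cstddef>

// Counts how many times a buffer must be grown to hold `target` bytes when
// every growth adds a fixed `step` bytes. `step` must be greater than zero.
std::size_t GrowthsFixedStep(std::size_t target, std::size_t initial,
                             std::size_t step) {
    std::size_t capacity = initial;
    std::size_t growths = 0;
    while (capacity < target) {
        capacity += step;
        ++growths;
    }
    return growths;
}

// Counts how many times a buffer must be grown to hold `target` bytes when
// every growth doubles the capacity. A zero `initial` is treated as 1.
std::size_t GrowthsDoubling(std::size_t target, std::size_t initial) {
    std::size_t capacity = (initial == 0) ? 1 : initial;
    std::size_t growths = 0;
    while (capacity < target) {
        capacity *= 2;
        ++growths;
    }
    return growths;
}
```

For the 44628396-character line, a hypothetical 128-byte buffer growing in 128-byte steps would need on the order of 350,000 growths, while doubling from the same start needs 19. If each growth also copies the existing contents, the fixed-step policy costs quadratic time in the line length.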

Once the AVC driver finishes its test, translation of the test GML file 20000/thuer_l_f.gml takes a few seconds.

As a simple solution, I'd propose adding datasource type filtering by file extension and header content, as described in point 1. Also, if time permits, it would be a good idea to debug and profile CPLReadLine() for huge lines.

Looking forward to your opinions, guys.

11/16/07 10:17:32 changed by mloskot

  • attachment thuer_l_f-longest-line.png added.

Longest line (10th) of 20000/thuer_l_f.gml counted in Vim

11/16/07 10:18:11 changed by mloskot

  • attachment gdal-avc-cplreadline-timing.png added.

Simple timing of CPLReadLine() calls from _AVCE00ReadScanE00() under Windows XP

11/16/07 10:47:03 changed by warmerdam

  • keywords set to gml win32 CPLReadLine AVC.
  • component changed from default to OGR_SF.

The problem with poor detection in the AVC driver was recently reported as #1989.

The problem with CPLReadLine() being slow on Windows may be related to the logic at:

http://trac.osgeo.org/gdal/browser/trunk/gdal/port/cpl_conv.cpp#L365

Please create another bug report for the CPLReadLine() slowness on Windows, and see if you can come up with suggestions that do not break CPLReadLine() support for text mode files. At that point I think we can close this report and pursue the two issues in their distinct tickets.

Alex - thanks for the report leading us to these problems!

11/16/07 11:33:30 changed by mloskot

  • status changed from assigned to closed.
  • resolution set to duplicate.

I reported a new ticket about the CPLReadLine() issue:

#1999 - Slowness of CPLReadLine() function on Windows

I'm closing this ticket as a duplicate of tickets #1989 and #1999.