Opened 16 years ago

Closed 16 years ago

#1943 closed defect (duplicate)

GML -> SHP Conversion does not return

Reported by: Alex
Owned by: Mateusz Łoskot
Priority: normal
Milestone:
Component: OGR_SF
Version: 1.4.1
Severity: normal
Keywords: gml win32 CPLReadLine AVC
Cc: warmerdam

Description

ogr2ogr starts reading a large GML file at about 4 MB/sec, and the rate then keeps dropping. RAM consumption starts at 30 MB and is already at 85 MB after 5 minutes, still growing. The last working version I have found is 0.9.6. It takes around 7 MB of RAM and reads the file at 10-15 MB/sec. After only 4 minutes it starts to write the Shapefile (1.4.1 needs much more time), and after 7 minutes the conversion has finished with a 190 MB Shapefile.

The command I use is:

ogr2ogr -skipfailures -f "ESRI Shapefile" DGKLW.shp DGKLW.gml

I do not know whether any of the newer ogr2ogr versions ever returns. I started a conversion yesterday, but this morning my computer was unresponsive. Maybe ogr2ogr had used up all of the 2 GB RAM and 4 GB swap in the meantime ...

I will upload the GML and GFS files so you have a chance to debug.

Attachments (2)

thuer_l_f-longest-line.png (23.3 KB ) - added by Mateusz Łoskot 16 years ago.
Longest line (10th) of 20000/thuer_l_f.gml counted in Vim
gdal-avc-cplreadline-timing.png (11.5 KB ) - added by Mateusz Łoskot 16 years ago.
Simple timing of CPLReadLine() calls from _AVCE00ReadScanE00() under Windows XP


Change History (15)

comment:1 by Alex, 16 years ago

Version: unspecified → 1.4.1

comment:2 by Alex, 16 years ago

Where can I upload my testcase? I have a ZIP with 19MB ...

comment:3 by Mateusz Łoskot, 16 years ago

Alex,

Please, use my FTP server:

Hostname: ftp.loskot.net Username: gdal@… Password: gdal2007

comment:4 by Mateusz Łoskot, 16 years ago

One additional note about browsing the FTP server with a web browser. Here is the URL to use:

ftp://gdal@loskot.net@ftp.loskot.net

and you will be asked for the password.

comment:5 by Alex, 16 years ago

I've uploaded a zip file, ogr2ogr.zip. It contains 2 directories: 10000 and 20000. Converting the GML in 10000 (with 10000 geometries) takes 2 minutes; converting the one in 20000 (with 20000 geometries) takes 9 minutes.

comment:6 by Even Rouault, 16 years ago

I cannot reproduce the problem with 10000 or 20000, neither with GDAL 1.4.1 nor with GDAL 1.5dev. Translation to shape of 20000 takes around 10 seconds on my PC (AMD Athlon(tm) 64 Processor 3200+ / 512 MB RAM)

comment:7 by Alex, 16 years ago

Have you tried it on Windows or on Linux? My PC should be fast enough: AMD Athlon 64 X2 Dual Core 4200+ with 2 GB RAM running Windows XP (32bit) Prof. SP 2. I've installed FWTools 2.0.0 but, as far as I can see, it includes the same gdal version (1.5dev) as in 1.4.1, so the runtime is the same. Do you have any idea what I should test to get the problem fixed?

comment:8 by warmerdam, 16 years ago

Cc: warmerdam added
Owner: changed from warmerdam to Mateusz Łoskot

Mateusz,

Could you try and reproduce this with a normal windows build, and if that doesn't show the problem, then with fwtools? If the problem only occurs with fwtools, then bounce it back to me. It may be that I need a newer xerces or something.

comment:9 by Mateusz Łoskot, 16 years ago

Status: new → assigned

Just for the record, I tested both datasets, 10000 and 20000, on Mac OS X 10.4 with success. It takes 4-6 secs to translate the GML files to ESRI Shapefile, and all features are included in the output shapefile.

comment:10 by Mateusz Łoskot, 16 years ago

I confirm the problem exists on Windows. It takes about 15 minutes to translate the GML from 20000 to Shapefile (24 MB output file) on a virtual machine with 512 MB RAM hosted on a Mac Pro with 2x2.66 GHz Intel Xeon, 5 GB RAM.

comment:11 by Mateusz Łoskot, 16 years ago

Two problems cause the bad performance:

  1. Inefficient OGR datasource format detection in OGRSFDriverRegistrar::Open().
  2. A big difference in the performance of the CPLReadLine() function between Unix and Windows.

  • Ad. 1.

The function iterates through the available drivers and tries to open the datasource with each of them. The problem is that the test for the AVC driver takes ~10 minutes. The AVC driver neither filters the passed datasource name by file extension (i.e. to fail immediately for potentially big but unsupported files such as .gml, .xml, .kml, etc.), nor does it test the first bytes of the file content (i.e. to check whether there is an XML prolog).

However, there is nothing unusual about this issue and it should be easy to fix.
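
The kind of pre-filter described above could look roughly like the following. This is only a minimal sketch in Python, not GDAL code: the function names, the extension list and the 512-byte sniff size are all assumptions made for illustration, and a real fix would live in the AVC driver's own open logic.

```python
import os

# Hypothetical list of extensions that a line-oriented driver could reject
# immediately (taken from the examples named in the comment above).
XML_LIKE_EXTENSIONS = {".gml", ".xml", ".kml"}


def looks_like_xml(header: bytes) -> bool:
    """Return True if the first bytes of a file start with an XML prolog."""
    # Skip an optional UTF-8 byte order mark and leading whitespace.
    if header.startswith(b"\xef\xbb\xbf"):
        header = header[3:]
    return header.lstrip().startswith(b"<?xml")


def should_skip_line_scan(path: str, sniff_bytes: int = 512) -> bool:
    """Decide cheaply whether an expensive line-by-line probe can be skipped."""
    # Cheap check 1: the file extension, without touching the file at all.
    ext = os.path.splitext(path)[1].lower()
    if ext in XML_LIKE_EXTENSIONS:
        return True
    # Cheap check 2: read only a small header instead of whole lines.
    with open(path, "rb") as f:
        return looks_like_xml(f.read(sniff_bytes))
```

Either check costs a constant amount of work regardless of file size, which is the point: the probe never has to read a multi-megabyte line just to decide that the file is not for this driver.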

  • Ad. 2.

This problem is directly related to the first one. The AVC driver, in the private function _AVCE00ReadScanE00(), iterates through the lines of a file with successive calls to CPLReadLine(). The test file 20000/thuer_l_f.gml contains only 10 lines, but the last line is 44628396 characters long (see attached thuer_l_f-longest-line.png).
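
For anyone who wants to check a file for this kind of pathological line without an editor, the count can be reproduced with a short script. This is a sketch only; `line_stats` is a hypothetical helper name, and it counts bytes rather than characters (including any trailing `\r`), so results can differ slightly from what Vim reports for non-ASCII content.

```python
def line_stats(path, chunk_size=1 << 20):
    """Return (number_of_lines, longest_line_in_bytes) for a file.

    The file is read in fixed-size chunks, so a single huge line never
    has to be held in memory as one string.
    """
    lines = 0
    longest = 0
    current = 0  # length of the line currently being accumulated
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            parts = chunk.split(b"\n")
            # Every part except the last one was terminated by a newline.
            for part in parts[:-1]:
                current += len(part)
                longest = max(longest, current)
                lines += 1
                current = 0
            # The last part is an unfinished line that may continue
            # in the next chunk.
            current += len(parts[-1])
    # A final line without a trailing newline still counts as a line.
    if current:
        lines += 1
        longest = max(longest, current)
    return lines, longest
```

Calling `line_stats("20000/thuer_l_f.gml")` on the test file from this ticket should return the line count and the length of its longest line.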

The CPLReadLine() call for this long line behaves very differently on Windows than on Unix. On Unix (Mac OS X) a complete read of the 44628396 characters takes ~3 seconds, but on Windows it takes ~900 seconds (see gdal-avc-cplreadline-timing.png), regardless of whether GDAL is built in optimized or debug mode.

At first glance there appear to be two bottlenecks: a large number of memory allocations and a large number of I/O operations.
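
To give a feel for why the number of allocations can matter for a line of this size, here is a small model that counts buffer enlargements under two growth policies. It is an illustration only and makes no claim about how CPLReadLine() actually grows its buffer: the 128-byte initial size, the fixed 128-byte increment and the doubling policy are all assumptions chosen for the comparison.

```python
LINE_LENGTH = 44628396  # length of the longest line reported in this ticket


def grow_fixed(capacity):
    """Hypothetical policy: enlarge the buffer by a constant 128 bytes."""
    return capacity + 128


def grow_doubling(capacity):
    """Hypothetical policy: double the buffer on every enlargement."""
    return capacity * 2


def count_growth_steps(line_length, initial, grow):
    """Return (enlargements, bytes_copied) needed to hold line_length bytes.

    bytes_copied models the worst case in which every enlargement moves
    the existing buffer contents to a new location.
    """
    capacity = initial
    steps = 0
    copied = 0
    while capacity < line_length:
        copied += capacity
        capacity = grow(capacity)
        steps += 1
    return steps, copied


fixed_steps, fixed_copied = count_growth_steps(LINE_LENGTH, 128, grow_fixed)
double_steps, double_copied = count_growth_steps(LINE_LENGTH, 128, grow_doubling)
```

Under these assumptions the fixed-increment policy needs several hundred thousand enlargements where doubling needs fewer than twenty, and the worst-case amount of copied data differs by orders of magnitude. Whether anything like this is what happens on Windows would have to be confirmed by profiling.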

After the AVC driver finishes all of its tests, translation of the test GML file 20000/thuer_l_f.gml takes a few seconds.

As a simple solution, I'd propose adding simple datasource type filtering by file extension and header content, as pointed out in 1. Also, if time permits, it would be a good idea to debug and profile CPLReadLine() for huge lines.

Looking forward to your opinions, guys.

by Mateusz Łoskot, 16 years ago

Attachment: thuer_l_f-longest-line.png added

Longest line (10th) of 20000/thuer_l_f.gml counted in Vim

by Mateusz Łoskot, 16 years ago

Attachment: gdal-avc-cplreadline-timing.png added

Simple timing of CPLReadLine() calls from _AVCE00ReadScanE00() under Windows XP

comment:12 by warmerdam, 16 years ago

Component: default → OGR_SF
Keywords: gml win32 CPLReadLine AVC added

The problem with poor detection in the AVC driver was already recently reported as #1989.

The problem with CPLReadLine() being slow on Windows may be related to the logic at:

http://trac.osgeo.org/gdal/browser/trunk/gdal/port/cpl_conv.cpp#L365

Please create another bug report for the CPLReadLine() slowness on Windows, and see if you can come up with suggestions that do not break the CPLReadLine() support for text mode files. At that point I think we can close this report and pursue the two issues in their distinct tickets.

Alex - thanks for the report leading us to these problems!

comment:13 by Mateusz Łoskot, 16 years ago

Resolution: duplicate
Status: assigned → closed

I reported a new ticket for the CPLReadLine() issue:

#1999 - Slowness of CPLReadLine() function on Windows

I'm closing this ticket as a duplicate of tickets #1989 and #1999.
