Opened 16 years ago

Closed 16 years ago

Last modified 16 years ago

#2428 closed defect (fixed)

shapefile crash on huge polygon

Reported by: richengel
Owned by: Even Rouault
Priority: normal
Milestone: 1.5.4
Component: OGR_SF
Version: 1.5.1
Severity: normal
Keywords: shape
Cc:

Description

Attempting to read shapefile crashes in OGRGeometryFactory::organizePolygons for a particular shapefile. The shapefile has 1 polygon with 2514199 vertices and 43688 parts, according to shpdump. Get SIGABRT from new[] on line #806 in ogrgeometryfactory.cpp. Here i = 23246, so it apparently exhausted available memory in building the relations table. Testing done with a 32 bit version.

Change History (13)

comment:1 by Even Rouault, 16 years ago

Owner: changed from warmerdam to Even Rouault

I'll try to address that. It needs a bit of thinking to see how we can avoid the O(N²) memory consumption, where N is the number of parts... (The number of vertices has no influence, except maybe on speed.) I think we'll have to switch back to a more simplistic (thus approximate) algorithm when N > 100, for example.

comment:2 by Even Rouault, 16 years ago

I'm curious to see what such a huge polygon looks like: many islands, many holes in a big polygon? Would it be possible to make it available for download?

comment:3 by richengel, 16 years ago

The feature is probably better characterized as a multipolygon. One large polygon consists of the boundary of the 48 contiguous US states. There are a few lakes and a large number of islands near the shore. The handling of the Great Lakes is somewhat strange. The border here is sort of a partial, loose-fitting convex hull which includes the lakes entirely inside the big polygon. There are islands in the Great Lakes, but none seem to have their own lakes. I attempted to attach a gzipped tar file, but it was too big (27 MB). Perhaps an FTP site would help.

comment:4 by Even Rouault, 16 years ago

Yes, 27 MB is too big for attachments in a ticket here. I would definitely be interested in having your file as a stress test for my attempts at improving the algorithm. If you can't upload it to an FTP server, maybe you can try sending it to this Yahoo email account: even dot rouault at yahoo dot fr (no guarantee that it accepts big attachments).

My idea for the fix is in two steps:

  • Improve the way the current algorithm works so that it uses only O(N) memory instead of O(N*N). However, the computational complexity will still be O(N*N). For N=43688, I'm afraid that will be painfully slow.
  • Add a simplified version of the algorithm that relies heavily on the order of the parts. If the following constraints are respected, we can hope for linear time. Let's define B <= A as meaning B is enclosed in A (== B is a lake of A). The constraints are:
     (1) An enclosing polygon A must be placed in the list before its enclosed polygons B1, B2...
     (2) An enclosed polygon B of a polygon A cannot come after an unrelated polygon C.
     (3) The maximum enclosing depth is 1, that is to say there are no situations like C <= B <= A.

A bit of ASCII art to make things clearer :

XXXXXXXXXXXXXXXXXXXX       XXXXXXXXXXXXXXXXXXXX
X                  X       X                  X
X   A              X       X   C              X
X                  X       X                  X
X  XXXXXX  XXXXXX  X       X  XXXXXX  XXXXXX  X 
X  X    X  X    X  X       X  X    X  X    X  X
X  X B1 X  X B2 X  X       X  X D1 X  X D2 X  X
X  X    X  X    X  X       X  X    X  X    X  X
X  XXXXXX  XXXXXX  X       X  XXXXXX  XXXXXX  X
X                  X       X                  X
XXXXXXXXXXXXXXXXXXXX       XXXXXXXXXXXXXXXXXXXX

The acceptable inputs for the simplified version of the algorithm are: (A,B1,B2,C,D1,D2), (A,B2,B1,C,D1,D2), (A,B1,B2,C,D2,D1), (A,B2,B1,C,D2,D1), (C,D1,D2,A,B1,B2), (C,D1,D2,A,B2,B1), (C,D2,D1,A,B1,B2) or (C,D2,D1,A,B2,B1). And that's all, I think!
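The order-dependent grouping that constraints (1)-(3) allow can be sketched in a few lines of Python. This is only an illustration of the idea, not GDAL's implementation; the ring representation and function names are hypothetical:

```python
def organize_rings_simple(rings):
    """Group rings into (outer, [holes]) pairs, assuming the input order
    respects constraints (1)-(3): each outer ring precedes its holes, the
    holes of an outer ring are contiguous, and nesting depth is at most 1.
    Each ring is a list of (x, y) vertex tuples."""

    def point_in_ring(pt, ring):
        # Standard even-odd ray-casting point-in-polygon test.
        x, y = pt
        inside = False
        n = len(ring)
        for i in range(n):
            x1, y1 = ring[i]
            x2, y2 = ring[(i + 1) % n]
            if (y1 > y) != (y2 > y) and \
               x < (x2 - x1) * (y - y1) / (y2 - y1) + x1:
                inside = not inside
        return inside

    polygons = []  # list of (outer_ring, [hole_rings])
    for ring in rings:
        # Under the constraints, one containment test against the most
        # recent outer ring is enough to classify each part.
        if polygons and point_in_ring(ring[0], polygons[-1][0]):
            polygons[-1][1].append(ring)   # hole of the current outer ring
        else:
            polygons.append((ring, []))    # starts a new outer ring
    return polygons
```

Each part triggers at most one point-in-ring test against the current outer ring, which is what makes the cost linear in the number of parts (times the ring length for the containment test).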

From your description of what your multipolygon looks like (if I've understood correctly, which I'm unsure about regarding the Great Lakes), I'm afraid that we cannot rely on (3), sadly... (3) is a necessary condition to guarantee linear performance, but we could probably live without it at the expense of more computations. However, (1) and (2) are absolutely necessary. I'd like to be sure that your test case at least respects (1) and (2) before trying to implement that.

comment:5 by Even Rouault, 16 years ago

In trunk, in r14727, I've committed an improved version of the current algorithm that consumes only O(N) bytes of memory. Speed is probably also improved by a pre-sort by area.

comment:6 by richengel, 16 years ago

Thanks, I'll test the fix with my data. The file was too big for my email to send, and I haven't found a suitable public FTP drop site. Your reading of my comments on the Great Lakes is correct; we have 2 levels of nesting there.

comment:7 by Even Rouault, 16 years ago

Component: default → OGR_SF
Keywords: shape added
Milestone: 1.5.3

Fix backported in branches/1.5 in r14734

comment:8 by richengel, 16 years ago

I tested on my problem data and this version now works. A couple of comments: wasn't the purpose of organizePolygons to handle multiple levels of nesting?

I did some timing tests on a couple of data sets. I conditionally disabled the organizePolygons block in favor of the subsequent block:

/* OGRGeometryFactory::organizePolygons does not require GEOS anymore */
else if( CSLTestBoolean(CPLGetConfigOption( "OGR_FULL_POLY", "NO" ) ) )

My test reads all the features in a layer. On the problem data set (nos80k), it took 1575 s elapsed for the organizePolygons block, while it took 11.4 s for the subsequent block. For another data set (2240 features, 4433 parts, 534322 points, with most in one large polygon with a lot of holes), it took 16.2 s for the organizePolygons block and 0.1 s for the subsequent block (mideast). nos80k has islands in lakes, and these are seen with both branches. organizePolygons only seems to matter for the ponds-in-islands-in-lakes situation.

comment:9 by Even Rouault, 16 years ago

Resolution: fixed
Status: new → closed

I finally managed to find and download this famous huge polygon at http://coastalmap.marine.usgs.gov/GISdata/basemaps/coastlines/nos80k/nos80k.zip.

In trunk in r14764 and in branches/1.5 in r14765, I've added the possibility of defining OGR_ORGANIZE_POLYGONS=SKIP/ONLY_CCW to speed up (or skip) organizePolygons() in case the number of polygons is huge. ONLY_CCW only tries to classify counter-clockwise polygons; in the case of shapefiles, those are holes. On the test case, this is very efficient.
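The ring-orientation test that an ONLY_CCW-style mode relies on can be illustrated with the shoelace formula. This is a generic sketch, not GDAL's code; it assumes the shapefile convention that outer rings are stored clockwise and hole rings counter-clockwise (in a y-up coordinate system):

```python
def signed_area(ring):
    """Signed area of a ring via the shoelace formula.
    Positive for counter-clockwise rings, negative for clockwise ones.
    ring is a list of (x, y) vertex tuples, not closed."""
    s = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        s += x1 * y2 - x2 * y1
    return s / 2.0

def is_hole(ring):
    # Shapefile convention: outer rings are clockwise (negative signed
    # area), hole rings are counter-clockwise (positive signed area).
    return signed_area(ring) > 0.0
```

Since orientation is a per-ring O(vertices) computation, classifying holes this way avoids the pairwise containment tests entirely, which is why it is so much faster on a part count like 43688.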

So, I think I'm going to close the bug now. I really have no idea how I could further improve the situation.

By the way, the only software I've found that could display the polygon quickly was MapServer. QGIS crashes my X server. OpenEV takes huge amounts of memory. And a demo version of some proprietary software gets stuck opening it.

comment:10 by richengel, 16 years ago

I like the CCW-only approach, and tested it on nos80k, another large data set, and my simple 4-level polygon; all worked properly. Is there a way to make that the default behavior?

comment:11 by Even Rouault, 16 years ago

To make ONLY_CCW the default, just define OGR_ORGANIZE_POLYGONS=ONLY_CCW in the environment, or pass "--config OGR_ORGANIZE_POLYGONS ONLY_CCW" on the command line of any GDAL/OGR command-line utility, or use CPLSetConfigOption("OGR_ORGANIZE_POLYGONS", "ONLY_CCW") from a C/C++ application, or gdal.SetConfigOption from Python ;-)
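For example, the first two variants look like this (the shapefile name is just a placeholder):

```shell
# Environment variable, picked up by any GDAL/OGR utility:
export OGR_ORGANIZE_POLYGONS=ONLY_CCW
ogrinfo -al nos80k.shp

# Or per invocation, with the generic --config switch:
ogrinfo -al --config OGR_ORGANIZE_POLYGONS ONLY_CCW nos80k.shp
```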

I prefer not to make it the default, since the purpose of the organizePolygons() function is to deal with situations where parts are in random order and the notion of CW/CCW doesn't apply.

comment:12 by Even Rouault, 16 years ago

Note: the latest commits done for #1217 (which are in 1.5.3 and 1.6.0dev) make the ONLY_CCW mode the default for reading shapefiles. So there is no need to specify an esoteric configuration option now, and we should be back very close to pre-1.5 GDAL performance.

comment:13 by Even Rouault, 16 years ago

Milestone: 1.5.3 → 1.5.4

In r15517 in trunk, I've committed another performance fix for ONLY_CCW mode when a polygon has only one outer ring. This case is hit, for example, for shapefiles made from VMAP data.

I've backported it in r15519 in branches/1.5, but a bit too late for 1.5.3-RC1. So it will probably be in 1.5.4.
