#2428 closed defect (fixed)
shapefile crash on huge polygon
Reported by: richengel
Owned by: Even Rouault
Priority: normal
Milestone: 1.5.4
Component: OGR_SF
Version: 1.5.1
Severity: normal
Keywords: shape
Cc:
Description
Attempting to read shapefile crashes in OGRGeometryFactory::organizePolygons for a particular shapefile. The shapefile has 1 polygon with 2514199 vertices and 43688 parts, according to shpdump. Get SIGABRT from new[] on line #806 in ogrgeometryfactory.cpp. Here i = 23246, so it apparently exhausted available memory in building the relations table. Testing done with a 32 bit version.
Change History (13)
comment:1 by , 16 years ago
Owner: changed
comment:2 by , 16 years ago
I'm curious to see what such a huge polygon looks like: many islands, many holes in a big polygon? Would it be possible to make it available for download?
comment:3 by , 16 years ago
The feature is probably better characterized as a multipolygon. One large polygon consists of the boundary of the 48 contiguous US states. There are a few lakes and a large number of islands near the shore. The handling of the Great Lakes is somewhat strange. The border there is a sort of loose-fitting partial convex hull that includes the lakes entirely inside the big polygon. There are islands in the Great Lakes, but none seem to have their own lakes. I attempted to attach a gzipped tar file, but it was too big (27 M). Perhaps an FTP site would help.
comment:4 by , 16 years ago
Yes, 27MB is too big for attachments in a ticket here. I would definitely be interested in having your file as a stress test for my attempts at improving the algorithm. If you can't upload it to an FTP server, maybe you can try to send it to this yahoo email account: even dot rouault at yahoo dot fr (no guarantee that it accepts big attachments).
My idea for the fix is in two steps:
- improve the way the current algorithm works so that it uses only O(N) memory instead of O(N*N). However, the computational complexity will still be O(N*N). For N=46388, I'm afraid that will be painfully slow.
- add a simplified version of the algorithm that relies heavily on the order of the parts. If the following constraints are respected, we can hope for linear time. Let's define B <= A as meaning B is enclosed in A (i.e. B is a lake of A). The constraints are:
(1) An enclosing polygon A must be placed in the list before its enclosed polygons B1, B2...
(2) An enclosed polygon B of a polygon A cannot come after an unrelated polygon C.
(3) The maximum enclosing depth level is 1, that is to say there are no situations like C <= B <= A.
A bit of ASCII art to make things clearer :
    XXXXXXXXXXXXXXXXXXXX    XXXXXXXXXXXXXXXXXXXX
    X                  X    X                  X
    X    A             X    X    C             X
    X                  X    X                  X
    X  XXXXXX  XXXXXX  X    X  XXXXXX  XXXXXX  X
    X  X    X  X    X  X    X  X    X  X    X  X
    X  X B1 X  X B2 X  X    X  X D1 X  X D2 X  X
    X  X    X  X    X  X    X  X    X  X    X  X
    X  XXXXXX  XXXXXX  X    X  XXXXXX  XXXXXX  X
    X                  X    X                  X
    XXXXXXXXXXXXXXXXXXXX    XXXXXXXXXXXXXXXXXXXX
The acceptable inputs for the simplified version of the algorithm are: (A,B1,B2,C,D1,D2), (A,B2,B1,C,D1,D2), (A,B1,B2,C,D2,D1), (A,B2,B1,C,D2,D1), (C,D1,D2,A,B1,B2), (C,D1,D2,A,B2,B1), (C,D2,D1,A,B1,B2) or (C,D2,D1,A,B2,B1). And that's all, I think!
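As an illustration of the simplified linear pass, here is a minimal, self-contained sketch. This is not GDAL's actual implementation: the Ring struct, the contains() helper, and the bounding-box containment test are hypothetical stand-ins for real geometry tests. Under constraints (1) and (2), a single left-to-right scan can attach each hole to the most recent outer ring:

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a polygon ring: just a bounding box and a label.
struct Ring {
    std::string name;
    double xmin, ymin, xmax, ymax;
};

// Illustrative containment test (bounding boxes only, not real geometry).
static bool contains(const Ring& outer, const Ring& inner) {
    return outer.xmin <= inner.xmin && outer.ymin <= inner.ymin &&
           outer.xmax >= inner.xmax && outer.ymax >= inner.ymax;
}

// Linear pass relying on constraint (1) (an outer ring precedes its holes)
// and (2) (its holes follow it without interruption).
// Returns (outer index, hole index) pairs.
std::vector<std::pair<int, int>> assignHoles(const std::vector<Ring>& rings) {
    std::vector<std::pair<int, int>> links;
    int currentOuter = -1;
    for (int i = 0; i < static_cast<int>(rings.size()); ++i) {
        if (currentOuter >= 0 && contains(rings[currentOuter], rings[i]))
            links.push_back({currentOuter, i});  // a hole of the current outer
        else
            currentOuter = i;                    // a new top-level polygon
    }
    return links;
}
```

On the input order (A,B1,B2,C,D1,D2), this attaches B1 and B2 to A and D1 and D2 to C in one pass, without ever comparing B-rings against C.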
From your description of what your multipolygon looks like (if I've understood correctly, which I'm unsure about regarding the Great Lakes), I'm afraid we cannot rely on (3), sadly... (3) is a necessary condition to guarantee linear performance, but we could probably live without it at the expense of more computation. However, (1) and (2) are absolutely necessary. I'd like to be sure that your test case at least respects (1) and (2) before trying to implement that.
comment:5 by , 16 years ago
In trunk, in r14727, I've committed an improved version of the current algorithm that only consumes O(N) bytes of memory. Speed is probably also improved by a pre-sort by area.
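The pre-sort idea can be sketched as follows. This is a hedged illustration, not the r14727 code: Part, area(), and the bounding-box encloses() predicate are hypothetical simplifications of the real point-in-ring tests. Sorting by decreasing area means a part's parent always precedes it in scan order, so one parent index per part suffices: O(N) memory, though still O(N*N) comparisons in the worst case:

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a polygon part: just a bounding box.
struct Part {
    double xmin, ymin, xmax, ymax;
    double area() const { return (xmax - xmin) * (ymax - ymin); }
};

// Illustrative enclosure test (bounding boxes only, not real geometry).
static bool encloses(const Part& a, const Part& b) {
    return a.xmin <= b.xmin && a.ymin <= b.ymin &&
           a.xmax >= b.xmax && a.ymax >= b.ymax;
}

// For each part, find its immediate (tightest) enclosing part, keeping only
// one parent index per part instead of an N*N relation table.
std::vector<int> findParents(const std::vector<Part>& parts) {
    const int n = static_cast<int>(parts.size());
    std::vector<int> order(n);
    std::iota(order.begin(), order.end(), 0);
    // Pre-sort by decreasing area: a part can only be enclosed by a larger one.
    std::sort(order.begin(), order.end(), [&](int a, int b) {
        return parts[a].area() > parts[b].area();
    });
    std::vector<int> parent(n, -1);  // O(N) extra memory
    for (int i = 1; i < n; ++i) {
        // Scan larger parts starting from the nearest in area, so the first
        // hit is the tightest enclosing part.
        for (int j = i - 1; j >= 0; --j) {
            if (encloses(parts[order[j]], parts[order[i]])) {
                parent[order[i]] = order[j];
                break;
            }
        }
    }
    return parent;
}
```

Because the parent chain also records depth, a result like island -> lake -> mainland falls out naturally, which is the nesting the nos80k data exercises.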
comment:6 by , 16 years ago
Thanks, I'll test the fix with my data. The file was too big for my email to send, and I haven't found a suitable public FTP drop site. Your reading of my comments on the Great Lakes is correct: we have 2 levels of nesting there.
comment:7 by , 16 years ago
Component: default → OGR_SF
Keywords: shape added
Milestone: → 1.5.3
Fix backported in branches/1.5 in r14734
comment:8 by , 16 years ago
I tested on my problem data and this version now works. A couple of comments. Wasn't the purpose of organizePolygons to handle multiple levels of nesting?
I did some timing tests on a couple of data sets. I conditionally disabled the organizePolygons block in favor of the subsequent block:

    /* OGRGeometryFactory::organizePolygons does not require GEOS anymore */
    else if( CSLTestBoolean(CPLGetConfigOption( "OGR_FULL_POLY", "NO" ) ) )
My test reads all the features in a layer. On the problem data set (nos80k), it took 1575 s elapsed for the organizePolygons block versus 11.4 s for the subsequent block. For another data set (2240 features, 4433 parts, 534322 points, most in one large polygon with a lot of holes), it took 16.2 s for the organizePolygons block and 0.1 s for the subsequent block (mideast). nos80k has islands in lakes, and these come out correctly with both branches; organizePolygons only seems to matter for the ponds-in-islands-in-lakes situation.
comment:9 by , 16 years ago
Resolution: → fixed
Status: new → closed
I finally managed to find and download this famous huge polygon at http://coastalmap.marine.usgs.gov/GISdata/basemaps/coastlines/nos80k/nos80k.zip.
In trunk in r14764 and in branches/1.5 in r14765, I've added the possibility of defining OGR_ORGANIZE_POLYGONS=SKIP/ONLY_CCW to speed up (or skip) organizePolygons() when the number of polygons is huge. ONLY_CCW only tries to classify counter-clockwise polygons; in the case of shapefiles, those are the holes. On the test case, this is very efficient.
So, I think I'm going to close the bug now. I really have no idea how I could improve the situation further.
By the way, the only software I've found that could display the polygon quickly was MapServer. QGIS crashes my X server. OpenEV takes huge amounts of memory. And a demo version of some proprietary software gets stuck opening it.
comment:10 by , 16 years ago
I like the CCW-only approach, and tested it on nos80k, another large data set, and my simple 4-level polygon; all worked properly. Is there a way to make that the default behavior?
comment:11 by , 16 years ago
To make ONLY_CCW the default, just define OGR_ORGANIZE_POLYGONS=ONLY_CCW in the environment, or pass "--config OGR_ORGANIZE_POLYGONS ONLY_CCW" on the command line of any GDAL/OGR command line utility, or use CPLSetConfigOption("OGR_ORGANIZE_POLYGONS", "ONLY_CCW") from a C/C++ application, or gdal.SetConfigOption from Python ;-)
I prefer not to make it the default, since the purpose of the organizePolygons() function is to deal with situations where parts are in random order and the notion of CW/CCW doesn't apply.
comment:12 by , 16 years ago
Note: the latest commits done for #1217 (which are in 1.5.3 and 1.6.0dev) make the ONLY_CCW mode the default for reading shapefiles. So there is no need to specify an esoteric configuration option now, and we should be back very close to GDAL < 1.5 performance.
comment:13 by , 16 years ago
Milestone: 1.5.3 → 1.5.4
I'll try to address that. It needs a bit of thinking to see how we can avoid the O(N*N) memory consumption, where N is the number of parts... (The number of vertices has no influence, except perhaps on speed.) I think we'll have to switch back to a more simplistic (thus approximative) algorithm when N > 100, for example.