Opened 15 years ago

Closed 11 years ago

#426 closed enhancement (fixed)

v.in.ogr: split long boundaries

Reported by: neteler
Owned by: grass-dev@…
Priority: major
Milestone: 6.5.0
Component: Vector
Version: unspecified
Keywords:
Cc:
CPU: All
Platform: All

Description

Moved here from http://trac.osgeo.org/grass/browser/grass/trunk/doc/vector/TODO#L242

Radim suggested:

It would be useful to split long boundaries in v.in.ogr to smaller pieces. Otherwise cleaning process can become very slow because bounding box of long boundaries can overlap large part of the map (for example outline around all areas) and cleaning process is checking intersection with all boundaries falling in the bounding box.

Attachments (1)

v.in.ogr.splitbndries.patch (11.4 KB ) - added by mmetz 15 years ago.
There was a nasty bug in the patch, patch replaced


Change History (13)

comment:1 by mmetz, 15 years ago

Missing boundary splitting is only one aspect that leads to complaints about v.in.ogr. I have added boundary splitting to my local copy of develbranch_6 and found sometimes substantial speed improvements (up to 6x as fast), sometimes no improvement at all (more or less the same, give or take a few seconds). It depends on properties of the vector to be imported that are independent of its size and of the number of features in it (see below). I have used ESRI Shapefiles for testing; they are the most commonly imported vectors and they contain only what is called "simple features", most importantly no topology.

I figured out four possibilities to improve (ESRI Shapefile) vector import with v.in.ogr:

1) a code comment says "TODO: is it necessary to build here? probably not, consumes time"

Indeed, a partial build with GV_BUILD_BASE is sufficient and results in a substantial speed improvement for large vectors.
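For illustration, a minimal sketch of what this looks like with the vector library API (trunk-style calls; the GRASS 6 variants take an additional message-stream argument, and the function name here is a placeholder of mine):

{{{
#include <grass/vector.h>

/* Sketch: instead of a full Vect_build() during import, build only the
 * base topology (primitives and spatial index), which is enough at this
 * stage and much cheaper for large vectors. */
static void build_base_only(struct Map_info *Out)
{
    /* a full build would be: Vect_build(Out); */
    Vect_build_partial(Out, GV_BUILD_BASE);
}
}}}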

2) when area cleaning is desired (no -c flag), support for the output vector is released, the vector is closed, reopened for update, and a partial build with GV_BUILD_BASE is done

This gives me an error with large vectors: the size of the coor file does not match the size given by topology (the topo file). I disabled that part of the code and now it works. There must have been a reason for that check, but I cannot think of one. Then again, I'm not that deep into GRASS vector processing, so please help me out. My argument is that v.in.ogr is far from finished with that vector, and it is safer to keep it open and keep working on it. For me, this is both a speed improvement and, most importantly, avoids import failures.
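A rough sketch of the close/reopen step described above (trunk-style API; out_name is a placeholder and error handling is omitted):

{{{
#include <grass/gis.h>
#include <grass/vector.h>

/* Sketch: close the output vector to flush it to disk, then reopen it
 * for update and rebuild only the base topology before cleaning. */
static void reopen_for_cleaning(struct Map_info *Out, const char *out_name)
{
    Vect_close(Out);                              /* writes coor and topo files */
    Vect_open_update(Out, out_name, G_mapset());  /* reopen in the current mapset */
    Vect_build_partial(Out, GV_BUILD_BASE);       /* base topology only */
}
}}}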

3) split boundaries when the imported vector has many areas (> 500)

I make this No 3) because No 2) is crucial (it avoids import failures) and because No 2) depends on No 1). Splitting boundaries is just another speed improvement, but probably a welcome one for many users because the gain can be considerable. With my proposed method, boundaries are split using a threshold on boundary length: whenever a new vertex is added to a boundary and the boundary length exceeds that threshold, the boundary is written out and a new boundary is started with the same cats, if given. The reasoning for determining the threshold is that a useful value is a function of the vector's areal extent and the number of areas. I propose to use map unit / ln(features), where the map unit is sqrt(area size), a reasonable scale for boundary length. For a threshold on the bounding box of a boundary I would use the area size directly (keeping units identical). Using ln(features) avoids creating tiny thresholds when very many features are in the vector to be imported. I would understand if you think this is nonsense, but it works, really! Both for a global map of the world with political boundaries in latlon and for a vector with watershed basins in UTM with a 150x150 km extent. Anyway, splitting boundaries will only happen if the -s flag is set (keeping compatibility with 6.4.0).
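To make the heuristic concrete, here is a rough sketch of the threshold and the split-on-write step (a sketch only, not the patch itself; names like split_threshold, extent_area and add_vertex_split are mine):

{{{
#include <math.h>
#include <grass/vector.h>

/* Sketch of the proposed heuristic: map unit = sqrt(extent area),
 * scaled down by ln(number of features). */
static double split_threshold(double extent_area, int n_features)
{
    if (n_features < 2)
        return sqrt(extent_area);
    return sqrt(extent_area) / log((double)n_features);
}

/* Append a vertex; if the boundary gets longer than the threshold,
 * write it out and start a new boundary from the last vertex,
 * keeping the same categories. */
static void add_vertex_split(struct Map_info *Out, struct line_pnts *Points,
                             struct line_cats *Cats, double x, double y,
                             double thresh)
{
    Vect_append_point(Points, x, y, 0.0);
    if (Points->n_points > 1 && Vect_line_length(Points) > thresh) {
        Vect_write_line(Out, GV_BOUNDARY, Points, Cats);
        Vect_reset_line(Points);
        Vect_append_point(Points, x, y, 0.0);  /* new piece starts here */
    }
}
}}}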

4) use a temporary vector (not my suggestion, but Radim Blazek's)

Do all the processing and cleaning in a temporary vector, then copy only alive lines to the output vector. In the case of ESRI Shapefile polygon import, this might reduce the size of the coor file by a factor of 2 (all boundaries used by GRASS are present twice in the shapefile). That would not only speed up further processing but also be safer. Thinking about it, this should really be No 1), because, as Radim Blazek suggested, every module should do that to keep the coor file small and speed up vector processing.
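A minimal sketch of the copy-only-alive-lines step, assuming a temporary vector Tmp that has been cleaned (with topology built) and an output vector Out opened with Vect_open_new():

{{{
#include <grass/vector.h>

/* Sketch: after cleaning in a temporary vector, copy only features that
 * are still alive into the final output vector, dropping the dead lines
 * that cleaning left behind in the coor file. */
static void copy_alive_lines(struct Map_info *Tmp, struct Map_info *Out)
{
    struct line_pnts *Points = Vect_new_line_struct();
    struct line_cats *Cats = Vect_new_cats_struct();
    int line, type, nlines = Vect_get_num_lines(Tmp);

    for (line = 1; line <= nlines; line++) {
        if (!Vect_line_alive(Tmp, line))   /* skip features deleted by cleaning */
            continue;
        type = Vect_read_line(Tmp, Points, Cats, line);
        Vect_write_line(Out, type, Points, Cats);
    }

    Vect_destroy_line_struct(Points);
    Vect_destroy_cats_struct(Cats);
}
}}}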

My ultimate testing shapefile that I referred to above has a size of 4.6 MB and 3421 polygons. At that size I would expect the import to take seconds. Then I noticed that the coor file created by GRASS is 4.5 GB = 4608 MB large (no typo). That is the size after reading in all boundaries, before cleaning. I'm not done with cleaning yet and am confident that the size will go over 5 GB; the maximum possible size should at least stay below 8 GB. If anybody out there gets as far as "Remove duplicates:" with the current v.in.ogr version in 6.4.0.RC3 or devbr_6, I will be ready to deliver a substantial prize. Let's say I pay for your next pizza delivery :-) I will send that shapefile on request.

This innocent-looking little shapefile has one polygon with thousands of islands; that polygon is responsible for the large coor file and the long time needed to import the shapefile. I will manage to import it (some more hours later, still busy) and thus prove that my suggested improvement No 2) does indeed make sense.

Markus M

in reply to:  1 ; comment:2 by cgsbob, 15 years ago

In Ticket #397 I reported that some large holes are being filled after going through v.clean. This seems to happen when this large hole is surrounded by a large and complex polygon. Do you think that your v.in.ogr might solve my problem?

in reply to:  2 ; comment:3 by mmetz, 15 years ago

Replying to cgsbob:

In Ticket #397 I reported that some large holes are being filled after going through v.clean. This seems to happen when this large hole is surrounded by a large and complex polygon. Do you think that your v.in.ogr might solve my problem?

"My" v.in.ogr is not any different from "your" v.in.ogr in that respect. You can use "v.in.ogr min_area=2800", but I don't think it's a good idea to export your vector and then re-import it again, you may well loose some information on the way.

in reply to:  3 comment:4 by cgsbob, 15 years ago

Replying to mmetz:

Replying to cgsbob:

In Ticket #397 I reported that some large holes are being filled after going through v.clean. This seems to happen when this large hole is surrounded by a large and complex polygon. Do you think that your v.in.ogr might solve my problem?

"My" v.in.ogr is not any different from "your" v.in.ogr in that respect. You can use "v.in.ogr min_area=2800", but I don't think it's a good idea to export your vector and then re-import it again, you may well loose some information on the way.

I agree, but as you can see in ticket #397, I have not been able to clean up my vector. I was just saying that if I did re-import my vector with v.in.ogr's boundary splitting and then used v.clean tool=rmarea, I might get something better than a filled large hole.

comment:5 by mmetz, 15 years ago

The attached patch is against devbr_6 r35761; it gives a bit of a speedup and adds safety checks.

I added boundary splitting, activated with a new -s flag; the speed increase ranges from none to 6x, depending on the import vector.

The patch uses a temporary vector; this can reduce the final coor file size by a factor of 2 to 5 if a shapefile with polygons is imported.

The vector libraries do not support large files (> 2 GB); therefore my complaint in No 2) above is not valid and the error message is legitimate.

Closing a vector and opening it again works like a file size limit check. The patch does that not only before cleaning polygons, as in the current v.in.ogr (this is a very nice idea; it would be annoying to spend hours cleaning a vector that was corrupt anyway...), it also checks file size limits just before copying the temporary vector to the final output vector. If that check passes, only alive features of the temporary vector are copied to the output vector, giving the file size reduction in case polygons were cleaned. New warning messages appear when checking file size limits (the check is done by the vector library when opening a vector; I didn't come up with a weird new solution :-)).

Please test!

Markus M

by mmetz, 15 years ago

Attachment: v.in.ogr.splitbndries.patch added

There was a nasty bug in the patch, patch replaced

comment:6 by mmetz, 15 years ago

After lots of testing and debugging with the help of Markus Neteler, I would like to add boundary splitting in trunk and maybe in devbr6, but want to ask first about how to implement it:

1) enable boundary splitting by default, disable with new flag. That's my favorite.

2) enable boundary splitting with new flag, don't split by default.

3) always split boundaries, no option to disable.

4) forget about it, leave v.in.ogr as it is.

If boundary splitting gets added to v.in.ogr, the man page would be updated with a hint to use v.build.polylines if boundary segments should be joined again.

With boundary splitting, v.in.ogr is roughly 2x (1.5x - 3x) faster on shapefiles with areas. That's not as much as I hoped for, but something.

Please vote :-)

Markus M

in reply to:  6 ; comment:7 by mlennert, 15 years ago

Replying to mmetz:

After lots of testing and debugging with the help of Markus Neteler, I would like to add boundary splitting in trunk and maybe in devbr6,

Great job !

but want to ask first about how to implement it:

Before being able to answer this, it would help to know what the impact of boundary splitting on other vector operations might be. Speeding up import is good, but if it causes any disruptions (or slow-downs) further on, it might be preferable to disable it by default. If not, I clearly vote for:

1) enable boundary splitting by default, disable with new flag. That's my favorite.

Moritz

in reply to:  7 comment:8 by mmetz, 15 years ago

Replying to mlennert:

Replying to mmetz:

want to ask first about how to implement it:

Before being able to answer this, it would help to know what the impact of boundary splitting on other vector operations might be. Speeding up import is good, but if it causes any disruptions (or slow-downs) further on, it might be preferable to disable it by default.

AFAICT there is one general trade-off: boundary splitting results in more boundaries -> more disk space needed and more memory needed, particularly for topology and the spatial index.

I did not thoroughly test for disruptions (or slow-downs) further on; so far everything works. I'm also not sure if the speed gain is lost when you have to run v.build.polylines afterwards. This new feature needs more testing with regard to its impact on other vector operations; therefore my favorite is 1), submitting to trunk only at first and seeing how it performs.

Markus M

in reply to:  6 ; comment:9 by hamish, 15 years ago

Replying to mmetz:

After lots of testing and debugging with the help of Markus Neteler, I would like to add boundary splitting in trunk and maybe in devbr6, but want to ask first about how to implement it:

1) enable boundary splitting by default, disable with new flag. That's my favorite.

it would be nice to keep a method to import exact data, even if splitting will be the default.

how would it be split? after a certain number of vertices of a polyline like v.split or every "n" map units in a grid like v.in.gshhs?

note that v.generalize currently requires running v.build.polylines first to get correct output. I don't know if that is a feature or a bug, but if breaking polylines were the default for v.in.ogr it could become a widespread subtle problem.

Hamish

in reply to:  9 comment:10 by mmetz, 15 years ago

Replying to hamish:

it would be nice to keep a method to import exact data, even if splitting will be the default.

Do you mean, don't split when polygons are not cleaned (-c flag)?

how would it be split? after a certain number of vertices of a polyline like v.split or every "n" map units in a grid like v.in.gshhs?

I'm currently using line length in map units. The threshold is guestimated from feature density in each layer. BTW, v.in.gshhs restricts bounding box dimensions, similar to every "n" map units in a grid, but better. Using line length is faster than using max bbox dimensions.

I do the splitting during OGRFeature import in order not to inflate the coor file size. Independent of splitting, after cleaning more than 50% of the coor file consists of dead lines; IOW, if you copied only alive lines to a final output vector you could reduce the coor file size by more than 50%. Splitting so early means that the final boundary segments are shorter than the threshold used, because both "break polygons" and "break lines" break lines at intersections. That's why I don't want to make the splitting threshold a user option: the resulting segment lengths would not match it, which would only invite complaints.
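A rough sketch of splitting during OGRFeature import, reusing the hypothetical add_vertex_split() helper from the sketch in comment 1 (OGR C API calls from GDAL/OGR; the actual patch code differs):

{{{
#include <ogr_api.h>
#include <grass/vector.h>

/* Sketch: split while reading an OGR ring's vertices, so a long boundary
 * never reaches the coor file in one piece. */
static void import_ring_split(OGRGeometryH ring, struct Map_info *Out,
                              struct line_pnts *Points, struct line_cats *Cats,
                              double thresh)
{
    int i, n = OGR_G_GetPointCount(ring);

    Vect_reset_line(Points);
    for (i = 0; i < n; i++)
        add_vertex_split(Out, Points, Cats,
                         OGR_G_GetX(ring, i), OGR_G_GetY(ring, i), thresh);

    if (Points->n_points > 1)
        Vect_write_line(Out, GV_BOUNDARY, Points, Cats);  /* last piece */
}
}}}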

The alternative would be to split after "break polygons" and "remove duplicates", just before "break boundaries". Boundaries would then have a maximum length equal to the threshold, but this further inflates the coor file size. The reason I don't want to inflate the coor file with even more dead lines is the missing LFS support in the vector libs...

note v.generalize currently requires to run v.build.polylines first to get correct output. I don't know if that is a feature or a bug. but if breaking polylines was the default for v.in.ogr it could become a widespread subtle problem for it.

Hmm, I think v.build.polylines should be run first anyway for v.generalize. The cleaning process in v.in.ogr usually generates boundary segments independent of whether boundaries are split.

Markus M

comment:11 by mmetz, 15 years ago

Fixed in trunk r37790.

Boundaries are split when cleaning and not split when cleaning is skipped; there is no extra splitting flag. Additionally, boundaries are merged again later on; splitting is only needed to make Vect_break_lines() faster. There is now a new cleaning tool in the vector libs, Vect_merge_lines(). The resulting vector should be topologically (c)leaner (fewer nodes, ready to use with e.g. v.generalize) than GRASS 6 imports.
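As an illustration of the call sequence only (a sketch, not the actual order of cleaning steps in v.in.ogr):

{{{
#include <grass/vector.h>

/* Sketch: split boundaries speed up Vect_break_lines(); afterwards
 * Vect_merge_lines() joins adjacent boundary segments again so the
 * result has fewer nodes. Map is an output vector with base topology. */
static void break_and_merge(struct Map_info *Map)
{
    int n_merged = 0;

    Vect_break_lines(Map, GV_BOUNDARY, NULL);            /* break at intersections */
    Vect_remove_duplicates(Map, GV_BOUNDARY, NULL);      /* drop duplicate segments */
    Vect_merge_lines(Map, GV_BOUNDARY, &n_merged, NULL);  /* rejoin segments between nodes */
}
}}}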

Also new is the use of a temporary vector, which reduces the final coor file size by a factor of 2 to 5 for area imports.

Despite the new boundary merging and the use of a temporary vector, area imports are considerably faster than with GRASS 6, please test!

Markus M

comment:12 by mmetz, 11 years ago

Resolution: fixed
Status: new → closed

Applied to all branches, closing ticket.
