Opened 9 years ago

Closed 9 years ago

#3023 closed enhancement (fixed)

Create ST_ClusterIntersecting / ST_ClusterWithin Functions

Reported by: dbaston Owned by: pramsey
Priority: medium Milestone: PostGIS 2.2.0
Component: postgis Version: master
Keywords: Cc: pramsey

Description

Create clustering functions described at http://trac.osgeo.org/postgis/wiki/DevClusteringFunctions

Change History (12)

comment:1 by robe, 9 years ago

Milestone: PostGIS 2.2.0

comment:2 by robe, 9 years ago

dbaston — I recall you said you had a patch written for this already? If so can you attach it to this ticket.

Details here:

http://postgis.net/development (look at bottom of page on Providing Patches)

comment:3 by dbaston, 9 years ago

I have code for ST_ClusterIntersecting….I'm in the process of migrating it into the PostGIS source tree (it was a standalone module). I'm planning to post that when it's ready, and use any feedback to develop ST_ClusterWithin.

comment:4 by dbaston, 9 years ago

I've posted an implementation of ST_ClusterIntersecting as a PR on github: https://github.com/postgis/postgis/pull/31

I just ran some unscientific benchmarks on my netbook, and I'm surprised by how good the performance is. Using some TIGER water lines ftp://ftp2.census.gov/geo/tiger/TIGER2014/LINEARWATER/tl_2014_50017_linearwater.zip, I'm getting clusters in 2-3 seconds, compared to 400-600 seconds using the two methods posted at http://gis.stackexchange.com/q/94203. Some of the lines that look like they _should_ intersect don't according to ST_Intersects(), so I actually had to buffer these inputs to get the result (watersheds) that I was after. Shows a need for ST_ClusterWithin…

Some things I am unsure of:

  • I'm always returning an array of GeometryCollections, instead of trying to figure out the most restrictive data type (for example, with LineString inputs, I could return an array of MultiLineStrings.) Always returning a GC makes the code simpler and the user always knows what to expect; on the other hand, this may require the user to convert the GCs into another type for downstream processing.
  • Adding clusterintersecting() to lwgeom_geos.h seemed a bit out of place?
  • I don't see the GEOS STRTree used elsewhere in the project, should I be accomplishing this some other way?

comment:5 by pramsey, 9 years ago

Cc: pramsey added

comment:6 by dbaston, 9 years ago

Discussed on IRC….nothing wrong with using GEOS STRTree per se. Aggregates can cause headaches for upgrades, since they can't be dropped if there are views that depend on them. Clustering functions could be changed to be be standard functions that accept geometry[] as an input type, if PSC wants to avoid creating new aggregates. User would call ST_ClusterIntersecting(ST_Accum(geom)) or similar.

comment:7 by robe, 9 years ago

dbaston,

Even though aggregates can be pains to upgrade, they are only pains if people use them in views or SQL functions (in sql functions, I'm not absolutely sure they are checked, though I suspect they are). That said if you think this function will never be embedded in a view, I wouldn't worry too much about it.

Have the array option is good anything though because most aggregate functions we implement have both an array version and an aggregate version. Take a look at ST_Union for example. http://postgis.net/docs/manual-2.1/ST_Union.html So we could start off with an array version and then create an aggregate later.

Two benefits I see with having an aggregate version 1) the outward facing use, in many cases will be easier for the end user to encorporate

2) You can leverage the state transitioning machinery of PostgreSQL aggregation like ST_Union, ST_Collect (and various raster aggregates do) to better handle a larger number of geometries. (this part I'm not sure about but pramsey, strk, or dustymugs would have a better clue if that applies here)

comment:8 by dbaston, 9 years ago

Hi robe,

I agree, and I think the simplicity benefits of defining aggregates is worth it in this case. I like the idea of also providing standard functions that accept geometry[], so I've added those into this version.

I also updated the pull request this evening to include ST_ClusterWithin(). One aspect I hadn't anticipated was the need to update pgis_abs to carry along a Datum (like distance tolerance in this case) to the finalizing function. Let me know if there is a better way to do this. (I couldn't find anything except maybe some new syntax for 9.4).

comment:9 by dbaston, 9 years ago

Owner: changed from dbaston to pramsey

Sending back to @pramsey to update milestone if needed. I can re-sync the pull request w/trunk when needed.

comment:10 by dbaston, 9 years ago

Any chance this can see some review before the 2.2 freeze? From the questions I see on Stack Exchange, these functions seem to have quite a few use cases.

comment:11 by dbaston, 9 years ago

Re-synced pull request w/trunk, updated to handle empty geometries.

comment:12 by pramsey, 9 years ago

Resolution: fixed
Status: newclosed

Wow, amazing patch, dropped on trunk at r13752, many thanks dbaston!

Note: See TracTickets for help on using tickets.