Opened 2 years ago

Closed 19 months ago

#5268 closed defect (wontfix)

ST_ShortestLine return invalid linestring when distance of geometries is 0

Reported by: latot Owned by: pramsey
Priority: medium Milestone: PostGIS 3.4.0
Component: postgis Version: 3.3.x
Keywords: Cc: pramsey

Description

Hi all, I was using ST_ShortestLine, it wasn't until I was trying to extend I notice there was some linestrings that I wasn't able to extend, after checking it I notice this happens when two geometries touches or is one over the other, in this cases ST_ShortesLine return an invalid linestring:

` SELECT ST_AsEWKT(ST_3DShortestLine(line,pt)) AS shl3d_line_pt,

ST_AsEWKT(ST_ShortestLine(line,pt)) As shl2d_line_pt

FROM (SELECT 'POINT(0 0 1)'::geometry As pt,

'LINESTRING (0 0 0, 0 0 2)'::geometry As line

) As foo;

LINESTRING(0 0 1,0 0 1) LINESTRING(0 0,0 0)

SELECT ST_AsText( ST_ShortestLine(

'POINT (0 1)', 'LINESTRING (0 0, 0 2)') ) As sline;

LINESTRING(0 1,0 1) `

When two elements touches or belongs to each other, I think the behavior of ST_ShortestLine should be different, not return that weirds linestrings.

Options, return null (? technically, if the distance is 0, does not exists the shortest line.

Thx!

Change History (9)

comment:1 by robe, 2 years ago

Milestone: PostGIS 3.3.2PostGIS 3.4.0

This is a hard-call. It's not something we can change in a micro since technically it's not wrong, just invalid. AS to whether we should change it in master (upcoming 3.4.0) is another story.

We have other functions that return invalid geometries. So we don't have a policy of not returning invalid geometries.

e.g.

SELECT ST_AsText(ST_MakeLine('POINT(0 0 1)'::geometry, 'POINT(0 0 1)'::geometry));

and as I recall ST_Collect does too for multipolygons.

So if we say this function shouldn't return an invalid geometry, what other functions would we make the same argument for, or would we make it for all of them.

comment:2 by latot, 2 years ago

ufff, yeah, now giving it more thoughts, is a hard-call as you says, I start writing and there I notice, the post was big!

My first thought is, the return should not be invalid geometries, or at least, not mix them, is pretty non-intuitive to get weird things from the functions without an easy way to know this cases, this could be different if in the docs is written every case of the invalid geoms that can be inside.

Check this, I use a lot zero/null vectors in this post. https://www.cuemath.com/calculus/zero-vector/

A trivial start point, we use functions to achieve something, so, in order to do that we expect a clear way to do that, if we know the invalid geometries, and why is every one of them, we can work with them, and do the right treatment do every case to achieve what we want, but, have all mixed causes we need to think all from zero, for all invalid cases, and have luck to have all of them, practically, invalid geometries is the same as have multiples types of data mixed.

So, the point, if you think I'm going in a good direction, is how return in a clear way the results, lets pick as example the actual case, ST_ShortestLength.

For now we have two types of results, vectors and the null vectors (length 0), is a little weird if you think a null vector is a invalid geometry, while it is a valid geometry (in the math :)).

But moving to a more practical space, my opinion is the difference is in the properties, the null vector does not has the same properties as a non null vector, and we don't want to every one get too deph in maths, usually is enough if you what go inside the function, and know what can be the output.

We expect from a vector to have at least a start point, and end point, and both to be different or have length more than 0, so is good to know if there is something in the result that does not have this conditions.

Oks, maybe, "is too complex", and you are right, if the definitions are not set or not enugh, but I think can helps to clarify a lot of things and the interactions, is obvs how a non null vector works with ST_Buffer, but, a null vector, what would be the effect of ST_Buffer on it? is not trivial with null vectors, if ST_Buffer works over a linestring the buffer of a null vector is it self, with no expansion because there is no linestring to use or calc, someone my say, but there is the points! yeah!, but you are buffering a 2 dim element! (there is lines in 3d and more, but still the element is 2d).

Checking, this can be a invalid geometry too, buffer should do nothing on it: ` SELECT ST_Buffer(

ST_GeomFromText(

'LINESTRING(50 50,50 50)'

), 10);

`

Someone still can says, check the docs of ST_Buffer: ` Computes a POLYGON or MULTIPOLYGON that represents all points whose distance from a geometry/geography is less than or equal to a given distance. `

And again, with a null vector, there is no distance to calculate from the vector it self :) We use points to represent linestrings, but that points are not the linestring.

All this interpretations depends on the expected properties and definitions of the geometries.

As I have written, there is 3 things, work with standard geometries, non-standar geometries (but valids, like null vectors), and the last one, is get better definitions for "standard" with the expected properties and "valid"/"invalid" geometries.

My recommendation in the properties issue, is try to keep them separate when the behavior of them is too different, like the non-null vector, and null vectors. If this separations are done correctly, will be more easy to handle every case and the behaviors between them without need to force it and keep it clear.

Maybe I put too much thoughts for one post :3

I agree, would be very hard to check one by one :)

But step by step is possible to improve.

in reply to:  2 comment:3 by robe, 2 years ago

Replying to latot:

A trivial start point, we use functions to achieve something, so, in order to do that we expect a clear way to do that, if we know the invalid geometries, and why is every one of them, we can work with them, and do the right treatment do every case to achieve what we want, but, have all mixed causes we need to think all from zero, for all invalid cases, and have luck to have all of them, practically, invalid geometries is the same as have multiples types of data mixed.

So, the point, if you think I'm going in a good direction, is how return in a clear way the results, lets pick as example the actual case, ST_ShortestLength.

For now we have two types of results, vectors and the null vectors (length 0), is a little weird if you think a null vector is a invalid geometry, while it is a valid geometry (in the math :)).

But moving to a more practical space, my opinion is the difference is in the properties, the null vector does not has the same properties as a non null vector, and we don't want to every one get too deph in maths, usually is enough if you what go inside the function, and know what can be the output.

We expect from a vector to have at least a start point, and end point, and both to be different or have length more than 0, so is good to know if there is something in the result that does not have this conditions.

I agree we should document this behavior as it's not intuitive. But neither to me is the assumption that they should be different points. In my mind it would return what it does now or just the single point.

The reason is the ST_ShortestLine is a line with the shortest distance. Both a single point and a invalid linestring having two points the same have a length of 0. So it is the shortest distance. A null would return a null which isn't technically right because that would mean we don't know the distance, but we do, it's 0.

Oks, maybe, "is too complex", and you are right, if the definitions are not set or not enugh, but I think can helps to clarify a lot of things and the interactions, is obvs how a non null vector works with ST_Buffer, but, a null vector, what would be the effect of ST_Buffer on it? is not trivial with null vectors, if ST_Buffer works over a linestring the buffer of a null vector is it self, with no expansion because there is no linestring to use or calc, someone my say, but there is the points! yeah!, but you are buffering a 2 dim element! (there is lines in 3d and more, but still the element is 2d).

Checking, this can be a invalid geometry too, buffer should do nothing on it: ` SELECT ST_Buffer(

ST_GeomFromText(

'LINESTRING(50 50,50 50)'

), 10);

`

Someone still can says, check the docs of ST_Buffer: ` Computes a POLYGON or MULTIPOLYGON that represents all points whose distance from a geometry/geography is less than or equal to a given distance. `

Here again the solution of the shortest line makes perfect sense. The shortest line is a line that doesn't take you anywhere aka the same point, and thus the result of buffer is equivalent to the buffer of the single point, an approximation of a circle.

A very long time ago ST_ClosestPoint didn't exist. ST_ShortestLine came first. And how ST_ClosestPoint was formed was by taking the second point of the ST_ShortestLine. I forget if that's still the case. But again the invalid linestring fits the answer still. You take one of the points and that is the closest point. It just happens to be the same point on both objects cause they intersect there.

And again, with a null vector, there is no distance to calculate from the vector it self :) We use points to represent linestrings, but that points are not the linestring.

All this interpretations depends on the expected properties and definitions of the geometries.

My expected definition is that all operations I apply on it still make sense and an invalid geometry is fine if it still makes sense when I apply other operations to it. a shortest line still returns a line, just an invalid line. So it still gives me a line back instead of a point, which would have been my alternative answer.

I'm not quite sure what you mean by a NULL vector. There is only an empty geometry (which is not null) and the SQL NULL. Neither seems right to me in this case, because both would fail the assumption I have that the closest point between two geometries should be the points that define the shortest line.

NULL should only be used if the answer is truly unknown. In this case, it's known, but a sometimes inconvenient answer cause it messes up your other processes. Then again returning NULL would mess up someone else's processes who is expecting a linestring, like mine e.g. it would make it difficult to distinguish an unsolvable problem from a problem with an inconvenient answer.

The only reason to dislike invalids is because they crash your code or just make it behave in ways you don't like.

1) Some people never want an invalid geometry answer. This is actually something my friend Paul Norman mentioned, and he has to filter these out of his OpenStreetMap processing and it just slows the whole process down.

2) Sandro (strk) had mentioned maybe we should have some sort of global GUC variable that one can set to say "When the answer is invalid, give me NULL back"

But GUCs are pesky things in their own right and require you to know they exist

3) The third option proposed, might have been Paul Norman again that proposed it as he was only bothered by one function, which for the life of me I can't remember. Is to add an additional argument to the function to denote the desired behavior.

We settled on such a thing with ST_MakeValid - https://postgis.net/docs/ST_MakeValid.html

Benefit of 3:

is that it doesn't break code expecting the old behavior, but allows for an alternative behavior.

Drawbacks a) The downside of such an approach is that it requires us to rip out the old function and replace with a new one that has default args, which becomes an unpleasant issue for upgrade if people built views and materialized views against the function

or

b) We have to add yet another function, which increases our technical debt and footprint.

As I have written, there is 3 things, work with standard geometries, non-standar geometries (but valids, like null vectors), and the last one, is get better definitions for "standard" with the expected properties and "valid"/"invalid" geometries.

My recommendation in the properties issue, is try to keep them separate when the behavior of them is too different, like the non-null vector, and null vectors. If this separations are done correctly, will be more easy to handle every case and the behaviors between them without need to force it and keep it clear.

Maybe I put too much thoughts for one post :3

I agree, would be very hard to check one by one :)

But step by step is possible to improve.

Yap we should definitely document these irregularities and flag them as warnings and unexpected answers. That would be a good start. As to what to do about them as we see above gets into a philosophical discussion that may never end.

Last edited 2 years ago by robe (previous) (diff)

comment:4 by robe, 2 years ago

Cc: pramsey added

@latot

Forgot to mention — somewhat related to this. Paul Ramsey has been looking at s2. I'm not sure you saw his comment on PostGIS irc channel

'm actually working on postgis_s2 right now
11:19AM
In reply to 
R
Regina Obe
pramsey
I'm actually working on postgis_s2 right now
I see neither of the earlier attempts got very far... I got quite a ways on a non-postgis implementation, but in the end to be useful a binding needs a full set of real geometric types, the pgsql native ones just don't hack it. Also missing is a good way to do indexing... you end up either inventing something much like what is already in postgis geography or something similar using Caps. Ironically/oddly it's well nigh impossible (I've found) to leverage the S2 cell id into something useful for general purpose index assistance in the database. You end up needing to handball it a lot for many cases.

one "issue" with s2, which is or isn't an issue depending on your perspective: it's truly a spherical library. not a geodetic library. so if you ask for distance, you get the spherical distance, not the spheroidal distance. if you do constructive geometry, you get the spherical construction

maybe I should push my non-postgis implementation to github. it's cute, I'm proud of the bindings 🙂

@pramsey how's that pushing coming along?

comment:5 by mdavis, 2 years ago

To my mind returning a zero-length LineString makes perfect sense in this situation. So the real problem is whether zero-length LineStrings should be invalid. The Simple Features spec appears to mandate this - but perhaps it's not entirely clear, or is simply not well-thought out in light of practical usage.

comment:6 by latot, 2 years ago

Hi!

Sorry for the confusion, I should had used the same nomenclature everywhere instead of mix them, zero-vectors, some times are called null vectors too, is not a SQL null, is a Math null.

If Simple Features spec already has a definition for valid/invalid features, I think is fine follow it. But (always a but), if things like zero-vectors are considers invalid, I really, really really prefer postgis to return them, there is analysis where we need that invalid geometries, and the type of geoemtry, so return them as null is not great at all.

The invalid features, are not always useful by it self, but is very useful recognize them, usually they need other algorithm to they do what we want them to do.

Maybe, instead go to too deep definitions, we can explore in the functions, if there is edge cases write a base catalogue, and how their properties can affect the calculations in the function, like a start point, the zero-vectors and non-zero-vectors.

I don't like the idea of send a param to do something with non-valid geometries, the reason, the treatment to the invalid geometries depends of what we want to achieve, so, is like there will be no standard for it. I think the better is a way to just recognize them, and let the ppl do what they want to do with them.

I'll just imagine…, maybe there is a better way, the example is inspired in a real work I did some time ago, there is always several ways to do it, but the next query is pretty easy to read, and know the type helps to work better! (just a illustrative example :)):

`

SELECT

CASE

WHEN stl.type = 'zero-vector' THEN

ST_ShortestLine(ST_OffsetCurve(table.geom1, 1), table.geom2)

ELSE stl.geometry

FROM table LATERAL (ST_ShortestLine(table.geom1, table.geom2)) stl — Two columns, geometry, type

`

Personally, would be great something like this, maybe a param to trigger this return type? Can handle one by one, really simplifies everything, yeah more lines, and less errors.

Thx!.

Version 0, edited 2 years ago by latot (next)

in reply to:  6 comment:7 by robe, 2 years ago

Replying to latot:

The invalid features, are not always useful by it self, but is very useful recognize them, usually they need other algorithm to they do what we want them to do.

Maybe, instead go to too deep definitions, we can explore in the functions, if there is edge cases write a base catalogue, and how their properties can affect the calculations in the function, like a start point, the zero-vectors and non-zero-vectors.

I don't like the idea of send a param to do something with non-valid geometries, the reason, the treatment to the invalid geometries depends of what we want to achieve, so, is like there will be no standard for it. I think the better is a way to just recognize them, and let the ppl do what they want to do with them.

I'll just imagine…, maybe there is a better way, the example is inspired in a real work I did some time ago, there is always several ways to do it, but the next query is pretty easy to read, and know the type helps to work better! (just a illustrative example :)):

`

SELECT

CASE

WHEN stl.type = 'zero-vector' THEN

ST_ShortestLine(ST_OffsetCurve(table.geom1, 1), table.geom2)

ELSE stl.geometry

FROM table LATERAL (ST_ShortestLine(table.geom1, table.geom2)) stl — Two columns, geometry, type

`

Personally, would be great something like this, maybe a param to trigger this return type? Can handle one by one, really simplifies everything, yeah more lines, and less errors.

Thx!.

Edited:

Checking this more, part of this can be achieved with something like ST_IsNull, but, I have two concerns

One of them is if there is done too much ST_IsProperty_Name, check every one of them can be harder, so I think would be great has other way to know the properties of the vector.

Second is performance, know the properties is good, someone may say that some times is not necessary and is true, but there is some times when we need it, and, there is the case you think is not necessary just you wasn't able to consider all the edge cases, and is right. So a way to can know this without affect too much would be great, I'm thinking in the case written above, where ST_MakeValid has impact, but we don't need it all the time, all is case by case.

One thing Paul Norman was complaining about is that CASE statements are relatively slow. So yah his code has a lot of case to handle invalid geometries, but would be better if maybe it was handled in the function, the performance would be better than resorting to a CASE statement. In theory anyway.

comment:8 by latot, 2 years ago

That is a good point, there is several things, first, maybe in the Paul Norman case, remove or make them valid is way for what he want, but is not all the cases, so, try to introduce new features there, with params and custom functions, can end with too complex features, for devs and the ones who writes the queries, maybe, is better just accept the CASE price.

Due to this two reasons I think handle this from the params is not the way, at least to keep the flexibility and simplicity.

Maybe for that particular case of transform them to valid from a param, is possible to make them, but will not solve the problem.

I think the point is how to catalogue the data in a efficient way, and for every type do the proper thing, that would be the challenge,. sadly I don't have the enough knowledge of postgres for good ideas (when they can be impossible to do).

Ideally, can create groups, a function where we can split the results in the types we want, split by properties, with the groups done, some way (that I don't know how to do it) apply something to every group. Due to this "function that groups and categorize" is handle from Postgis we can keep the efficiency.

I think the "valid geometry" definition can be overwhelming, too technical math, is not bad, but some times, too hard to conceive if we want them or not, too easy to mislead something, or apply a bad analysis, the definition is good, but not for all ppl.

Example, some times in roads we don't have the height, that causes to represent some roads with self intersecting, is a valid geometry (road) for the case? Yes. Is a valid geometry by SF? No, I want to keep them as invalid?, yes, make it valid and create a node will create a non-representative road.

comment:9 by pramsey, 19 months ago

Resolution: wontfix
Status: newclosed

A zero-length linestring is a reasonable return, and having the return type be more homogeneous (not a point, not a null) makes processing the output simpler. Just check the length before you attempt to extend it.

Note: See TracTickets for help on using tickets.