Opened 11 years ago

Closed 11 years ago

Last modified 11 years ago

#231 closed enhancement (fixed)

New distance-calculations proposal

Reported by: nicklas Owned by: robe
Priority: medium Milestone: PostGIS 1.5.0
Component: postgis Version:
Keywords: Cc:

Description (last modified by robe)

A hopefully more structured ticket than #137

The patch attached includes:
new functions:

  1. st_shortestline
  2. st_longestline
  3. st_dfullywithin
  4. st_max_distance
    (this is really a fix to a long standing bug -- st_max_distance has existed since at least 1.2 but has never worked - (older versions give Not yet implemented))
  5. st_closestpoint
  6. st_furthestpoint

enhancements affecting st_distance, st_dwithin and the functions above:

  1. a better handling of "holes" in polygons.
  2. a sudden stop in iteration when the distance in st_dwithin is satisfied. Before it was checked for every subgeom.
  3. a new way of calculating distances between lines and polygons that don't intersect.


Probably the last one is the most important. On big geometries it makes a big difference.

Beside the patch I have attached a document where I try to explain the idea behind the distance-calculation.

/Nicklas

Attachments (4)

dist.doc (69.5 KB) - added by nicklas 11 years ago.
explanation of the distance calculation
distance.patch (82 bytes) - added by nicklas 11 years ago.
subgeomhandling.patch (65 bytes) - added by nicklas 11 years ago.
now included in spike/nicklas/distcalc
dist.odt (57.2 KB) - added by nicklas 11 years ago.
same doc with some typos removed and in odt :-)

Download all attachments as: .zip

Change History (44)

Changed 11 years ago by nicklas

Attachment: dist.doc added

explanation of the distance calculation

comment:1 Changed 11 years ago by robe

Nick,

Does this particular patch include the fix for ST_Max_Distance as well (i saw it in the patch you sent). I know you had it in a separate patch and I said to hold it separately since it was a fix to a long standing bug and not an enhancement. Though not sure how everyone feels about back-porting that particular fix to 1.4.1 since 1.4 and below has the function but gives annoying message "This function is unimplemented yet" so technically its not an API change -- its a bug fix.

Thanks, Regina

comment:2 Changed 11 years ago by nicklas

Yes, you are right, it includes st_max_distance too. Maybe you can put it in the description. I don't think I can. /Nicklas

comment:3 Changed 11 years ago by robe

Description: modified (diff)

comment:4 Changed 11 years ago by robe

Owner: changed from pramsey to robe
Status: newassigned

Nicklas,

I did some tests with this against my real data and it looks really good.

For example I did a test on the ST_DWithin for my 190,000 data set of parcels against my 15 boston neighborhoods and the answers from yours are the same as what I get with a PostGIS 1.4 install, but yours appears to be faster:

I ran the below query SELECT COUNT(*) As tot, n.name FROM neighborhoods As n INNER JOIN landparcels As l ON

ST_DWithin(n.the_geom, l.the_geom,0.001) GROUP BY n.name ORDER BY n.name

with my PostGIS 1.4 install -- takes --385719ms - 1.4 to run this query - with my PostGIS 1.5SVN patched with your distance takes -- -- 249859 ms

I also did some random samples of ST_Distance between each neighborhood and a given landparcel and the numbers agreed down to as many decimals as displayed (around 9 decimal places).

So from a cursory user standpoint, this patch looks really good. I'll try to do more extensive testings and test out the ST_Max_Distance. Since I don't have a regress that since its broken in older releases and new functions, I'll probably have to pull out a pen and paper for those or some plpgsql equivalent to test against it.

Mark,Paul, Kevin -- can you please look at Nicklas patch. I was afraid to look at the code since I'm not a good judge of c-programming, but if you think it looks fine and no glaring errors, I vote to accept it into 1.5 and consider back-porting (the speed and st_max_distance fix) to 1.4.

Thanks, Regina

comment:5 Changed 11 years ago by nicklas

Great

One thing to check is if we should use the old calculations for smaller geometries. I guess maybe my function gives more overhead-cost than gain. It also would be very easy to put in a rule where for instance npoints og geom1 * npoints of geom2 is more than 500 or something like that otherwise it just uses the old one. The real gain is big geometries far from eachother. Then I have had cases where the querytime reduses from 220000ms to jus about 2000 ms. Try the queries at your slide 19 you mentioned, Regina. I think the querytime is reduced by approx 90% at least eat some of them :-)

Now I'm working on a proposal for better handling of subgeoms. There is two good things about it. First, of course it will reduce querytime when there is many subgeoms involved in the geometries. st_distance(alaska, texas) in Reginas dataset takes about 7 seconds with my patch, but with this subgeoms handling it is under 1 sec.

But the other nice thing about it, is that it's usable innearest neighbour queries. just collect all geometries and get the closest point with st_shortestline. Then st_shortestline first arrange all incomming geometries like it would have been just any subgeoms. find the closest one and returns a line to the closest point of it.

I think it's a good thing. I have it running for st_shortestline and st_distance but I have to modifie it to also take st_max_distance and st_longestline.

comment:6 Changed 11 years ago by robe

Yah still have to pass it thru my torture tester. By the way -- for the sampling of 4 parcels I did against the 15 neighborhoods (60 records)

SELECT n.name, l.pid, ST_Distance(l.the_geom, n.the_geom) As dist from boston.nei As n CROSS JOIN boston.sample_land As l ORDER BY l.pid, n.name;

Yours completes (1.5 with dist patch) in 47 ms The postgis 1.4 (no dist patch) completes in 593 ms and returns the same answer.

So I'm pretty excited so far. Well we will see after I test against every permutation of geometry types. The above were just largish multipolygon against small polygon

Thanks, Regina

comment:7 Changed 11 years ago by nicklas

I have found some serous trouble but I hope we can fins some way around.

this is not working with the pach:

select st_distance('LINESTRING(9 5, 10 10, 10 20 ,9 25)'::geometry,'LINESTRING(6 17, 8 17)'::geometry);

select st_astext(st_shortestline('LINESTRING(9 5, 10 10, 10 20 ,9 25 )'::geometry,'LINESTRING(6 17, 8 17)'::geometry));

I don't know how I can have missed that case, it is very common. This is bad news but I think there should be some way around to find.

/Nicklas

comment:8 Changed 11 years ago by robe

Hmm I see. Strange. I ran my torture ST_Distance tests against this (but it wasn't really designed to catch corner cases). It did come up with when dealing with M and Z geometries, your distance is off by a decimal from the other. I wonder if this is some other difference between 1.4 and 1.5 and not specifically your changes.

See below query: In 8.4 1.4 gives -- 5.65685424949238

4.24264068711929

In 8.4 1.5 with your patch gives -- 5.65685424949238

4.24264068711928

But its not a significant difference: --ST_Distance LineSet3D(g1, g2): Start Testing MultiLineMSet, LINESTRING

SELECT ST_Distance(foo1.the_geom, foo2.the_geom)

FROM ((SELECT ST_Collect(s.the_geom) As the_geom

FROM (SELECT ST_MakeLine(ST_SetSRID(ST_MakePointM(i,j,m),4326),ST_SetSRID(ST_MakePointM(j,i,m),4326)) As the_geom FROM generate_series(-10,50,20) As i

CROSS JOIN generate_series(50,70, 25) As j CROSS JOIN generate_series(1,2) As m WHERE NOT(i = j)) As s)) As foo1 CROSS JOIN ((SELECT ST_SetSRID(ST_MakeLine(ST_MakePoint(i,j,k), ST_MakePoint(i+k,j+k,k)),4326) As the_geom

FROM generate_series(-10,50,20) As i

CROSS JOIN generate_series(40,70, 20) j CROSS JOIN generate_series(1,2) k ORDER BY i, j, i+j+k, k, i*j*k)) As foo2 LIMIT 2;

comment:9 Changed 11 years ago by nicklas

I think I can explain this difference. In my functions I always calculate a point along an edge, because it's needed in st_shortestline and then I push that point to the point-point distance calculation just the same way as if it would have been two vertexes. Before the distance was calculated to the edge without identyfying the point. Because of that there is a round at the last decimal to store the point and I guess that's why it differs there.

About the issue I mentioned earlier it seems to be fixable, but it will take a day or two before I have time for it.

/Nicklas

comment:10 Changed 11 years ago by nicklas

The work in this ticket is now found in :

http://svn.osgeo.org/postgis/spike/nicklas/distcalc/

there, the bug mentioned here 08/07/09 11:29:24 is now fixed.

This means the patch here is outdated.

/Nicklas

comment:11 Changed 11 years ago by robe

Nicklas, I took a quick look at your code. Your commenting style for function description is a little off so I don't think it will be picked up by our doxygen auto documentor. YOu probably made the mistake of copying some of the bad commenting style in the other parts of the code base. Please read below (which is still a work in progress).

http://trac.osgeo.org/postgis/wiki/DevWikiPatch

Paul if you are not asleep, can you fill in the section on CUnit Testing http://trac.osgeo.org/postgis/wiki/DevWikiMain

comment:12 Changed 11 years ago by nicklas

Yes, I saw the wiki-post yesterday and will take a look at it as soon as possible.

/Nicklas

Changed 11 years ago by nicklas

Attachment: distance.patch added

comment:13 Changed 11 years ago by nicklas

As discussed above in august the distance-calculation of today gives a slightly different answer in some cases than this new calculation.

here is an example:

select[[BR]]
st_distance('POINT(774983.955351382 2949487.72566304)'::geometry,[[BR]]
 'LINESTRING(775212.698335966 2948028.5564237,775121.04258857 [[BR]]
2950068.84568561)'::geometry)

in todays calculation this gives :
163.028426002135 meters, and in my calulation:
163.028426002146 meters

The thing is that the answer in the distance-calculations is given in double precision data type. in this case that gives room for 12 decimals but the coordinates that the distance is calculated from only has 8 or 9 decimals. So no one of the answers is right or wrong. The reason for the difference between the two ways of calculating is that in my calculation a point is calculated on the line and then is the distance calculated to that point instead of directly to the line in itself. That point also have the smaller number of decimals so the answer will get another round-of error than the calculatin of today.

But note, no one of the answers is right. The deciamals beyond the coordinates precision is worthless (besides they represent about the distance between the eyes of a bacteria if they had eyes) since they can't be calculated from the coordinates precision.

How to handle this. It is very small distances we are talking about but it can cause trouble in regression tests and if people is comparing distances on this last decimals.

I guess this is a common issue. Is it a problem? Have I understood the precision problem of floating data-types right.

comment:14 Changed 11 years ago by robe

For what its worth. The above example is using Mass State Plane feet (not meters), so from my perspective the difference is even less significant. An the exercise I did this on (150,000 parcel distance from a line, I think this accounted for 25000 records. The others were exactly the same).

I suspect it will change from platform to platform and possibly even the GCC version you compile with.

I think any regress test we do will take into consideration some sort of tolerance and we'll need to decide what that is. I was thinking to judge the rightness of each, we should probably run the same test against another spatial database like SQL Server or Oracle or IBM DB2 (preferably all 3) and see what their distance measurements come up with.

I can try the same exercise loading the data into SQL Server and possibly Oracle to see how far those are from these. I'm curious myself to see how far our distance calculations compare to other databases.

comment:15 Changed 11 years ago by nicklas

Yes, this will never be of practical importance. But I think it was a little bit interesting, I have never thought of that there is room for more precision in storing the geometries closer to the origo of the coordinate system than far away. If the "whole-unit part" of the coordinate takes 7 positions there is only 7 more to describe the decimals compared to 13 or maybe even 14 decimals, closer to origo than one unit.

I have played a little with the numbers and I don't see the logic how many decimals is returned in our calculations. Sometimes it seems like the same number is returned as inputed but not considering how many digits representing decimals and how many representing whole units.

Sometimes I get the whole, double precision answer, even if it is not needed. In this example this ccauses a little strange answer, I think:

select st_length('LINESTRING(10000 1,10000.1 1)'::geometry)

I get 0.100000000000364 instead of 0.1

What is deciding how many decimals should be returned? is it the planner or the postgis functions?

/Nicklas

comment:16 Changed 11 years ago by robe

Curiously I get the same answer in SQL Server 2008

SELECT Geometry::STGeomFromText('LINESTRING(10000 1,10000.1 1)',0).STLength();

yields: 0.100000000000364

Have no clue. But I guess it can't be something specific about PostgreSQL -- unless its some old code borrowed way back by the original Ingres team before split off to form (PostgreSQL/Illustra/Informix , Sybase -> SQL Server)

comment:17 Changed 11 years ago by pramsey

You're just bumping up against the internal maximum precision. 0.1 cannot be stored in whole numbers in IEEE double, so you end up with a (repeating, usually) decimal if you print out enough digits.

comment:18 Changed 11 years ago by nicklas

ok, but why does this give 0.1

select st_length('LINESTRING(1 1,1.1 1)'::geometry);

comment:19 Changed 11 years ago by pramsey

Perhaps someone will post the canonical examples of representation/precision issues. Like the same mathematical answer giving different representations depending on what order the operands are applied. Anyhow, if I tell you that 0.99999999, 1.0 and 1.00000000000001 all are the "same", does that give you a flavour? Different underlying ieee results that represent the same attempt at representing a double will spool out to different representations. Write up your own little test program that fires out 35 decimal places and try different little bits of math. Unfortunately, our string writing code is kind of all hand-balled into place. nice programmer rennovation might be to find all the places that print coordinates and build a standard API for them so we can control apparent output precision in one location.

comment:20 Changed 11 years ago by robe

Interesting SQL Server also agrees with our original distance calculation. I was really expecting the last set of digits to be different just because I assume their algorithm most be slightly differnt. Hmm -- I really should get my Oracle box up. I thought these things would be different in the last digits unless everyone in the world has decided on a standard way of calculating these things.

SELECT Geometry::STGeomFromText('POINT(774983.955351382 2949487.72566304)',2249).STDistance(Geometry::STGeomFromText('LINESTRING(775212.698335966 2948028.5564237,775121.04258857 2950068.84568561)',2249));

Gives: 163.028426002135

But Like you said nicklas -- those numbers at the end are kind of irrelevant. I'm not up with IEEE stuff so can't comment on what either of you are saying.

comment:21 Changed 11 years ago by nicklas

I think the reason, sql server gets the same answer as postgis of today is because they do the exactly same calculation.

It is described here: http://www.faqs.org/faqs/graphics/algorithms-faq/ and can be found by searching for: "Subject 1.02: How do I find the distance from a point to a line?"

So what postgis does today and apperently also sql server is to calculate the distance directly from the point to the line. A little bit further down in the text in the link it is described how to find the point on the line to where the distance is calculated. The thing then is that the point has the same "bad" precision as the original inputed vertexes wich dives a slightly different answer. That is no problem because the extra precision in the distance-answer makes no sense since the vertexes is inaccurate.

But then we know some more about the sql-server code :-). They have not prepared it for shortestline function I guess.

I think it is a strength if the distance-calculation gives exactly the same answer as the length of the shortest line and that I think should be the case now.

comment:22 Changed 11 years ago by robe

Ah yes its good we can figure out how a black box works by interrogating it.

Speaking of ST_ShortestLine and ST_LongestLine, as much as I hate non-symmetric functions and introducing trivial functions, I really think we should have a ST_ClosestPoint and ST_FurthestPoint. Which would be wrappers like the below (might be better to do them in C than as SQL wrappers)

ST_ClosestPoint(A,B) => SELECT ST_StartPoint(ST_ShortestLine(A,B));
ST_FurthestPoint(A,B) => SELECT ST_StartPoint(ST_LongestLine(A,B));

And as you can see would return the respective closest/furthest point on A from geometry B. The reason I think we need these is because I think 80% of the time people will just take our 2 gifts for the price of 1 and pull out one of them and discard the other. At least that's what in 80% of my usecases I would do.

Its also a vocabulary thing -- I think people think in terms of closest furthest, nearest furthest. So when you say shortest line/longest line -- I think people will be scratching their heads, where as if you say "This will give you the closest point", then lightbulbs will go off. The longest and shortest should of course stay since when you need both points at the same time you would use those.

comment:23 Changed 11 years ago by nicklas

I think you are right. It would be more desriptive name for the usual usecase. I was a little afraid to try to introduce that many new functions :-)

The reason for these functions would be for the vocabulary only, because it will give no better performance. Both points will be identified anyway in the process of finding the one of interest.

As you say it will also be the right choise to use shortest longest line when you need both because it will do the job for both points in the same time as closest furthest needs each.

So I agree, for vocabulary reasons. Should I put it on TODO-list?

comment:24 Changed 11 years ago by robe

Description: modified (diff)

Done

comment:25 Changed 11 years ago by robe

Description: modified (diff)

forgot the breaks

comment:26 Changed 11 years ago by nicklas

For the record

ST_ClosestPoint(A,B) => SELECT ST_StartPoint(ST_ShortestLine(A,B));
ST_FurthestPoint(A,B) => SELECT ST_StartPoint(ST_LongestLine(A,B));

above should be

ST_ClosestPoint(A,B) => SELECT ST_EndPoint(ST_ShortestLine(A,B));
ST_FurthestPoint(A,B) => SELECT ST_EndPoint(ST_LongestLine(A,B));

if I don't missunderstand you Regina.

Then you get the closest and most far point in geometry B, from geometry A
from the today definition of shortest and longest line.
Here might be a small problem in what people ecpect. closest point is no problem I thing but frthest point might give the impression that it is the point most fas from geometry A but it is the most far away point in geometry B from the most far away point in geometry A.

Maybe it is only closest point that is of real interest? To avoid missunderstanding.

Or maybe closest and furthest points should be the beginning and the end of shortest line instead? like this:

ST_ClosestPoint(A,B) => SELECT ST_StartPoint(ST_ShortestLine(A,B));
ST_FurthestPoint(A,B) => SELECT ST_EndPoint(ST_ShortestLine(A,B));

Then closestPoint is a point on geometry A and furthestPoint is a point on geometry B.

???

comment:27 Changed 11 years ago by robe

Hmm well closestPoint is definitely more clear than furthest point, and I see what you are saying. Maybe we should leave that one out and let people make it if they need it.

I guess I was thinking along the line of the question: If I were to design a train line (in this case I would draw my trainline as multipoints where each point is a train stop) to service an area. Which point in my service district has the furthest distance to trek to get to a train stop. So that's why I kind of thought of them similarly.

comment:28 Changed 11 years ago by robe

disregard what I said -- I see I'm falling into the same trap you described. Okay lets definutely leave that one out.

comment:29 Changed 11 years ago by nicklas

ok, but what about the start or the end of st_shortestline. I guess the end of the line is the most common to use to answer the question where is the closest land to swim to from this island.

But the next question is: from where on the island should I start to get the shortest swim-distance. or maybe a more common use. get the points on the road closest to the inaccurate GPS-points. Then it is most convinient with the startpoint of the shortestline again. st_startpoint(st_shortestline(the_road, the_GPS_point))

I have noticed some posts on the list with this type of problems. But I don't know wich approach is most intuitive.

I don't even see the alternatives clear :-)

comment:30 Changed 11 years ago by robe

For furthest one which we aren't going ot implement anyway -- that was a typo -- I meant ST_StartPoint.

I think ST_StartPoint is slightly clearer because it more closely follows the convention of the other non-symmetric geometry function I can think of ST_Difference.

E.g ST_Difference(A,B) -> returns a part of A

Reading it out loud its not as clear as the others

A contains B - A within B geomA closest point geomB -- hmm does sound a bit

though geomA closest point to geomB -- if you add a to there its a bit clearer

comment:31 Changed 11 years ago by nicklas

OK, I have added a function returning the point in the first inputed geometry that is closest to the second one geometry.

I still think it is difficult to find a good name. Now the name of the function is ST_ClosestP_in_A2B(geometry, geometry)

/Nicklas

comment:32 Changed 11 years ago by nicklas

I have not added any documentation or regression tests to this function yet. About the regression tests I don't know if it is nessecary. I guess it is better to focus the regression tests on st_shortestline, then the errors should show.

comment:33 Changed 11 years ago by robe

I guess naming is going to be tricky. I suppose no other spatial database supports this function that we can see what they call it?

I don't like ST_ClosestP_in_A2B for 2 reasons

1) too many underscores so is not consistent with how we name other functions. To be consistent it would only have the ST_ part and get rid of other _

2) the A2B while clearer again is inconsistent. When we say

ST_Difference(A,B) -- we don't call it ST_DifferenceAnotInB

Yes I know its an OGC standard, but still I think if we stick with the same kind of convention that any object that returns a geometry that is part of one of the geometries should be written as

somefunc(A,B) --- implies part returned is part of A

comment:34 Changed 11 years ago by nicklas

Ok, I agree with your arguments.

Is your first proposal maybe the best alternative ST_ClosestPoint(A,B)


My problem with that is that I intuitively feels that ST_ClosestPoint(A,B) returns a point from B. I think it maybe is because I have a feeling that my perspective should be from the first geometry and then I'm interested in the closest point in the second geometry. Like me being in a polygon of a country asking for the closest point in another country. Then I'm not intrested in a point in my own country. But I don't know it that perspective makes sense for anyone else :-)

comment:35 Changed 11 years ago by robe

Yah that was my first intuitive feeling too. For ST_Difference its a bit clearer since I think A - B.

I suppose it doesn't matter as long as we document it.

our short doc is going to say (which will show in psql and pgAdmin will say something like)

"Returns the point in geometry A that is closest to B."

or if you change the other way "Returns the point in geometry B that is closest to geometry A"

You are the inventor of this function -- you should have the final decision. Though would be nice to hear from others.

comment:36 Changed 11 years ago by nicklas

As I have mentioned, I have had an idea about better subgeometry-handling in the distance-calculations. Now I have come up with a first try.

What it does, simplified is:
1) unwinds all subgeometries and puts them in a list without hiearchy.
2) calculates bboxes for the subgeometries
3) order geometries from distances between bboxes
4) calculates real distances in list order until distance between bboxes is bigger than smallest found distance between real geometries.

For single geometries this might give a little overhead, but it doesn't seem to be noticable. There are still som things to do to let the single geometries passing by.

The best thing about this I think is the possibility to collect a lot of geometries and fast get the distance or with shortestline get a line to the closest geometry.
One examples to show why I belive in the idea. From the dataset Regina put in #137
To find the closest point in the rest of USA from Alaska we can then do:

select st_distance(a.the_geom, b.the_geom),
  st_endpoint(st_shortestline(a.the_geom, b.the_geom)) as the_geom
    from 
(select st_collect(the_geom) as the_geom 
  from us.states where state != 'Alaska') a,
(select the_geom as the_geom 
  from us.states where state = 'Alaska') b

Querytime at my laptop is less than 400ms and then we have compared alaska with 52 other states.

The patch subgeomhandling applies to my spike

It is probably not stable yet, but is it a good idea?
/Nicklas

comment:37 Changed 11 years ago by pramsey

Yes, I think it's a good idea. Keep working towards stability, we'll be rolling these patches in shortly for 1.5.

comment:38 Changed 11 years ago by nicklas

Now I've done two commits to spike /nicklas/distcalc

r4714 is renaming to st_closestpoint as discussed above. I also applied the hack with st_convexhull for max-distance functions. Because they then have to be below st_convexhull in the sql-file I moved all the new functions down do the bottom before the sql-functions.

The other commit r4715 to the spike is the new subgeometry handling. If it shows to have big problems please try before r4715 to see if the problem is caused before or after that.

It now passes the regression tests and gives the same answer at my usual test-cases. I also think it seems to perform well even when there is no sub-geoms.

/Nicklas

Changed 11 years ago by nicklas

Attachment: subgeomhandling.patch added

now included in spike/nicklas/distcalc

Changed 11 years ago by nicklas

Attachment: dist.odt added

same doc with some typos removed and in odt :-)

comment:39 Changed 11 years ago by pramsey

Resolution: fixed
Status: assignedclosed

Merged into trunk at r4894. Nicklas, no more changes on your spike please, if you want to to sandbox'ed work, start a new spike. Otherwise, put bug fixes and cleanups into trac as patches against trunk.

comment:40 Changed 11 years ago by nicklas

Ok, great :-)

Note: See TracTickets for help on using tickets.