Opened 15 years ago

Last modified 14 years ago

#231 closed enhancement

New distance-calculations proposal — at Version 24

Reported by: nicklas Owned by: robe
Priority: medium Milestone: PostGIS 1.5.0
Component: postgis Version:
Keywords: Cc:

Description (last modified by robe)

A hopefully more structured ticket than #137

The patch attached includes:
new functions:

  1. st_shortestline
  2. st_longestline
  3. st_dfullywithin
  4. st_max_distance
    (this is really a fix to a long standing bug — st_max_distance has existed since at least 1.2 but has never worked - (older versions give Not yet implemented))
  5. st_closestpoint
  6. st_furthestpoint

enhancements affecting st_distance, st_dwithin and the functions above:

  1. a better handling of "holes" in polygons.
  2. a sudden stop in iteration when the distance in st_dwithin is satisfied. Before it was checked for every subgeom.
  3. a new way of calculating distances between lines and polygons that don't intersect.


Probably the last one is the most important. On big geometries it makes a big difference.

Beside the patch I have attached a document where I try to explain the idea behind the distance-calculation.

/Nicklas

Change History (26)

by nicklas, 15 years ago

Attachment: dist.doc added

explanation of the distance calculation

comment:1 by robe, 15 years ago

Nick,

Does this particular patch include the fix for ST_Max_Distance as well (i saw it in the patch you sent). I know you had it in a separate patch and I said to hold it separately since it was a fix to a long standing bug and not an enhancement. Though not sure how everyone feels about back-porting that particular fix to 1.4.1 since 1.4 and below has the function but gives annoying message "This function is unimplemented yet" so technically its not an API change — its a bug fix.

Thanks, Regina

comment:2 by nicklas, 15 years ago

Yes, you are right, it includes st_max_distance too. Maybe you can put it in the description. I don't think I can. /Nicklas

comment:3 by robe, 15 years ago

Description: modified (diff)

comment:4 by robe, 15 years ago

Owner: changed from pramsey to robe
Status: newassigned

Nicklas,

I did some tests with this against my real data and it looks really good.

For example I did a test on the ST_DWithin for my 190,000 data set of parcels against my 15 boston neighborhoods and the answers from yours are the same as what I get with a PostGIS 1.4 install, but yours appears to be faster:

I ran the below query SELECT COUNT(*) As tot, n.name FROM neighborhoods As n INNER JOIN landparcels As l ON

ST_DWithin(n.the_geom, l.the_geom,0.001) GROUP BY n.name ORDER BY n.name

with my PostGIS 1.4 install — takes —385719ms - 1.4 to run this query - with my PostGIS 1.5SVN patched with your distance takes — — 249859 ms

I also did some random samples of ST_Distance between each neighborhood and a given landparcel and the numbers agreed down to as many decimals as displayed (around 9 decimal places).

So from a cursory user standpoint, this patch looks really good. I'll try to do more extensive testings and test out the ST_Max_Distance. Since I don't have a regress that since its broken in older releases and new functions, I'll probably have to pull out a pen and paper for those or some plpgsql equivalent to test against it.

Mark,Paul, Kevin — can you please look at Nicklas patch. I was afraid to look at the code since I'm not a good judge of c-programming, but if you think it looks fine and no glaring errors, I vote to accept it into 1.5 and consider back-porting (the speed and st_max_distance fix) to 1.4.

Thanks, Regina

comment:5 by nicklas, 15 years ago

Great

One thing to check is if we should use the old calculations for smaller geometries. I guess maybe my function gives more overhead-cost than gain. It also would be very easy to put in a rule where for instance npoints og geom1 * npoints of geom2 is more than 500 or something like that otherwise it just uses the old one. The real gain is big geometries far from eachother. Then I have had cases where the querytime reduses from 220000ms to jus about 2000 ms. Try the queries at your slide 19 you mentioned, Regina. I think the querytime is reduced by approx 90% at least eat some of them :-)

Now I'm working on a proposal for better handling of subgeoms. There is two good things about it. First, of course it will reduce querytime when there is many subgeoms involved in the geometries. st_distance(alaska, texas) in Reginas dataset takes about 7 seconds with my patch, but with this subgeoms handling it is under 1 sec.

But the other nice thing about it, is that it's usable innearest neighbour queries. just collect all geometries and get the closest point with st_shortestline. Then st_shortestline first arrange all incomming geometries like it would have been just any subgeoms. find the closest one and returns a line to the closest point of it.

I think it's a good thing. I have it running for st_shortestline and st_distance but I have to modifie it to also take st_max_distance and st_longestline.

comment:6 by robe, 15 years ago

Yah still have to pass it thru my torture tester. By the way — for the sampling of 4 parcels I did against the 15 neighborhoods (60 records)

SELECT n.name, l.pid, ST_Distance(l.the_geom, n.the_geom) As dist from boston.nei As n CROSS JOIN boston.sample_land As l ORDER BY l.pid, n.name;

Yours completes (1.5 with dist patch) in 47 ms The postgis 1.4 (no dist patch) completes in 593 ms and returns the same answer.

So I'm pretty excited so far. Well we will see after I test against every permutation of geometry types. The above were just largish multipolygon against small polygon

Thanks, Regina

comment:7 by nicklas, 15 years ago

I have found some serous trouble but I hope we can fins some way around.

this is not working with the pach:

select st_distance('LINESTRING(9 5, 10 10, 10 20 ,9 25)'::geometry,'LINESTRING(6 17, 8 17)'::geometry);

select st_astext(st_shortestline('LINESTRING(9 5, 10 10, 10 20 ,9 25 )'::geometry,'LINESTRING(6 17, 8 17)'::geometry));

I don't know how I can have missed that case, it is very common. This is bad news but I think there should be some way around to find.

/Nicklas

comment:8 by robe, 15 years ago

Hmm I see. Strange. I ran my torture ST_Distance tests against this (but it wasn't really designed to catch corner cases). It did come up with when dealing with M and Z geometries, your distance is off by a decimal from the other. I wonder if this is some other difference between 1.4 and 1.5 and not specifically your changes.

See below query: In 8.4 1.4 gives — 5.65685424949238

4.24264068711929

In 8.4 1.5 with your patch gives — 5.65685424949238

4.24264068711928

But its not a significant difference: —ST_Distance LineSet3D(g1, g2): Start Testing MultiLineMSet, LINESTRING

SELECT ST_Distance(foo1.the_geom, foo2.the_geom)

FROM ((SELECT ST_Collect(s.the_geom) As the_geom

FROM (SELECT ST_MakeLine(ST_SetSRID(ST_MakePointM(i,j,m),4326),ST_SetSRID(ST_MakePointM(j,i,m),4326)) As the_geom FROM generate_series(-10,50,20) As i

CROSS JOIN generate_series(50,70, 25) As j CROSS JOIN generate_series(1,2) As m WHERE NOT(i = j)) As s)) As foo1 CROSS JOIN ((SELECT ST_SetSRID(ST_MakeLine(ST_MakePoint(i,j,k), ST_MakePoint(i+k,j+k,k)),4326) As the_geom

FROM generate_series(-10,50,20) As i

CROSS JOIN generate_series(40,70, 20) j CROSS JOIN generate_series(1,2) k ORDER BY i, j, i+j+k, k, i*j*k)) As foo2 LIMIT 2;

comment:9 by nicklas, 15 years ago

I think I can explain this difference. In my functions I always calculate a point along an edge, because it's needed in st_shortestline and then I push that point to the point-point distance calculation just the same way as if it would have been two vertexes. Before the distance was calculated to the edge without identyfying the point. Because of that there is a round at the last decimal to store the point and I guess that's why it differs there.

About the issue I mentioned earlier it seems to be fixable, but it will take a day or two before I have time for it.

/Nicklas

comment:10 by nicklas, 15 years ago

The work in this ticket is now found in :

http://svn.osgeo.org/postgis/spike/nicklas/distcalc/

there, the bug mentioned here 08/07/09 11:29:24 is now fixed.

This means the patch here is outdated.

/Nicklas

comment:11 by robe, 15 years ago

Nicklas, I took a quick look at your code. Your commenting style for function description is a little off so I don't think it will be picked up by our doxygen auto documentor. YOu probably made the mistake of copying some of the bad commenting style in the other parts of the code base. Please read below (which is still a work in progress).

http://trac.osgeo.org/postgis/wiki/DevWikiPatch

Paul if you are not asleep, can you fill in the section on CUnit Testing http://trac.osgeo.org/postgis/wiki/DevWikiMain

comment:12 by nicklas, 15 years ago

Yes, I saw the wiki-post yesterday and will take a look at it as soon as possible.

/Nicklas

by nicklas, 15 years ago

Attachment: distance.patch added

comment:13 by nicklas, 15 years ago

As discussed above in august the distance-calculation of today gives a slightly different answer in some cases than this new calculation.

here is an example:

select[[BR]]
st_distance('POINT(774983.955351382 2949487.72566304)'::geometry,[[BR]]
 'LINESTRING(775212.698335966 2948028.5564237,775121.04258857 [[BR]]
2950068.84568561)'::geometry)

in todays calculation this gives :
163.028426002135 meters, and in my calulation:
163.028426002146 meters

The thing is that the answer in the distance-calculations is given in double precision data type. in this case that gives room for 12 decimals but the coordinates that the distance is calculated from only has 8 or 9 decimals. So no one of the answers is right or wrong. The reason for the difference between the two ways of calculating is that in my calculation a point is calculated on the line and then is the distance calculated to that point instead of directly to the line in itself. That point also have the smaller number of decimals so the answer will get another round-of error than the calculatin of today.

But note, no one of the answers is right. The deciamals beyond the coordinates precision is worthless (besides they represent about the distance between the eyes of a bacteria if they had eyes) since they can't be calculated from the coordinates precision.

How to handle this. It is very small distances we are talking about but it can cause trouble in regression tests and if people is comparing distances on this last decimals.

I guess this is a common issue. Is it a problem? Have I understood the precision problem of floating data-types right.

comment:14 by robe, 15 years ago

For what its worth. The above example is using Mass State Plane feet (not meters), so from my perspective the difference is even less significant. An the exercise I did this on (150,000 parcel distance from a line, I think this accounted for 25000 records. The others were exactly the same).

I suspect it will change from platform to platform and possibly even the GCC version you compile with.

I think any regress test we do will take into consideration some sort of tolerance and we'll need to decide what that is. I was thinking to judge the rightness of each, we should probably run the same test against another spatial database like SQL Server or Oracle or IBM DB2 (preferably all 3) and see what their distance measurements come up with.

I can try the same exercise loading the data into SQL Server and possibly Oracle to see how far those are from these. I'm curious myself to see how far our distance calculations compare to other databases.

comment:15 by nicklas, 15 years ago

Yes, this will never be of practical importance. But I think it was a little bit interesting, I have never thought of that there is room for more precision in storing the geometries closer to the origo of the coordinate system than far away. If the "whole-unit part" of the coordinate takes 7 positions there is only 7 more to describe the decimals compared to 13 or maybe even 14 decimals, closer to origo than one unit.

I have played a little with the numbers and I don't see the logic how many decimals is returned in our calculations. Sometimes it seems like the same number is returned as inputed but not considering how many digits representing decimals and how many representing whole units.

Sometimes I get the whole, double precision answer, even if it is not needed. In this example this ccauses a little strange answer, I think:

select st_length('LINESTRING(10000 1,10000.1 1)'::geometry)

I get 0.100000000000364 instead of 0.1

What is deciding how many decimals should be returned? is it the planner or the postgis functions?

/Nicklas

comment:16 by robe, 15 years ago

Curiously I get the same answer in SQL Server 2008

SELECT Geometry::STGeomFromText('LINESTRING(10000 1,10000.1 1)',0).STLength();

yields: 0.100000000000364

Have no clue. But I guess it can't be something specific about PostgreSQL — unless its some old code borrowed way back by the original Ingres team before split off to form (PostgreSQL/Illustra/Informix , Sybase → SQL Server)

comment:17 by pramsey, 15 years ago

You're just bumping up against the internal maximum precision. 0.1 cannot be stored in whole numbers in IEEE double, so you end up with a (repeating, usually) decimal if you print out enough digits.

comment:18 by nicklas, 15 years ago

ok, but why does this give 0.1

select st_length('LINESTRING(1 1,1.1 1)'::geometry);

comment:19 by pramsey, 15 years ago

Perhaps someone will post the canonical examples of representation/precision issues. Like the same mathematical answer giving different representations depending on what order the operands are applied. Anyhow, if I tell you that 0.99999999, 1.0 and 1.00000000000001 all are the "same", does that give you a flavour? Different underlying ieee results that represent the same attempt at representing a double will spool out to different representations. Write up your own little test program that fires out 35 decimal places and try different little bits of math. Unfortunately, our string writing code is kind of all hand-balled into place. nice programmer rennovation might be to find all the places that print coordinates and build a standard API for them so we can control apparent output precision in one location.

comment:20 by robe, 15 years ago

Interesting SQL Server also agrees with our original distance calculation. I was really expecting the last set of digits to be different just because I assume their algorithm most be slightly differnt. Hmm — I really should get my Oracle box up. I thought these things would be different in the last digits unless everyone in the world has decided on a standard way of calculating these things.

SELECT Geometry::STGeomFromText('POINT(774983.955351382 2949487.72566304)',2249).STDistance(Geometry::STGeomFromText('LINESTRING(775212.698335966 2948028.5564237,775121.04258857 2950068.84568561)',2249));

Gives: 163.028426002135

But Like you said nicklas — those numbers at the end are kind of irrelevant. I'm not up with IEEE stuff so can't comment on what either of you are saying.

comment:21 by nicklas, 15 years ago

I think the reason, sql server gets the same answer as postgis of today is because they do the exactly same calculation.

It is described here: http://www.faqs.org/faqs/graphics/algorithms-faq/ and can be found by searching for: "Subject 1.02: How do I find the distance from a point to a line?"

So what postgis does today and apperently also sql server is to calculate the distance directly from the point to the line. A little bit further down in the text in the link it is described how to find the point on the line to where the distance is calculated. The thing then is that the point has the same "bad" precision as the original inputed vertexes wich dives a slightly different answer. That is no problem because the extra precision in the distance-answer makes no sense since the vertexes is inaccurate.

But then we know some more about the sql-server code :-). They have not prepared it for shortestline function I guess.

I think it is a strength if the distance-calculation gives exactly the same answer as the length of the shortest line and that I think should be the case now.

comment:22 by robe, 15 years ago

Ah yes its good we can figure out how a black box works by interrogating it.

Speaking of ST_ShortestLine and ST_LongestLine, as much as I hate non-symmetric functions and introducing trivial functions, I really think we should have a ST_ClosestPoint and ST_FurthestPoint. Which would be wrappers like the below (might be better to do them in C than as SQL wrappers)

ST_ClosestPoint(A,B) => SELECT ST_StartPoint(ST_ShortestLine(A,B));
ST_FurthestPoint(A,B) => SELECT ST_StartPoint(ST_LongestLine(A,B));

And as you can see would return the respective closest/furthest point on A from geometry B. The reason I think we need these is because I think 80% of the time people will just take our 2 gifts for the price of 1 and pull out one of them and discard the other. At least that's what in 80% of my usecases I would do.

Its also a vocabulary thing — I think people think in terms of closest furthest, nearest furthest. So when you say shortest line/longest line — I think people will be scratching their heads, where as if you say "This will give you the closest point", then lightbulbs will go off. The longest and shortest should of course stay since when you need both points at the same time you would use those.

comment:23 by nicklas, 15 years ago

I think you are right. It would be more desriptive name for the usual usecase. I was a little afraid to try to introduce that many new functions :-)

The reason for these functions would be for the vocabulary only, because it will give no better performance. Both points will be identified anyway in the process of finding the one of interest.

As you say it will also be the right choise to use shortest longest line when you need both because it will do the job for both points in the same time as closest furthest needs each.

So I agree, for vocabulary reasons. Should I put it on TODO-list?

comment:24 by robe, 15 years ago

Description: modified (diff)

Done

Note: See TracTickets for help on using tickets.