Opened 15 years ago

Closed 15 years ago

Last modified 15 years ago

#230 closed defect (fixed)

Put in costs in _ST and possibly other functions

Reported by: nicklas Owned by: robe
Priority: medium Milestone: PostGIS 1.5.0
Component: postgis Version: 1.4
Keywords: st_dwithin st_expand bbox Cc:

Description (last modified by robe)

I'm revising this since this seems to be an issue mostly because we don't have COSTS assigned to our _ST functions, so the planner in certain cases (like with large geometries), chooses to do a table scan instead of an index scan.

Since costs are only supported in PostgreSQL 8.3+ and we are also going to need typmod support for geography. I'm wondering if we want to just push this to 1.5 and also make PostgreSQL 8.3+ a requirement for 1.5 instead of introducing messy conditional code to not put costs in if 8.2 or lower.


using dataset from states.backup as example

the bbox-comparasion doesn't seem to always happen before the _st_dwithin calculations. To isolate the problem I used _st_dwithin instead of st_dwithin and played with the bbox-comparasions in the query instead. then, if I run:

select a.state from us.states a, us.states b 
	where b.state = 'Hawaii' and a.state != 'Hawaii' 
		and a.the_geom && b.the_geom;

or 

select a.state from us.states a, us.states b 
	where b.state = 'Hawaii' and a.state != 'Hawaii' 
		and st_expand(a.the_geom, 0) && b.the_geom; 

it tells in 30 ms that there is no hit, just as expected the same happens if I try:

select a.state from us.states a, us.states b 
	where b.state = 'Hawaii' and a.state != 'Hawaii' 
		and a.the_geom && b.the_geom and _st_dwithin(a.the_geom, b.the_geom, 0);

fast answer that there is no hit.

BUT

if I run

select a.state from us.states a, us.states b 
	where b.state = 'Hawaii' and a.state != 'Hawaii' 
		and st_expand(a.the_geom, 0) && b.the_geom and _st_dwithin(a.the_geom, b.the_geom, 0);

then, the fun is over. It starts running for 220000 ms before I get the answer. My guess is that in this case _st_dwithin is trigged before the bbox-comparation so it makes the distance calculations to every one of the other polygons in the dataset. It behaves the same if I run:

select a.state from us.states a, us.states b 
	where b.state = 'Hawaii' and a.state != 'Hawaii' 
		and st_dwithin(a.the_geom, b.the_geom,0)

where the function is supposed to handle the bbox-comparasion.

/Nicklas

Attachments (1)

states.backup (5.6 MB ) - added by nicklas 15 years ago.
Dataset from ticket 137

Change History (28)

by nicklas, 15 years ago

Attachment: states.backup added

Dataset from ticket 137

comment:1 by nicklas, 15 years ago

Milestone: postgis 1.5.0postgis 1.4.1
Version: 1.4

comment:2 by robe, 15 years ago

Owner: changed from pramsey to robe
Status: newassigned

Nicklas, I don't have proof of this, but I suspect this problem only happens with really large geometries and also if you don't have a limit or your limit reaches a certain threshold.

I'm putting the blame right square on this st_expand(geometry, double precision) overloaded function. I suspect for large geometries — this is passing the huge geometry down the throat of the C layer (when in theory it should just be reading the bounding box and expanding that) in (and also some how messing up the planners favor of an index). Well that's just a theory. I haven't fiddled with costs and so forth.

If you look at our slides — we some how managed to escape this horrible fate you have demonstrated, but I did confirm your results happen even on 1.3)

See slide 19 all use indexes except middle one which we wouldn't expect to — strange huh. Only difference is we use a limit and also we simplify in later cases to reduce the size of the geometry.

http://www.bostongis.com/downloads/oscon2009/Oscon2009_PostGISTips.pdf

comment:3 by nicklas, 15 years ago

I have been looking some more on this. I don't think that it is st_expand that uses all the power, but when we get over some border the planner decides to do the _st_dwithin before bbox comparasion.

both _st_dwithin and st_expand has cost 1. I guess that makes it a little randomly wich one is calculated first. Is there a reason for that. I tried to give _st_dwithin cost 100 and that seems to solve the problem.

/Nicklas

comment:4 by robe, 15 years ago

Well the main issue is that costs were introduced in 8.3 so they aren't supported in 8.2. That is one reason we haven't applied costs. So we'll need ot incorporate it in such as fashion as to not apply to 8.2.

The other issue is that we just haven't gotten around to applying costs to all the functions. I think we can take an easy rule of thumb and just set them all to 100 except for the operators though. Then apply say 200 for more intensive like ST_Distance.

comment:5 by nicklas, 15 years ago

Yes, in cases like this a distinction would be valuable.

The thing is here that st_distance has cost 100 but that doesn't help the planner to decide when to use _st_dwithin.

/Nicklas

comment:6 by robe, 15 years ago

It wouldn't. _St_Dwithin and ST_Distance are completely different functions as far as PostgreSQL is concerned. Prior to 1.3 ST_DWithin did use ST_Distance, but it doesn't anymore.

comment:7 by nicklas, 15 years ago

sorry, wrong from me. I meant st_dwithin now has cost 100, but _st_dwithin has cost 1.

I didn't mean to put in st_distance

/Nicklas

comment:8 by robe, 15 years ago

Why would it? Because its written in SQL, the function is transparent (sql functions are generally inlined). See slides — the link to the index call is part of ST_DWithin. You will also notice that if you look at the pgadmin planner — the ST_Dwithin call is split into 2 — the index match and the _ST_DWithin with an extra superfluous EXPAND that never uses an index).

The extra expand is simply to make the function symettric. Even though the answers to expand are identical — one is indexable and one is not (and only at run-time would you know which one is and which one isn't) so the planner will choose to test the indexable one first and never test the rest of the function if the index call fails.

comment:9 by nicklas, 15 years ago

I don't know if I understand you right Regina, but my suggestion comes from this: I use the last example above with st_dwithin

with cost 1 at _st_dwithin, the explain of the query looks like this. It puts _st_dwithin before the bbox comparasions:

"Nested Loop (cost=0.00..4.50 rows=1 width=9)" " Join Filter: (_st_dwithin(a.the_geom, b.the_geom, 0::double precision) AND (a.the_geom && st_expand(b.the_geom, 0::double precision)) AND (b.the_geom && st_expand(a.the_geom, 0::double precision)))" " → Seq Scan on states b (cost=0.00..1.66 rows=1 width=442568)" " Filter: ((state)::text = 'Hawaii'::text)" " → Seq Scan on states a (cost=0.00..1.66 rows=52 width=442577)" " Filter: ((a.state)::text ≠ 'Hawaii'::text)"

But if I change the cost of _st_dwithin to 100 the same exlpanation looks like this:

"Nested Loop (cost=0.00..10.21 rows=1 width=9)" " Join Filter: ((b.the_geom && st_expand(a.the_geom, 0::double precision)) AND _st_dwithin(a.the_geom, b.the_geom, 0::double precision))" " → Seq Scan on states b (cost=0.00..1.66 rows=1 width=442568)" " Filter: ((state)::text = 'Hawaii'::text)" " → Index Scan using us_idx_states_the_geom on states a (cost=0.00..8.27 rows=1 width=442577)" " Index Cond: (a.the_geom && st_expand(b.the_geom, 0::double precision))" " Filter: ((a.state)::text ≠ 'Hawaii'::text)"

Now it puts _st_dwithin after the bounding-box comparasion and the query runs in 16ms instead of very very long time.

/Nicklas

comment:10 by robe, 15 years ago

Exactly observe ( that part is coming from ST_DWithin — if you look at the definition of ST_DWithin — it has 2 expand calls - this is one of them)

" Index Cond: (a.the_geom && st_expand(b.the_geom, 0::double precision))" " Filter: ((a.state)::text ≠ 'Hawaii'::text)

The join filter part has the portion of ST_DWithin that doesn't use an index (see how the second expand that is pushed to index is missing)

" Join Filter: ((b.the_geom && st_expand(a.the_geom, 0::double

precision)) AND _st_dwithin(a.the_geom, b.the_geom, 0::double precision))"

The planner has split the function into its constituent parts. So lesson learned putting cost on transparent SQL functions that use others is not effective.

comment:11 by nicklas, 15 years ago

Ok I've learned a lot about how to understand the explain. Thanks :-)

But I still don't get what the downside is to put cost 100 to _st_dwithin. The it starts using the index and it all looks fine… ,not ??

I will look through your slides to learn more.

Thanks Nicklas

comment:12 by robe, 15 years ago

nothing wrong with that I can think of. In fact I think all the _st functions should have costs 100 sounds as good as any.

comment:13 by nicklas, 15 years ago

Great :-) I missunderstood you.

comment:14 by robe, 15 years ago

Description: modified (diff)
Milestone: postgis 1.4.1postgis 1.5.0
Summary: st_expand seems to affect the execution order wich affects st_dwithinPut in costs in _ST and possibly other functions

comment:15 by nicklas, 15 years ago

Is there a problem for postgresql8.2 if we put costs in? Isn't the costs is there now but all is set to cost 1? Will postgresql 8.2 installations be affected by diversing costs?

/Nicklas

comment:16 by robe, 15 years ago

I don't think PostgreSQL 8.2 understands functions costs. That feature was introduced in 8.3 and the planner was changed accordingly to look for it. So if we put costs in I'm pretty sure the .sql files won't load in 8.2. though I haven't verified but I do recall in some o

comment:17 by nicklas, 15 years ago

OK, I see

cost 1 in postgresql 8.3 and 8.4 is just default values. From what comes cost 1 in 8.3, 8.4 installations, it is not mentioned in postgis.sql.in.c?

Couldn't we put the costs in a separate sql-file with every function represented like:

ALTER FUNCTION _ST_DWithin(geometry,geometry,float8) cost 100;

I can see some advantages of that.

1) in installation just one check for postgresql version and if 8.3 or above —> run costs.sql after postgis.sql
2) users could easily move customized costs from one installation to another.
3) we could put new recommended costs in the wiki for example, between releases.

Espescially in the beginning I guess there will be some calibration in the recommended costs. 2 and 3 above is of course not depending on that we distribute costs.sql in the installation, but then the user have one place to play with the costs and a recommended way of storing the customizing by editing the costs.sql.

Wouldn't it be easier too, to try something like Kevin suggesting of getting costs from statistics if they where held separately.

/Nicklas

comment:18 by robe, 15 years ago

Nicklas, Very Good idea. I like the idea of keeping the costs in a separate file. That way if people decide they want different costs for their functions, they can simply just not run that file. However — I still want to get rid of 8.2, just from a keeping my sanity point of view. I would like it that at any point in time, we only need to test at most 2 versions of PostgreSQL for new features. Trying to remember what we can and can't do for 2 versions is enough without having to worry about a third. Now we have 8.3 and 8.4 and 8.5 has already started their cycle — so we'll want to start testing out 8.5 in probably another 3-4 months (or less).

I'm not saying we won't support 8.2. we just shouldn't support it for 1.5 and beyond. I still expect we will support 1.4, 1.3 for years to come, but that support will just be security fixes and major bug fixes that could corrupt data.

comment:19 by nicklas, 15 years ago

Sounds fair and with the costs in a separate file we can still avoid issues like the cause of this ticket for people running 1.3 and 1.4 in postgresql 8.3 or above. And the people wanting the new stuff in 1.5 probably want the new stuff in postgresql 8.4 too.

/Nicklas

comment:20 by mcayland, 15 years ago

Since the geography type will require 8.3 minimum for geography support, then lets just add the costs directly into the .sql file. I agree that we should start by simply adding costs to the functions within any indexable ST_* functions, perhaps a cost of 100 to the more expensive function and then do some testing to see what happens there.

ATB,

Mark.

comment:21 by nicklas, 15 years ago

Mark, by .sql file do you mean postgis.sql? Isn't it much easier to change the costs if they are in a separate file?

/Nicklas

comment:22 by mcayland, 15 years ago

Yes, correct. The aim should be for us to pick defaults that "just work" for the majority of people rather than complicating matters by having to tune individual installations.

I know from my work on the statistics side of PostGIS that it is a pointless exercise trying to perfect our estimates so that they are accurate to reality; this is mainly because the existing routines in PostgreSQL are only estimates too. All we need to do is ensure that the more expensive functions are weighted heavier than their cheaper equivalents by a reasonable amount of units so given a dead heat in terms of planner cost, it will always choose the cheaper function first.

ATB,

Mark.

comment:23 by pramsey, 15 years ago

Resolution: fixed
Status: assignedclosed

Committed to trunk at r4757

comment:24 by kneufeld, 15 years ago

Paul, setting all the functional costs to 100 is a great start but shouldn't this ticket remain open? I think the point of this is to inform PostgreSQL of functional cost estimates of every _ST function so it knows how to sort them in the planner.

It's much more costly to use buffer than it is to compute the centroid of a geometry. So if I have "WHERE ST_IsValid(ST_Buffer(geom)) AND ST_Distance(ST_Centroid(geom), 'POINT(0 0)') > 10" I would want the planner to automatically place the centroid calculation first, THEN compute buffer if the first half of the condition succeeds.

For this, we'll need functional estimates on every function that appropriately estimate the cost.

http://postgis.refractions.net/pipermail/postgis-devel/2009-July/006109.html

comment:25 by pramsey, 15 years ago

The proximate defect flagged by the ticket has been solved with this commit. If you want an enhancement ticket for a future version, make up a new one in a reasonable milestone (and bear in mind you will be the one doing the implementation :)

comment:26 by robe, 15 years ago

I agree with Kevin on this. Paul I think you have gone a bit overboard.

Why can't we set just the obvious ones that have caused problems

_ST_DWithin, _ST_Intersects (and the other predicates). This is really more of a situation problem when indexes are involved as the lower costs sometimes makes the index operation not happen.

I'm also very concerned about the rate you are closing things off :). I should be excited, but all I can think is "Paul, stay away from the caffeine". When are we going to have time to test all this.

comment:27 by mcayland, 15 years ago

+1 from me, we definitely need to ensure the original test cases work as defined as its a common use case.

I also notice in your commit you've bumped the minimum PostgreSQL version to 8.3 in the documentation, although isn't this actually a requirement of incorporating the new geography typmod code as well as adding costs to functions? I think any change that requires a minimum PostgreSQL version change should incur a polite email to -devel for feedback rather than just committing blind.

Also, if we do bump the minimum version, you need to update the minimum version check in configure.ac.

ATB,

Mark.

Note: See TracTickets for help on using tickets.