Opened 6 years ago

Closed 6 years ago

#1828 closed defect (fixed)

Estimate returned by geography_gist_selectivity results in slow query plan for ST_DWithin

Reported by: realityexists Owned by: pramsey
Priority: medium Milestone: PostGIS 2.0.2
Component: postgis Version: 2.0.x
Keywords: Cc:

Description

I've run into a query performance problem that appears to be due to the && opeartor selectivity estimate (geography_gist_selectivity function).

I have a large, complex query which joins the results of several set-returning functions with some tables and filters them by calling another function, which involves 3 ST_DWithin(geog) calls. This used to run in about 10 seconds until I changed the functions to allow them to be inlined. (They previously had "SET search_path FROM current", which prevented inlining.) Now the query doesn't return in 10 minutes! If I again prevent the filtering function from being inlined (eg. by adding STRICT or SECURITY DEFINER or SET) the time goes down to 20 seconds. If I do the same to one of the set-returning functions it goes down to 15 seconds. It seems to change the query plan at the top level: without inlining it picks a Hash Join or Merge Join (fast), but with inlining it picks a Nested Loop (slow).

I asked about this on pgsql-general http://archives.postgresql.org/pgsql-general/2012-05/msg00370.php and Tom Lane suggested that it might be due to a PostGIS selectivity function returning a poor estimate. When the function is not inlined it uses the default estimate, which happens to be good in this case. If I hack postgis.sql to remove the RESTRICT and JOIN clauses, ie.

CREATE OPERATOR && (

LEFTARG = geography, RIGHTARG = geography, PROCEDURE = geography_overlaps, COMMUTATOR = '&&'

);

then the query runs in 5 seconds, which seems to support this theory.

I can reproduce the problem with the attached simplified test case, running on PostgreSQL 9.1.3 with PostGIS 2.0.0. Run initialise.sql, then query.sql. Note that the query is initially fast (140 ms), but after running ANALYZE the query plan changes from Hash Join to Nested Loop and it takes 15000 ms. If you then delete the table statistics again it goes back to the fast plan. If I mark test_points_are_near as STRICT it uses the fast plan. If I remove one of the ST_DWithin calls it uses the fast plan (see plans.txt).

Attachments (9)

initialise.sql (735 bytes) - added by realityexists 6 years ago.
query.sql (884 bytes) - added by realityexists 6 years ago.
plans.txt (4.1 KB) - added by realityexists 6 years ago.
boxcalc.patch (953 bytes) - added by pramsey 6 years ago.
Pick the box out of the serialization rather than calculating it from scratch.
boxcalc2.patch (2.2 KB) - added by pramsey 6 years ago.
Cleaner patch to read the cached box.
boxcalc3.patch (2.7 KB) - added by pramsey 6 years ago.
Also pull the cached box in the stats calculation stage
backtrace-r10728.txt (1.8 KB) - added by realityexists 6 years ago.
log-r10728.txt (6.3 KB) - added by realityexists 6 years ago.
backtrace-r10729.txt (1.9 KB) - added by realityexists 6 years ago.
Crash as of r10729

Download all attachments as: .zip

Change History (69)

Changed 6 years ago by realityexists

Attachment: initialise.sql added

Changed 6 years ago by realityexists

Attachment: query.sql added

comment:1 Changed 6 years ago by realityexists

Something unrelated I noticed while investigating this: line 1551 in lwgeodetic.c says

	if ( pa1->npoints == 0 || pa1->npoints == 0 )

Is that meant to be

	if ( pa1->npoints == 0 || pa2->npoints == 0 )

?

comment:2 Changed 6 years ago by pramsey

The EXPLAIN ANALYZE plans would be more handy.

comment:3 Changed 6 years ago by pramsey

Removing the RESTRICT and JOIN causes the planner to fall back to the default selectivity estimate which is basically "yeah, it's really selective!" in all cases. Which turns out to be good for your particular query, less good for others. The problem is that somewhere, somehow our estimator is underestimating the selectivity of your particular query parameters, either because the estimate calculations are wrong or perhaps because the stats gathered or wrong or perhaps because of something special/interesting about your geometries themselves.

Changed 6 years ago by realityexists

Attachment: plans.txt added

comment:4 Changed 6 years ago by realityexists

Thanks, what you wrote confirms my understanding of it. EXPLAIN ANALYZE output attached.

comment:5 Changed 6 years ago by pramsey

So, yeah, the planner expects one row to pop out of your spatial filter and instead 6000 do, which throws off the rest of the plan. Can you strip down the query to just the spatial table and filter and see if EXPLAIN ANALYZE still gets things that wrong? Because if you can get it down to one table and one piece of simple SQL we'll have something to work with, repairwise.

comment:6 Changed 6 years ago by realityexists

No, I tried that, but it's fast then. This is the simplest repro I've been able to come up with (and even that took about 5 hours!)

Eg. this query

select id
from _test_pos
where __test_points_are_near('POINT(7 7)', pos);

takes 80 ms and the plan is

Seq Scan on _test_pos  (cost=0.00..10728.00 rows=1 width=4) (actual time=13.492..56.078 rows=5118 loops=1)
  Filter: ((('0101000020E61000000000000000001C400000000000001C40'::geography && _st_expand(pos, 300000::double precision)) AND (pos && '0101000020E61000000000000000001C400000000000001C40'::geography) AND _st_dwithin('0101000020E61000000000000000001C400000000000001C40'::geography, pos, 300000::double precision, true)) OR (('0101000020E61000000000000000001C400000000000001C40'::geography && _st_expand(pos, 400000::double precision)) AND (pos && '0101000020E61000000000000000001C400000000000001C40'::geography) AND _st_dwithin('0101000020E61000000000000000001C400000000000001C40'::geography, pos, 400000::double precision, true)))
Total runtime: 61.388 ms

comment:7 Changed 6 years ago by pramsey

Yeah, this is good though, the point is not the speed, it's the bad estimate. The spatial stuff comes up fast enough, but because the estimate is bad, it lands in the middle of a bad plan which is itself very slow. Your explain shows the behavior we need to diagnose: estimated number of rows is 1, actual rows is 5118. Can you provide the table you are running this SQL against? (And further, cut down the query so it doesn't use a function anymore but still displays the bad estimate?)

comment:8 Changed 6 years ago by realityexists

Sure, in that case the query can be as simple as:

SELECT id
FROM _test_pos
WHERE ST_DWithin(pos, 'POINT(7 7)', 300000)

The table:

CREATE TABLE _test_pos (
    id serial,
    pos geography(Point,4326)
);


WITH data AS
(
	SELECT generate_series(1, 20000)::float / 1000 AS x
)
INSERT INTO _test_pos(pos)
SELECT ST_MakePoint(x, x)::geography
FROM data;

comment:9 Changed 6 years ago by pramsey

Confirmed, that's a lovely little test that shows a pretty huge selectivity misestimate. No matter what the search radius, small or large, the selectivity estimate of rows returned is 1.

comment:10 Changed 6 years ago by mcayland

Is this still based upon the code I originally wrote? I've had some ideas about improving it for some time, but obviously I now don't have access to same variety of data sets that I once did. If the above is a fairly concrete test case then I'm happy to take a look.

comment:11 Changed 6 years ago by pramsey

Yes, it's still yours. It's a synthetic case, but it's not unreasonable.

comment:12 Changed 6 years ago by mcayland

Okay - if you assign it to me in trac, I'll consider that officially handing over the (olympic?) torch.

comment:13 Changed 6 years ago by strk

Owner: changed from pramsey to mcayland

How does the geography estimator differ from the geometry one ? Does the geometry estimator work fine in this case ?

comment:14 Changed 6 years ago by realityexists

The geometry one seems to have the same problem:

Seq Scan on _test_pos_geom  (cost=0.00..5517.00 rows=1 width=4) (actual time=3.342..14.989 rows=1415 loops=1)

comment:15 Changed 6 years ago by strk

How many points does the table have ?

mcayland: are you planning to look at this any soon ? Is it still the code originating in the 1.0.0 times ?

There should be a perl script to test it under utils.

comment:16 Changed 6 years ago by strk

So, it is 20000 rows. Estimate is 1 row if there's no index. The prepare.sql script creates no index. Is it intentional ? Our analyzer is not even called in that case.

comment:17 Changed 6 years ago by strk

rewind, I take it back. The estimator is called anyway. For geometry the query must be changed to :

SELECT count(id) FROM _test_pos 
WHERE geom && ST_Buffer('SRID=4326;POINT(7 7)', 1);

Here's some debugging from the selectivity estimator:

NOTICE:  [geometry_estimate.c:geometry_gist_sel_2d:653] geometry_gist_sel called
NOTICE:  [geometry_estimate.c:estimate_selectivity:446]  histogram has 126 cols, 127 rows
NOTICE:  [geometry_estimate.c:estimate_selectivity:447]  histogram geosize is 19.999001x19.999001
NOTICE:  [geometry_estimate.c:estimate_selectivity:591]  search_box overlaps 182.000000 cells
NOTICE:  [geometry_estimate.c:estimate_selectivity:592]  avg feat overlaps 1.000300 cells
NOTICE:  [geometry_estimate.c:estimate_selectivity:604]  SUM(ov_histo_cells)=0.096025
NOTICE:  [geometry_estimate.c:estimate_selectivity:605]  gain=0.999700
NOTICE:  [geometry_estimate.c:estimate_selectivity:606]  selectivity=0.095996
NOTICE:  [geometry_estimate.c:geometry_gist_sel_2d:754]  returning computed value: 0.095996

comment:18 Changed 6 years ago by strk

The estimate for the geometry version is actually pretty good. Estimated: 1920, actual: 2001.

comment:19 Changed 6 years ago by strk

And now for geography:

strk=# SELECT count(id) FROM _test_pos WHERE ST_DWithin(pos, 'POINT(7 7)', 300000);
NOTICE:  [geography_estimate.c:geography_gist_selectivity:464] geography_gist_selectivity called
NOTICE:  [geography_estimate.c:estimate_selectivity:163]  histogram has 12 unitsx, 34 unitsy, 36 unitsz
NOTICE:  [geography_estimate.c:estimate_selectivity:164]  histogram geosize is 0.116978x0.321376x0.342003
NOTICE:  [geography_estimate.c:estimate_selectivity:401]  search_box overlaps 1.000000 cells
NOTICE:  [geography_estimate.c:estimate_selectivity:402]  avg feat overlaps 1.000000 cells
NOTICE:  [geography_estimate.c:estimate_selectivity:414]  SUM(ov_histo_cells)=0.000000
NOTICE:  [geography_estimate.c:estimate_selectivity:415]  gain=1.000000
NOTICE:  [geography_estimate.c:estimate_selectivity:416]  selectivity=0.000000
NOTICE:  [geography_estimate.c:geography_gist_selectivity:572]  returning computed value: 0.000000
NOTICE:  [geography_estimate.c:geography_gist_selectivity:464] geography_gist_selectivity called
NOTICE:  [geography_estimate.c:geography_gist_selectivity:507]  no variable argument ? - returning default selectivity
 count 
-------
  3839
(1 row)

The "search_box overlaps 1 cells" is suspicius, isn't it ?

comment:20 Changed 6 years ago by strk

Just a final note, then I'll leave this to Mark: we want to make automated tests for estimation !!

comment:21 Changed 6 years ago by robe

Milestone: PostGIS 2.0.1PostGIS 2.0.2

Mark I'm moving to 2.0.2. Move back if you will get to it in the next day and its an easy fix.

comment:22 Changed 6 years ago by realityexists

It would be great to get this fixed soon, because we currently have to work around it by removing the RESTRICT and JOIN clauses for &&, which not only requires a manual hack, but may negatively impact performance in other cases.

comment:23 Changed 6 years ago by mcayland

One of the delays on this is that my dev environment is set up for 9.0 and so WITH... INSERT doesn't work. I've used the following as a substitute:

CREATE OR REPLACE FUNCTION create1828() RETURNS text AS $$

DECLARE

cnt int; x float;

BEGIN

FOR x IN SELECT generate_series(1, 20000)::float / 1000 LOOP

-- RAISE NOTICE 'x is %', x; INSERT INTO _test_pos(pos) SELECT ST_MakePoint(x, x)::geography;

END LOOP;

RETURN 'done';

END;

$$ LANGUAGE 'plpgsql';

Now I have the actual sample data, I'll have a poke around and see if I can figure out what is happening.

comment:24 Changed 6 years ago by mcayland

Okay I've finally figured out what's going on here. The key to solving this was to look at the debug output for the geography plan:

postgis20=# explain analyze SELECT id FROM _test_pos WHERE ST_DWithin(pos, 'POINT(7 7)', 300000);
NOTICE:  [geography_estimate.c:geography_gist_selectivity:464] geography_gist_selectivity called
NOTICE:  [geography_estimate.c:geography_gist_selectivity:529]  requested search box is : 0.985147863137998 0.120960947799834 0.121869343405147, 0.985147863137998 0.120960947799834 0.121869343405147
NOTICE:  [geography_estimate.c:geography_gist_selectivity:557]  16013 read from stats
NOTICE:  [geography_estimate.c:geography_gist_selectivity:559]  histo: xmin,ymin,zmin: 0.883022,0.000017,0.000017
NOTICE:  [geography_estimate.c:geography_gist_selectivity:560]  histo: xmax,ymax: 1.000000,0.321394,0.342020
NOTICE:  [geography_estimate.c:geography_gist_selectivity:561]  histo: unitsx: 12.000000
NOTICE:  [geography_estimate.c:geography_gist_selectivity:562]  histo: unitsy: 34.000000
NOTICE:  [geography_estimate.c:geography_gist_selectivity:563]  histo: unitsz: 36.000000
NOTICE:  [geography_estimate.c:geography_gist_selectivity:564]  histo: avgFeatureCoverage: 0.000000
NOTICE:  [geography_estimate.c:geography_gist_selectivity:565]  histo: avgFeatureCells: 1.000000
NOTICE:  [geography_estimate.c:estimate_selectivity:163]  histogram has 12 unitsx, 34 unitsy, 36 unitsz
NOTICE:  [geography_estimate.c:estimate_selectivity:164]  histogram geosize is 0.116978x0.321376x0.342003
NOTICE:  [geography_estimate.c:estimate_selectivity:348]  [10,12,12] cell val 0.026949999853969
NOTICE:  [geography_estimate.c:estimate_selectivity:350]  [10,12,12] AOI 0.000000000000000
NOTICE:  [geography_estimate.c:estimate_selectivity:352]  [10,12,12] gain 0.000000000000000
NOTICE:  [geography_estimate.c:estimate_selectivity:357]  [10,12,12] adding 0.000000000000000 to value
NOTICE:  [geography_estimate.c:estimate_selectivity:401]  search_box overlaps 1.000000 cells
NOTICE:  [geography_estimate.c:estimate_selectivity:402]  avg feat overlaps 1.000000 cells
NOTICE:  [geography_estimate.c:estimate_selectivity:414]  SUM(ov_histo_cells)=0.000000
NOTICE:  [geography_estimate.c:estimate_selectivity:415]  gain=1.000000
NOTICE:  [geography_estimate.c:estimate_selectivity:416]  selectivity=0.000000
NOTICE:  [geography_estimate.c:geography_gist_selectivity:572]  returning computed value: 0.000000
NOTICE:  [geography_estimate.c:geography_gist_selectivity:464] geography_gist_selectivity called
NOTICE:  [geography_estimate.c:geography_gist_selectivity:507]  no variable argument ? - returning default selectivity
                                                                                                                                                          QUERY PLAN                                                             
                                                                                              
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------
 Seq Scan on _test_pos  (cost=0.00..5578.00 rows=1 width=4) (actual time=17.973..32.506 rows=3839 loops=1)
   Filter: ((pos && '0101000020E61000000000000000001C400000000000001C40'::geography) AND ('0101000020E61000000000000000001C400000000000001C40'::geography && _st_expand(pos, 300000::double precision)) AND _st_dwithin(pos, '010
1000020E61000000000000000001C400000000000001C40'::geography, 300000::double precision, true))
 Total runtime: 32.734 ms
(3 rows)

postgis20=# select st_astext('0101000020E61000000000000000001C400000000000001C40'::geography);
 st_astext  
------------
 POINT(7 7)
(1 row)

The problem here is that the original query point is being passed directly into the first part of the filter rather than the *expanded* version - hence why the selectivity is always tiny when the distance is large.

Poking around in geography_expand() I see this comment:

/*
** geography_expand(GSERIALIZED *g) returns *GSERIALIZED
**
** warning, this tricky little function does not expand the
** geometry at all, just re-writes bounding box value to be
** a bit bigger. only useful when passing the result along to
** an index operator (&&)
*/

So our ST_Expand(geog, distance) is returning the same original geography but with an inflated bounding box. BUT in geography_gist_selectivity() we do this:

	/* Convert coordinates to 3D geodesic */
	FLAGS_SET_GEODETIC(search_box.flags, 1);
	if (!lwgeom_calculate_gbox_geodetic(geometry, &search_box))
	{
		POSTGIS_DEBUG(3, " search box cannot be calculated");

		PG_RETURN_FLOAT8(0.0);
	}

So what happens is that when we generate our search box, lwgeom_calculate_gbox_geodetic() iterates through all of the pointarrays within the geography to calculate the bounding box for search - but because geography_expand() didn't alter the geography (just its bounding box) we therefore end up back with our unexpanded input geography once again.

I think we have two options here:

1) Change geography_gist_selectivity() to directly pull out the BBOX within the geography instead of recalculating it

2) Write a proper ST_Expand(geography) function

I'm inclined to lean towards 2) because with 1) in place, the query plan will still show the original geography being passed into the && clause which is the confusion that got us here in the first place.

comment:25 Changed 6 years ago by strk

Are you saying that ST_AsText(ST_Expand('POINT(0 0)'::geography)) would return 'POINT(0 0)' ? If that's the case I'd consider it a bad bug indeed.

That said, wouldn't it be faster just to always use the cached bbox rather than recomputing it ? Or is there any special need to recompute for geography ? Or is it the old double vs. float story ?

comment:26 Changed 6 years ago by mcayland

And just to confirm this:

postgis20=# select _st_expand('0101000020E61000000000000000001C400000000000001C40'::geography, 8000);
                     _st_expand                     
----------------------------------------------------
 0101000020E61000000000000000001C400000000000001C40
(1 row)

I guess that's why there is no official ST_Expand(geography, distance) function. Paul - I think we need some input from you on this one.

comment:27 Changed 6 years ago by strk

Ok, if it's an "internal" one I've no problem with what it does (although it does sound dangerous) -- Yep, Paul's word is needed to know how to proceed (I think using the cache would be good)

comment:28 Changed 6 years ago by robe

strk I recall complaining about this very thing to pramsey. I thought the issue was that there is no way to represent the bounding box for geography as a geography at least that was the case in 1.5) so you really couldn't output it without doing a lengthy computation. That may have changed though with our new underlying format.

What cached bbox are you talking about? certainly not the one for POINT(0 0 ) since that would be wrong.

comment:29 Changed 6 years ago by strk

I'm talking about the cached _expanded_ box.

comment:30 Changed 6 years ago by robe

here is the reference to the ticket where pramsey explains: #894

Changed 6 years ago by pramsey

Attachment: boxcalc.patch added

Pick the box out of the serialization rather than calculating it from scratch.

comment:31 Changed 6 years ago by pramsey

OK, my "fix" is attached. It fixes the problem Mark identified, which is that we recalculate the bounding box from scratch during the statistics check, but it doesn't seem to fix the actual selectivity issue yet.

Changed 6 years ago by pramsey

Attachment: boxcalc2.patch added

Cleaner patch to read the cached box.

comment:32 Changed 6 years ago by pramsey

The patch does fix things, in that the query box now has a non-zero extent... however the estimate is still way off.

postgis20=# explain analyze SELECT id FROM _test_pos WHERE ST_DWithin(pos, 'POINT(7 7)'::geography, 300000);
NOTICE:   requested search box is : 0.938059508800507 0.0738726407289505 0.0747810378670692, 1.03223621845245 0.168049246072769 0.168957650661469
                                                                                                                                                          QUERY PLAN                                                                                                                                                           
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on _test_pos  (cost=0.00..5578.00 rows=1 width=4) (actual time=17.239..67.544 rows=3839 loops=1)
   Filter: ((pos && '0101000020E61000000000000000001C400000000000001C40'::geography) AND ('0101000020E61000000000000000001C400000000000001C40'::geography && _st_expand(pos, 300000::double precision)) AND _st_dwithin(pos, '0101000020E61000000000000000001C400000000000001C40'::geography, 300000::double precision, true))

Changed 6 years ago by pramsey

Attachment: boxcalc3.patch added

Also pull the cached box in the stats calculation stage

comment:33 Changed 6 years ago by pramsey

With my patches in place, I'd say there's something else going on...

I make a call to run the function...

postgis20=# explain analyze SELECT id FROM _test_pos WHERE ST_DWithin(pos, 'POINT(7 7)'::geography, 300000);

geography_gist_selectivity is called, great. It pulls the geog stats, runs the calculations, and returns a DECENT SELECTIVITY NUMBER, 0.281997. But then, look at the very bottom, it's called a SECOND TIME, and this time with arguments that make it return the DEFAULT selectivity.

NOTICE:  [geography_estimate.c:geography_gist_selectivity:495] geography_gist_selectivity called
NOTICE:   requested search box is : 0.938059508800507 0.0738726407289505 0.0747810378670692, 1.03223621845245 0.168049246072769 0.168957650661469
NOTICE:  [geography_estimate.c:geography_gist_selectivity:563]  requested search box is : 0.938059508800507 0.0738726407289505 0.0747810378670692, 1.03223621845245 0.168049246072769 0.168957650661469
NOTICE:  [geography_estimate.c:geography_gist_selectivity:591]  16013 read from stats
NOTICE:  [geography_estimate.c:geography_gist_selectivity:593]  histo: xmin,ymin,zmin: 0.883022,0.000017,0.000017
NOTICE:  [geography_estimate.c:geography_gist_selectivity:594]  histo: xmax,ymax: 1.000000,0.321394,0.342020
NOTICE:  [geography_estimate.c:geography_gist_selectivity:595]  histo: unitsx: 12.000000
NOTICE:  [geography_estimate.c:geography_gist_selectivity:596]  histo: unitsy: 34.000000
NOTICE:  [geography_estimate.c:geography_gist_selectivity:597]  histo: unitsz: 36.000000
NOTICE:  [geography_estimate.c:geography_gist_selectivity:598]  histo: avgFeatureCoverage: 0.000000
NOTICE:  [geography_estimate.c:geography_gist_selectivity:599]  histo: avgFeatureCells: 1.000000
NOTICE:  
NOTICE:  [geography_estimate.c:estimate_selectivity:196]  histogram has 12 unitsx, 34 unitsy, 36 unitsz
NOTICE:  [geography_estimate.c:estimate_selectivity:197]  histogram geosize is 0.116978x0.321376x0.342003
NOTICE:  [geography_estimate.c:estimate_selectivity:381]  [5,7,7] cell val 0.000000000000000
NOTICE:  [geography_estimate.c:estimate_selectivity:383]  [5,7,7] AOI 0.000000007526458
NOTICE:  [geography_estimate.c:estimate_selectivity:385]  [5,7,7] gain 0.009150018534462
.........
NOTICE:  [geography_estimate.c:estimate_selectivity:381]  [11,17,17] cell val 0.000000000000000
NOTICE:  [geography_estimate.c:estimate_selectivity:383]  [11,17,17] AOI 0.000000532528078
NOTICE:  [geography_estimate.c:estimate_selectivity:385]  [11,17,17] gain 0.647401718907906
NOTICE:  [geography_estimate.c:estimate_selectivity:390]  [11,17,17] adding 0.000000000000000 to value
NOTICE:  [geography_estimate.c:estimate_selectivity:434]  search_box overlaps 847.000000 cells
NOTICE:  [geography_estimate.c:estimate_selectivity:435]  avg feat overlaps 1.000000 cells
NOTICE:  [geography_estimate.c:estimate_selectivity:447]  SUM(ov_histo_cells)=0.281997
NOTICE:  [geography_estimate.c:estimate_selectivity:448]  gain=1.000000
NOTICE:  [geography_estimate.c:estimate_selectivity:449]  selectivity=0.281997
NOTICE:  [geography_estimate.c:geography_gist_selectivity:609]  returning computed value: 0.281997
NOTICE:  [geography_estimate.c:geography_gist_selectivity:495] geography_gist_selectivity called
NOTICE:  [geography_estimate.c:geography_gist_selectivity:538]  no variable argument ? - returning default selectivity

                                                                                                                                                          QUERY PLAN                                                                                                                                                           
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on _test_pos  (cost=0.00..5578.00 rows=1 width=4) (actual time=16.487..66.990 rows=3839 loops=1)
   Filter: ((pos && '0101000020E61000000000000000001C400000000000001C40'::geography) AND ('0101000020E61000000000000000001C400000000000001C40'::geography && _st_expand(pos, 300000::double precision)) AND _st_dwithin(pos, '0101000020E61000000000000000001C400000000000001C40'::geography, 300000::double precision, true))
 Total runtime: 67.736 ms
(3 rows)

Our double-sided expand trick is not doing what we want, it's not picking the good selectivity and discarding the other, it's picking the other.

comment:34 Changed 6 years ago by pramsey

And so if you change the estimate for <const> && <const> to 1.0 (geography_estimate.c:540), you get the correct estimate on our overall DWithin query.

	/*
	 * We are working on two constants..
	 * TODO: check if expression is true,
	 *       returned set would be either
	 *       the whole or none.
	 */
	if ( ! IsA(self, Var) )
	{
		POSTGIS_DEBUG(3, " no variable argument ? - returning default selectivity");

		PG_RETURN_FLOAT8(1.0);
	}

comment:35 Changed 6 years ago by mcayland

Right - the reason there are 2 calls to the function are because there are 2 && operators in the rewritten query. The first is <col> && <const> (which we can at least attempt an estimate) and the second is <const> && expand(<col>) which we can't do much about because the function is effectively a black box to the planner, hence we bail out returning the default estimate.

I'm thinking that the reason we are underestimating is because in 3 dimensions we intersect a larger number of potential histogram boxes within 3D space, but they are mostly empty due to the curvature of the spheroid which drags avg_feat_cells unusually low. This will need some testing to verify this though. Therefore for geography I think we either need to determine how to alter the mathematics to handle this, or possibly even consider converting the inputs to 2D and using something similar to the existing geometry code.

comment:36 Changed 6 years ago by pramsey

The problem is that for <const> && <const> we're returning DEFAULT_SELECTIVITY, which is very high. So the <const> && <const> term is washing out the other one, the one we care about. As the comment in the code block notes, the <const> && <const> case should return either 0 or 1, depending on the actual evaluation of the statement, but not something in between. I'm wondering if returning 1 might actually be the path of "less harm" in a temporary hack, rather than returning DEFAULT_SELECTIVITY.

comment:37 Changed 6 years ago by strk

shouldn't the histogram be a cube for 3d ops ?

another thought is: could be make expand() inlined to allow for better estimation ?

comment:38 Changed 6 years ago by mcayland

Bear in mind that the comment block is misleading - we're not bailing out because we have <const> && <const> because there is no such clause in the generated SQL. If you look closer at the code beforehand, the reason we're dropping into that section is because we have something that's not <var> && <const> or <const> && <var>, which in this case is caused by <const> && <func> created by the _st_expand(pos, 300000) clause.

The only reason this works for your case above is because it just so happens that 300000 covers the entire histogram area and so a selectivity of 1.0 is an appropriate value. If you change this to, say 300, then the selectivity will become inaccurate once again.

I wonder if we can create a slightly modified _st_expand() function with different prototypes for different arguments, e.g.

_st_expand(const, distance) -> returns the geography with expanded box as it (should) do at the moment

_st_expand(col, distance) -> uses the histogram stats to return the bounding box for the given column and then expand it by distance

If we mark this second function as IMMUTABLE then it should get folded into a constant at plan time, and then we can use this to set search_box by detecting this <const> && <const> condition which might stand a chance of returning something slightly meaningful.

comment:39 Changed 6 years ago by strk

can't we return "dunno" from the estimator, for a <const> && <const> so not to be considered by the planner ?

comment:40 Changed 6 years ago by strk

sorry, for the <const> && <func> (worth an update of that comment and the addition of another block for the <const> && <func> with a more meaningful TODO paragraph)

comment:41 Changed 6 years ago by pramsey

I don't think your second _st_expand(col, distance) example makes sense, since the expanded version is not necessarily the extent of the whole column, it's the extent of the things that pass the other filters in the query.

comment:42 Changed 6 years ago by mcayland

That's not quite true: if you have a query of the form <foo> && <bar> AND <bat> && <baz> then both pairs <foo> && <bar> and <bat> && <baz> will be passed to the selectivity function by the planner in order to find out which one is the cheapest, i.e. the one with the lowest cost will be executed first and then the other clauses will be applied as subsequent AND filters to the resulting dataset. In other words, the selectivity itself specifies the execution order. This why hardcoding it as a low constant for <foo> && <bar> in old versions of PostGIS forced the GiST index to be used all of the time because the default value was so small compared to any of the inbuilt estimates.

Hence if the result of <const> && _st_expand(pos, 300000) returns a lower value then <const> && <var> then it should be chosen first in preference to the <const> && <var> clause. Hardcoding the latter to 1.0 is effectively telling the planner that this second clause is likely to return the entire table, and hence effectively forces <const> && <var> to look the most favourable clause to execute first. So thinking about this particular case again, I think you are right in that you could easily argue that for an ST_DWithin() query <const> && <var> is always going to be better (smaller) than <const> && <expanded var>, but bear in mind changing this is going to have an unknown knock-on effect for existing applications using && within custom queries as _st_expand() isn't always the target function.

Upon reflection, perhaps the least invasive fix in terms of application performance is to teach the geography estimater specifically about <const> && _st_expand() using an oid function lookup, and hence return a selectivity that is slightly higher (and hence less attractive) than the plain old <const> && <var> selectivity.

comment:43 Changed 6 years ago by pramsey

I don't think this problem is exclusively about geography, the same code can be found in the geometry estimate section, and there was a demonstration higher in the ticket that it affects geometry too.

comment:44 Changed 6 years ago by strk

I guess you're right about it affecting geometry as well, but couldn't find the demonstration (which comment?).

Teaching _st_expand() to the estimator sounds like an interesting project.

Any chance to reuse more code between geometry and geography, to do less work ? Also, I think it's _very_ important to add some form of regression testing for this.

comment:45 Changed 6 years ago by strk

I'm thinking: do we really need to have both VAR && Expand(QUERY) _and_ QUERY && Expand(VAR) ? Are we sure that's conceptually correct to do for ST_DWithin ?

comment:46 Changed 6 years ago by realityexists

Any update on this? Is it still planned for 2.0.2?

comment:47 Changed 6 years ago by robe

pramsey and this one. Sounds like realityexists really wants this and he's been such a good tester that we should reward him by fixing this or giving him a metal.

comment:48 Changed 6 years ago by pramsey

OK, back looking at this *particular* problem, avoiding thinking about handling <func>, just trying to get the selectivity in this case working better...

First question: can we just avoid returning an answer when we are in a "don't know" situation. Answer: No. The caller is expecting a valid number, and will error out if we don't deliver (plancat.c:1033):

	if (result < 0.0 || result > 1.0)
		elog(ERROR, "invalid restriction selectivity: %f", result);

Second question: surely the PgSQL planner has to deal with unknowns! What do they do? Answer: Well, there's an example one caller up from plancat.c, in (clausesel.c:683)

	else if (is_funcclause(clause))
	{
		/*
		 * This is not an operator, so we guess at the selectivity. THIS IS A
		 * HACK TO GET V4 OUT THE DOOR.  FUNCS SHOULD BE ABLE TO HAVE
		 * SELECTIVITIES THEMSELVES.	   -- JMH 7/9/92
		 */
		s1 = (Selectivity) 0.3333333;
	}

So, they return a mid-range value when they don't know, presumably to let other extreme values one way or the other weigh in first.

comment:49 Changed 6 years ago by pramsey

Resolution: fixed
Status: newclosed

OK, going for the simple patch at r10718 on 2.0 and r10719 on trunk. Regression tests will have to go on trunk, it's going to involve some new SQL functions (#2101) so we can call into our selectivity calculations from SQL.

comment:50 Changed 6 years ago by strk

What about a simple EXPLAIN ANALYZE based test ? I think we can force the analyzer to get the whole dataset in the histogram so that the expected number of results is predictable (rather than based on a smaller sample)

comment:51 Changed 6 years ago by strk

I love the comment, btw :)

comment:52 Changed 6 years ago by realityexists

Resolution: fixed
Status: closedreopened

Tried r10728 on Linux (32-bit), but it seems to be completely broken: whenever VACUUM runs (manually or automatically) PostgreSQL crashes. I don't have a repro ready for this, but attaching the backtrace and debug log, hopefully that helps.

Changed 6 years ago by realityexists

Attachment: backtrace-r10728.txt added

Changed 6 years ago by realityexists

Attachment: log-r10728.txt added

comment:53 Changed 6 years ago by pramsey

So, sorry, doing too many things at once... fixed in trunk at r10729 and 2.0 at r10730

Changed 6 years ago by realityexists

Attachment: backtrace-r10729.txt added

Crash as of r10729

comment:54 Changed 6 years ago by realityexists

Sorry, it still crashes, though at a different point in my code (and in yours)

comment:55 Changed 6 years ago by pramsey

Can you give me a test table that does it? My little ones are working. Sorry about this!

comment:56 Changed 6 years ago by realityexists

Here you go:

CREATE TABLE _test_point AS
SELECT 'POINT(180 50)'::geography AS pos;

ANALYZE _test_point;

comment:57 Changed 6 years ago by pramsey

I made a lot of careless mistakes in that sequence, but the final problem ended up being quite subtle. r10731 I didn't fix it, I just reverted it. It's tricky and sensitive. The geodetic bounding box of a point, when converted to float and stored in the gserialized is slightly larger than a point. For a one-entry table, that means the histogram is not just a single box, but something bigger. Math explodes, unpleasantness ensues.

comment:58 Changed 6 years ago by realityexists

OK, tested on r10731 on 32-bit Linux. It doesn't crash any more - good start! A simplified version of my original query is now twice as fast with the filtering function inlined compared to not inlined. The real, original query is still about 20% slower with inlining than without. That's a big improvement on before (when it didn't return at all), but I guess it's still largely a matter of luck.

I can't compare the absolute times before and after, because my "main" Postgres installation (on Windows x64) is still at r10577. I'm holding off upgrading that until the "180/-180" issue in #2066 is fixed.

comment:59 Changed 6 years ago by robe

Owner: changed from mcayland to pramsey
Status: reopenednew

comment:60 Changed 6 years ago by pramsey

Resolution: fixed
Status: newclosed

I'm closing this out, I think the fix is as good as we're going to get on stable branch. Doing a function-based analyze I leave as an exercise to brighter folks.

Note: See TracTickets for help on using tickets.