Opened 8 years ago

Closed 7 years ago

#3753 closed defect (fixed)

Gist penalty misbehaves if points are inserted in gridded ordered fashion

Reported by: komzpa Owned by: pramsey
Priority: blocker Milestone: PostGIS 2.4.0
Component: postgis Version: 2.3.x
Keywords: Cc:

Description

If data can be perceived as strips, gist penalty will always be zero, since leaves will be (width x height) = 0 x H = 0. For 2D data in ND index it will always be 0.

There is patch by x4m for cubes that solves this problem by returning expansion of lesser dimension (length instead of volume etc.) with special mark:

https://github.com/x4m/postgres_g/commit/76972490adb756e02df07f28723951f269729fcc

Change History (17)

comment:1 by komzpa, 8 years ago

I've attempted to address it in https://github.com/postgis/postgis/pull/129 - for some reason Travis fails to build Postgis (debian dependency problem).

comment:2 by robe, 8 years ago

Owner: changed from pramsey to robe

I think I had the same issue when doing a ubuntu upgrade recently but forget how I got around the gdal issue.

Anyway as you said it's a gdal install issue nothing to do with your patch. I'll take a look at your patch and commit if it passes my 9.4-9.6 tests.

comment:3 by robe, 8 years ago

Milestone: PostGIS 2.3.3PostGIS 2.4.0

I'll commit in a bit. Pushing this to 2.4 for now since it's my understanding it's a performance boost and not a bug. May consider backporting later after I've done some tests that Andrey Borodin suggested. I'll keep this open until I've done some of those tests.

comment:4 by robe, 8 years ago

In 15392:

Gist penalty speed improvement for 2d and nd points
References #3753 for PostGIS 2.4.0

comment:5 by robe, 8 years ago

For completeness note from Andrey Borodin on pull request:

@robe2 1.5x and 3x was an improvement with old Guttman split algorithm. Since PostGIS is using one of the most advanced split algorithms, that split algorithm may be fixing data issues being solved by this patch.
If you can benchmark I'd recommend something like this

create extension if not exists cube;

begin transaction;
SELECT setseed(.43);

create table dataTable as select cube(array[x/1000,y/1000]) c from generate_series(1,1e3,1)x, generate_series(1,1e3,1) y;

\timing
create index idx on dataTable using gist(c);

select q,(select count(*) from dataTable dt where dt.c<@q) from (select cube(array[x,y],array[x+0.1,y+0.1]) q from (select random() x,random() y from generate_series(1,1e4,1) s0) s1) s2 ;

select pg_size_pretty(pg_relation_size('idx'));

drop table datatable;

Replace cube with appropriate PostGIS type. Here we create grid of 1M points and do 10K searches in this grid. If search time is not reduced significantly, may be this trick is not usefull.

comment:6 by robe, 8 years ago

PostGIS equivalent from Darafei Praliaskouski

create extension if not exists postgis;

begin transaction;
SELECT setseed(.43);

create table dataTable as select ST_MakePoint(x/1000,y/1000) c from generate_series(1,1e3,1)x, generate_series(1,1e3,1) y;

\timing on
create index idx on dataTable using gist(c);

select q,(select count(*) from dataTable dt where dt.c && q) from (select ST_Expand(ST_MakePoint(x,y), 0.1) q from (select random() x, random() y from generate_series(1,1e4,1) s0) s1) s2 ;

select pg_size_pretty(pg_relation_size('idx'));

drop table datatable;

comment:7 by robe, 8 years ago

I put this note on pull request as well. I'll keep this open, since we may need to roll the change back:

This trick doesn't seem useful based on my cursory tests. The index building took longer, size of index is the same, and no gain in performance.

I tested PostGIS 2.3.2 on 9.6 64-bit window 7 and got these timings using @Komzpa tests and then repeated using PostGIS 2.4.0dev that has this patch. The index size are the same and don't see much of a difference in speed, but 2.4 seems slightly worse. Hard to tell absolutely with random data and I won't rule out cause is other changes in 2.4 code base, though I don't think we've touched the gist section except for this.

Time 1: index build
Time 2: query
Time 3: index size check
Time 4: drop table

PostGIS 2.3 test results:

First run:

Time: 14299.226 ms
Time: 63453.484 ms
Time: 12.544 ms
Time: 4.672 ms

Second run:

Time: 12983.263 ms
Time: 62485.045 ms
Time: 0.454 ms
Time: 7.881 ms

Third run:

Time: 13205.724 ms
Time: 63193.318 ms
Time: 3.722 ms
Time: 8.986 ms
Time: 62.018 ms

`

PostGIS 2.4 results:


First run:

Time: 18914.431 ms
Time: 64785.811 ms
Time: 7.872 ms
Time: 7.698 ms

Second run:

Time: 14776.264 ms
Time: 64393.529 ms
Time: 7.124 ms
Time: 6.232 ms

Third run:

Time: 14909.494 ms
Time: 64252.608 ms
Time: 11.676 ms
Time: 6.020 ms

@Komzpa did you get a chance to test this?

comment:8 by robe, 8 years ago

Guess we should do the same for nd version to see if that is different:

create extension if not exists postgis;

begin transaction;
SELECT setseed(.43);

create table dataTable as select ST_MakePoint(x/1000,y/1000, z/1000) c from generate_series(1,1e2,1)x, generate_series(1,1e2,1) y, generate_series(1,100000,1000) z;

\timing on
create index idx on dataTable using gist(c gist_geometry_ops_nd);

select q,(select count(*) from dataTable dt where dt.c &&& q) from (select ST_Expand(ST_MakePoint(x,y,z), 0.1) q from (select random() x, random() y , random() z from generate_series(1,1e4,1) s0) s1) s2 ;

select pg_size_pretty(pg_relation_size('idx'));
\set timing off
drop table datatable;

I did only 2 runs for 3d, but not seeing an improvement

index size in both is 79 MB

PostGIS 2.3

first run:

   Time: 13461.306 ms
    Time: 24551.813 ms
    Time: 3.859 ms
    Time: 1.549 ms
    Time: 121.835 ms 
    
second run:

    Time: 13694.926 ms
    Time: 24591.686 ms
    Time: 4.425 ms
    Time: 1.634 ms
    Time: 117.875 ms
    
 
    
PostGIS 2.4

first run:

    Time: 15111.942 ms
    Time: 23063.534 ms
    Time: 3.645 ms
    Time: 8.082 ms
    Time: 104.651 ms

second run:

    Time: 14967.676 ms
    Time: 25793.583 ms
    Time: 4.122 ms
    Time: 2.091 ms
    Time: 107.363 ms
Last edited 8 years ago by robe (previous) (diff)

comment:9 by x4m, 8 years ago

Hi!

I've done the way of installing PostGIS :) My test machine: Ubuntu 16.04 under Hyper-V PostgreSQL 10devel at HEAD around Feb 2017 I've been testing PostGIS rev 15391 and 15392 (before and after). I confirm that previous case test do not have statistically significant difference. One more important observation is that performance is slightly increased if we remove lines

		if (size_orig > 0) 
		{
			*result = pack_float(size_orig, 1); /* REALM 1 */ 
		}
		else 

This test with less general case

create extension if not exists postgis;

create table dataTable as select ST_MakePoint(x/1000,y/1000, 1) c from generate_series(1,1e3,1)x, generate_series(1,1e3,1) y;

\timing on
create index idx on dataTable using gist(c gist_geometry_ops_nd);

begin transaction;
SELECT setseed(.43);
explain analyze select q,(select count(*) from dataTable dt where dt.c &&& q) from (select ST_Expand(ST_MakePoint(x,y,z), 0.1) q from (select random() x, random() y , 0.95 z from generate_series(1,1e3,1) s0) s1) s2 ;
commit transaction;

select pg_size_pretty(pg_relation_size('idx'));
\set timing off
drop table datatable;

sees performance improvement. Before patch selection takes 300 seconds (on my machine) after patch selection takes 37 seconds.

In this test we have 2D data in 3D. This scenario is quite rare for 2d version: all points have to be on the same grid aligned line.

I think in cube I was observing performance improvement attributed to data features, which are already exploited here by Korotkov split. So this patch only fixes degenerate cases of 2D data in ND.

In March 2017 on PgConf.Russia @Komzpa showed me performance problems with data inserted into GiST in-order by stips. This patch should have fixed that, but, actually, as for now, I do not observe that performance problems in reproducible environment.

comment:10 by x4m, 8 years ago

@Komzpa pinged me with a new test with integer grid for 2D

begin transaction;
SELECT setseed(.43);

create table dataTable as select ST_MakePoint(x,y) c from generate_series(1,1e3,1)x, generate_series(1,1e3,1) y;

\timing on
create index idx on dataTable using gist(c);

select q,(select count(*) from dataTable dt where dt.c && q) from (select ST_Expand(ST_MakePoint(x,y), 100) q from (select random()*1000 x, random()*1000 y from generate_series(1,3e3,1) s0) s1) s2 ;

select pg_size_pretty(pg_relation_size('idx'));

drop table datatable;

Without patch I observe index with 43MB size and 109.2 seconds for select. With patch index size is 56MB and select time 68.2 seconds. Not a radical performance improvement, but still viable.

I still suggest dropping lines

		if (size_orig > 0) 
		{
			*result = pack_float(size_orig, 1); /* REALM 1 */ 
		}
		else 

They were part of original patch, but in PostGIS I observe several percents of performance improvement without them. If necessary, I can make more precise statistical estimates.

comment:11 by komzpa, 8 years ago

I've made two visualizations of line-shaped leaves in index, attached to github PR:

https://github.com/postgis/postgis/pull/129#issuecomment-301413899

To make the visualization yourself, you can follow this instruction: https://gist.github.com/Komzpa/f9e1c241d1d2a27cf5a4f985f646f8a6

comment:12 by x4m, 8 years ago

Well, seems that trick works for certain cases. Can I ask to mention me in header comments? If the patch will be accepted, I'll put it on my CV :)

comment:13 by pramsey, 7 years ago

Owner: changed from robe to pramsey

comment:14 by robe, 7 years ago

So I think all that is left before we can close this ticket is to perhaps drop the lines you referred to and some retesting to confirm it improves things.

comment:15 by robe, 7 years ago

Priority: mediumblocker

in reply to:  14 comment:16 by x4m, 7 years ago

Sorry, I was a bit busy these days.

Replying to robe:

So I think all that is left before we can close this ticket is to perhaps drop the lines you referred to and some retesting to confirm it improves things.

Previously I observed following

Before patch selection takes 300 seconds (on my machine) after patch selection takes 37 seconds… 2D data in 3D

integer grid for 2D … index with 43MB size and 109.2 seconds for select. With patch index size is 56MB and select time 68.2 seconds

This constitute statistically significant performance improvement. Do you want to restart these tests again?

As to

drop the lines you referred to

In May, we have made tests with @Komzpa and found no improvement in dropping the lines. Sorry for not putting that information to the discussion, we were pinging pictures and gevel images to spot the difference(induced by dropping the lines) , but found none.

From my point of view, current state is:

  1. Best we can achieve by penalty function tuning
  2. Contains sufficient improvement over revision 15391

comment:17 by robe, 7 years ago

Resolution: fixed
Status: newclosed

Okay I'll close this out then as done thanks for getting back to me.

Regina

Note: See TracTickets for help on using tickets.