Opened 8 years ago
Closed 7 years ago
#3753 closed defect (fixed)
Gist penalty misbehaves if points are inserted in gridded ordered fashion
Reported by: | komzpa | Owned by: | pramsey |
---|---|---|---|
Priority: | blocker | Milestone: | PostGIS 2.4.0 |
Component: | postgis | Version: | 2.3.x |
Keywords: | Cc: |
Description
If data can be perceived as strips, gist penalty will always be zero, since leaves will be (width x height) = 0 x H = 0. For 2D data in ND index it will always be 0.
There is patch by x4m for cubes that solves this problem by returning expansion of lesser dimension (length instead of volume etc.) with special mark:
https://github.com/x4m/postgres_g/commit/76972490adb756e02df07f28723951f269729fcc
Change History (17)
comment:1 by , 8 years ago
comment:2 by , 8 years ago
Owner: | changed from | to
---|
I think I had the same issue when doing a ubuntu upgrade recently but forget how I got around the gdal issue.
Anyway as you said it's a gdal install issue nothing to do with your patch. I'll take a look at your patch and commit if it passes my 9.4-9.6 tests.
comment:3 by , 8 years ago
Milestone: | PostGIS 2.3.3 → PostGIS 2.4.0 |
---|
I'll commit in a bit. Pushing this to 2.4 for now since it's my understanding it's a performance boost and not a bug. May consider backporting later after I've done some tests that Andrey Borodin suggested. I'll keep this open until I've done some of those tests.
comment:5 by , 8 years ago
For completeness note from Andrey Borodin on pull request:
@robe2 1.5x and 3x was an improvement with old Guttman split algorithm. Since PostGIS is using one of the most advanced split algorithms, that split algorithm may be fixing data issues being solved by this patch. If you can benchmark I'd recommend something like this create extension if not exists cube; begin transaction; SELECT setseed(.43); create table dataTable as select cube(array[x/1000,y/1000]) c from generate_series(1,1e3,1)x, generate_series(1,1e3,1) y; \timing create index idx on dataTable using gist(c); select q,(select count(*) from dataTable dt where dt.c<@q) from (select cube(array[x,y],array[x+0.1,y+0.1]) q from (select random() x,random() y from generate_series(1,1e4,1) s0) s1) s2 ; select pg_size_pretty(pg_relation_size('idx')); drop table datatable; Replace cube with appropriate PostGIS type. Here we create grid of 1M points and do 10K searches in this grid. If search time is not reduced significantly, may be this trick is not usefull.
comment:6 by , 8 years ago
PostGIS equivalent from Darafei Praliaskouski
create extension if not exists postgis; begin transaction; SELECT setseed(.43); create table dataTable as select ST_MakePoint(x/1000,y/1000) c from generate_series(1,1e3,1)x, generate_series(1,1e3,1) y; \timing on create index idx on dataTable using gist(c); select q,(select count(*) from dataTable dt where dt.c && q) from (select ST_Expand(ST_MakePoint(x,y), 0.1) q from (select random() x, random() y from generate_series(1,1e4,1) s0) s1) s2 ; select pg_size_pretty(pg_relation_size('idx')); drop table datatable;
comment:7 by , 8 years ago
I put this note on pull request as well. I'll keep this open, since we may need to roll the change back:
This trick doesn't seem useful based on my cursory tests. The index building took longer, size of index is the same, and no gain in performance.
I tested PostGIS 2.3.2 on 9.6 64-bit window 7 and got these timings using @Komzpa tests and then repeated using PostGIS 2.4.0dev that has this patch. The index size are the same and don't see much of a difference in speed, but 2.4 seems slightly worse. Hard to tell absolutely with random data and I won't rule out cause is other changes in 2.4 code base, though I don't think we've touched the gist section except for this.
Time 1: index build Time 2: query Time 3: index size check Time 4: drop table PostGIS 2.3 test results: First run: Time: 14299.226 ms Time: 63453.484 ms Time: 12.544 ms Time: 4.672 ms Second run: Time: 12983.263 ms Time: 62485.045 ms Time: 0.454 ms Time: 7.881 ms Third run: Time: 13205.724 ms Time: 63193.318 ms Time: 3.722 ms Time: 8.986 ms Time: 62.018 ms ` PostGIS 2.4 results: First run: Time: 18914.431 ms Time: 64785.811 ms Time: 7.872 ms Time: 7.698 ms Second run: Time: 14776.264 ms Time: 64393.529 ms Time: 7.124 ms Time: 6.232 ms Third run: Time: 14909.494 ms Time: 64252.608 ms Time: 11.676 ms Time: 6.020 ms
@Komzpa did you get a chance to test this?
comment:8 by , 8 years ago
Guess we should do the same for nd version to see if that is different:
create extension if not exists postgis; begin transaction; SELECT setseed(.43); create table dataTable as select ST_MakePoint(x/1000,y/1000, z/1000) c from generate_series(1,1e1,1)x, generate_series(1,1e1,1) y, generate_series(1,1e1,2) z; \timing on create index idx on dataTable using gist(c gist_geometry_ops_nd); select q,(select count(*) from dataTable dt where dt.c &&& q) from (select ST_Expand(ST_MakePoint(x,y,z), 0.1) q from (select random() x, random() y , random() z from generate_series(1,1e4,1) s0) s1) s2 ; select pg_size_pretty(pg_relation_size('idx')); \timing off drop table datatable; commit;
comment:9 by , 8 years ago
Hi!
I've done the way of installing PostGIS My test machine: Ubuntu 16.04 under Hyper-V PostgreSQL 10devel at HEAD around Feb 2017 I've been testing PostGIS rev 15391 and 15392 (before and after). I confirm that previous case test do not have statistically significant difference. One more important observation is that performance is slightly increased if we remove lines
if (size_orig > 0) { *result = pack_float(size_orig, 1); /* REALM 1 */ } else
This test with less general case
create extension if not exists postgis; create table dataTable as select ST_MakePoint(x/1000,y/1000, 1) c from generate_series(1,1e3,1)x, generate_series(1,1e3,1) y; \timing on create index idx on dataTable using gist(c gist_geometry_ops_nd); begin transaction; SELECT setseed(.43); explain analyze select q,(select count(*) from dataTable dt where dt.c &&& q) from (select ST_Expand(ST_MakePoint(x,y,z), 0.1) q from (select random() x, random() y , 0.95 z from generate_series(1,1e3,1) s0) s1) s2 ; commit transaction; select pg_size_pretty(pg_relation_size('idx')); \set timing off drop table datatable;
sees performance improvement. Before patch selection takes 300 seconds (on my machine) after patch selection takes 37 seconds.
In this test we have 2D data in 3D. This scenario is quite rare for 2d version: all points have to be on the same grid aligned line.
I think in cube I was observing performance improvement attributed to data features, which are already exploited here by Korotkov split. So this patch only fixes degenerate cases of 2D data in ND.
In March 2017 on PgConf.Russia @Komzpa showed me performance problems with data inserted into GiST in-order by stips. This patch should have fixed that, but, actually, as for now, I do not observe that performance problems in reproducible environment.
comment:10 by , 8 years ago
@Komzpa pinged me with a new test with integer grid for 2D
begin transaction; SELECT setseed(.43); create table dataTable as select ST_MakePoint(x,y) c from generate_series(1,1e3,1)x, generate_series(1,1e3,1) y; \timing on create index idx on dataTable using gist(c); select q,(select count(*) from dataTable dt where dt.c && q) from (select ST_Expand(ST_MakePoint(x,y), 100) q from (select random()*1000 x, random()*1000 y from generate_series(1,3e3,1) s0) s1) s2 ; select pg_size_pretty(pg_relation_size('idx')); drop table datatable;
Without patch I observe index with 43MB size and 109.2 seconds for select. With patch index size is 56MB and select time 68.2 seconds. Not a radical performance improvement, but still viable.
I still suggest dropping lines
if (size_orig > 0) { *result = pack_float(size_orig, 1); /* REALM 1 */ } else
They were part of original patch, but in PostGIS I observe several percents of performance improvement without them. If necessary, I can make more precise statistical estimates.
comment:11 by , 8 years ago
I've made two visualizations of line-shaped leaves in index, attached to github PR:
https://github.com/postgis/postgis/pull/129#issuecomment-301413899
To make the visualization yourself, you can follow this instruction: https://gist.github.com/Komzpa/f9e1c241d1d2a27cf5a4f985f646f8a6
comment:12 by , 8 years ago
Well, seems that trick works for certain cases. Can I ask to mention me in header comments? If the patch will be accepted, I'll put it on my CV
comment:13 by , 7 years ago
Owner: | changed from | to
---|
follow-up: 16 comment:14 by , 7 years ago
So I think all that is left before we can close this ticket is to perhaps drop the lines you referred to and some retesting to confirm it improves things.
comment:15 by , 7 years ago
Priority: | medium → blocker |
---|
comment:16 by , 7 years ago
Sorry, I was a bit busy these days.
Replying to robe:
So I think all that is left before we can close this ticket is to perhaps drop the lines you referred to and some retesting to confirm it improves things.
Previously I observed following
Before patch selection takes 300 seconds (on my machine) after patch selection takes 37 seconds… 2D data in 3D
integer grid for 2D … index with 43MB size and 109.2 seconds for select. With patch index size is 56MB and select time 68.2 seconds
This constitute statistically significant performance improvement. Do you want to restart these tests again?
As to
drop the lines you referred to
In May, we have made tests with @Komzpa and found no improvement in dropping the lines. Sorry for not putting that information to the discussion, we were pinging pictures and gevel images to spot the difference(induced by dropping the lines) , but found none.
From my point of view, current state is:
- Best we can achieve by penalty function tuning
- Contains sufficient improvement over revision 15391
comment:17 by , 7 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
Okay I'll close this out then as done thanks for getting back to me.
Regina
I've attempted to address it in https://github.com/postgis/postgis/pull/129 - for some reason Travis fails to build Postgis (debian dependency problem).