#3864 closed defect (fixed)
Sorting by geometry is slower than sorting by geohash
Reported by: | komzpa | Owned by: | pramsey |
---|---|---|---|
Priority: | medium | Milestone: | PostGIS 2.4.1 |
Component: | postgis | Version: | 2.3.x |
Keywords: | Cc: |
Description (last modified by )
[local] gis@gis=# create table juno_osm_point_unclustered as (select * from juno_osm_point ORDER BY osm_id); SELECT 159446335 Time: 601714,217 ms (10:01,714) [local] gis@gis=# create table juno_osm_point_pt1 as (select * from juno_osm_point_unclustered ORDER BY ST_GeoHash(ST_Transform(ST_Envelope(way),4326),10) COLLATE "C"); SELECT 159446335 Time: 1248105,490 ms (20:48,105) [local] gis@gis=# create table juno_osm_point_pt2 as (select * from juno_osm_point_unclustered ORDER BY way); SELECT 159446335 Time: 1287415,365 ms (21:27,415)
This may happen as there are some other infrastructure optimizations in string sorting code, like abbreviated key comparsions.
Change History (29)
comment:1 by , 7 years ago
comment:2 by , 7 years ago
Description: | modified (diff) |
---|
comment:3 by , 7 years ago
I guess. You're sorting 150M things and the difference is quite small. I admit I'm a little surprised that geohash isn't worse since I'm pretty sure it has to do a lot more work to generate a geohash that is needed to generate one of our pretend morton keys, which makes me wonder all the more about the result, and whether it's actually something to do with the sort itself.
comment:4 by , 7 years ago
In GeoHash scenario, it's all just being reprojected once, and later being sorted as integers due to abbreviated keys.
Exposing the morton key as abbreviated key can help: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=4ea51cdfe85ceef8afabceb03c446574daa0ac23
comment:5 by , 7 years ago
Please also note that fisrt time for the reference there is a osm_id sort, that is plain bigint sort. It finished twice as fast. There is no index on any sort in question.
comment:6 by , 7 years ago
Milestone: | PostGIS 2.4.0 → PostGIS 2.4.1 |
---|
comment:7 by , 7 years ago
Just to note, ST_GeoHash(ST_Transform(ST_Envelope(way),4326),10) COLLATE "C")
is tweaked based on OSM data, but the selection of 10 was done prior to the commit Komzpa linked, the C collation.
For geohash sorting, the C collation is vital and nearly cuts the time in half.
https://github.com/openstreetmap/osm2pgsql/pull/242 for GeoHash details https://github.com/openstreetmap/osm2pgsql/issues/581 for collation details.
The osm_id sort puts a lower bound on the theoretical speed.
comment:8 by , 7 years ago
It looks like ticket contains pre-2.4.0 build.
I've tried the same on 2.4.0, and it allocates more than 140Gb of RAM on order by way
. On GeoHash sort it allocates up to 10Gb then lowers to 3Gb. I didn't try to wait until query finish.
comment:9 by , 7 years ago
[local] gis@gis=# create table juno_osm_point_unclustered as (select * [more] ( > from juno_osm_point ORDER BY osm_id); SELECT 159843185 Time: 768781,686 ms (12:48,782) [local] gis@gis=# create table juno_osm_point_pt1 as (select * from juno_osm_point_unclustered ORDER BY ST_GeoHash(ST_Transform(ST_Envelope(way),4326),10) COLLATE "C"); SELECT 159843185 Time: 1223812,285 ms (20:23,812) [local] gis@gis=# create table juno_osm_point_pt2 as (select * from juno_osm_point_unclustered ORDER BY way); ERROR: 57014: canceling statement due to user request LOCATION: ProcessInterrupts, postgres.c:2999 Time: 932325,460 ms (15:32,325) -- here postgres ate 140GB of my 128 GB machine so I cancelled it. [local] gis@gis=# create table juno_osm_point_pt3 as (select * from juno_osm_point_unclustered ORDER BY way::bytea); SELECT 159843185 Time: 1055460,975 ms (17:35,461) -- this can be used as reasonable performance estimator
comment:10 by , 7 years ago
About 128GB been mercilessly eated in komzpa`s case: via gdb I can detect that there is a lot of palloc() coming from lwgeom_cmp, but no pfree():
#0 palloc (size=size@entry=32) at ./build/../src/backend/utils/mmgr/mcxt.c:850 #1 0x00005575da620a0a in heap_tuple_untoast_attr (attr=0x5577c344d1c0) at ./build/../src/backend/access/heap/tuptoaster.c:240 #2 0x00007f6c77386af2 in lwgeom_cmp () from /usr/lib/postgresql/10/lib/postgis-2.4.so #3 0x00005575da9e76b1 in comparison_shim (x=<optimized out>, y=<optimized out>, ssup=<optimized out>) at ./build/../src/backend/utils/sort/sortsupport.c:53 #4 0x00005575da9ec64e in ApplySortComparator (ssup=0x5575dbbb7b98, isNull2=<optimized out>, datum2=<optimized out>, isNull1=<optimized out>, datum1=<optimized out>) at ./build/../src/include/utils/sortsupport.h:225 #5 qsort_ssup (a=0x7f6c37b25078, n=<optimized out>, n@entry=14, ssup=ssup@entry=0x5575dbbb7b98) at ./qsort_tuple.c:244 #6 0x00005575da9ec930 in qsort_ssup (a=0x7f6c37b24b20, a@entry=0x7f6c37b24340, n=<optimized out>, ssup=ssup@entry=0x5575dbbb7b98) at ./qsort_tuple.c:323 #7 0x00005575da9ec55d in qsort_ssup (a=0x7f6c37b24340, n=<optimized out>, n@entry=778, ssup=ssup@entry=0x5575dbbb7b98) at ./qsort_tuple.c:309 #8 0x00005575da9ec930 in qsort_ssup (a=0x7f6c37b1d470, n=<optimized out>, n@entry=40979, ssup=ssup@entry=0x5575dbbb7b98) at ./qsort_tuple.c:323 #9 0x00005575da9ec930 in qsort_ssup (a=0x7f6c37924d20, n=<optimized out>, n@entry=140380, ssup=ssup@entry=0x5575dbbb7b98) at ./qsort_tuple.c:323 #10 0x00005575da9ec930 in qsort_ssup (a=0x7f6c3755cf20, a@entry=0x7f6c3710f6c8, n=<optimized out>, ssup=ssup@entry=0x5575dbbb7b98) at ./qsort_tuple.c:323 #11 0x00005575da9ec55d in qsort_ssup (a=a@entry=0x7f6c3710f6c8, n=<optimized out>, ssup=ssup@entry=0x5575dbbb7b98) at ./qsort_tuple.c:309 #12 0x00005575da9ec55d in qsort_ssup (a=a@entry=0x7f6c3710f6c8, n=<optimized out>, ssup=ssup@entry=0x5575dbbb7b98) at ./qsort_tuple.c:309 #13 0x00005575da9ec55d in qsort_ssup (a=a@entry=0x7f6c3710f6c8, n=<optimized out>, ssup=ssup@entry=0x5575dbbb7b98) at ./qsort_tuple.c:309 #14 0x00005575da9ec55d in qsort_ssup (a=0x7f6c3710f6c8, n=<optimized out>, n@entry=20343532, ssup=ssup@entry=0x5575dbbb7b98) at ./qsort_tuple.c:309 #15 0x00005575da9ec930 in qsort_ssup (a=0x7f6c161fb048, n=<optimized out>, ssup=<optimized out>) at ./qsort_tuple.c:323 #16 0x00005575da9eca89 in tuplesort_sort_memtuples (state=state@entry=0x5575dbbb7788) at ./build/../src/backend/utils/sort/tuplesort.c:3514
comment:11 by , 7 years ago
Missing PG_FREE_IF_COPY in here:
PG_FUNCTION_INFO_V1(lwgeom_cmp); Datum lwgeom_cmp(PG_FUNCTION_ARGS) { GSERIALIZED *g1 = PG_GETARG_GSERIALIZED_P(0); GSERIALIZED *g2 = PG_GETARG_GSERIALIZED_P(1); PG_RETURN_INT32(gserialized_cmp(g1, g2)); }
comment:13 by , 7 years ago
Can you test r15863 = 97fd8b3b0f40b47a87d6d793fe3bf6590855ebc6 (refs/remotes/svn/trunk) If confirmed to fix the excessive memory usage it should be backported to 2.4 branch
comment:14 by , 7 years ago
Thanks, memory issue went away. Do the backport.
As per speed, on pre-warmed table it becomes even slower now:
[local] gis@gis=# create table juno_osm_point_pt2 as (select * from juno_osm_point_unclustered ORDER BY way); SELECT 159843185 Time: 1510500,993 ms (25:10,501)
comment:16 by , 7 years ago
I think it is expected that cleaning up while working keeps your table clean but makes you work somewhat slower …
comment:17 by , 7 years ago
I dunno, I decided to profile it and couldn't see any big cpu-eating monsters there. I also ran my own timing on my own data (only 200k records) and found the direct geom ORDER BY to be slightly faster…
# select gid from ( select * from roads order by geom ) a limit 1; Time: 1193.123 ms (00:01.119) # select gid from ( select * from roads order by ST_GeoHash(ST_Transform(ST_Envelope(geom),4326),10) COLLATE "C" ) a limit 1; Time: 1373.566 ms (00:01.374)
comment:18 by , 7 years ago
The difference I see is that my dataset is made of points. Can you check on points dataset? It should perform more work by getting box from point first.
It feels that hash can be made of a point directly without building box first. Also for a point you don't need any comparsions after.
comment:19 by , 7 years ago
3.5% of improvement via memcpy removal in #3871:
[local] gis@gis=# create table juno_osm_point_pt2 as (select * from juno_osm_point_unclustered ORDER BY way); SELECT 159843185 Time: 1459625,971 ms (24:19,626)
comment:20 by , 7 years ago
I cannot confirm that polygon data sort is faster, too, although the difference is 7%, not 20 as in point case:
[local] gis@gis=# create table juno_osm_polygon_pt1 as (select * from juno_osm_polygon_unclustered order by ST_GeoHash(ST_Transform(ST_Envelope(way),4326),10) COLLATE "C"); SELECT 9936051 Time: 211660,739 ms (03:31,661) [local] gis@gis=# create table juno_osm_polygon_pt2 as (select * from juno_osm_polygon_unclustered order by way); SELECT 9936051 Time: 226570,568 ms (03:46,571) [local] gis@gis=# create table juno_osm_polygon_pt3 as (select * from juno_osm_polygon_unclustered order by way::bytea); SELECT 9936051 Time: 187295,132 ms (03:07,295)
This SELECT form is used in osm2pgsql workflow.
I've tried to have a look at what can be done for points in https://github.com/postgis/postgis/pull/146
comment:23 by , 7 years ago
Move further discussion of sortsupport improvements for geometry sorting into a fresh ticket.
comment:24 by , 7 years ago
After recompiling Postgres for server's target architecture and enabling -Ofast, on latest trunk - we're 12% slower than GeoHash instead of initial 20%:
[local] gis@gis=# create table juno_osm_point_pt1_fastmath as (select * from juno_osm_point_unclustered ORDER BY ST_GeoHash(ST_Transform(ST_Envelope(way),4326),10) COLLATE "C"); SELECT 159843185 Time: 928645,823 ms (15:28,646) [local] gis@gis=# create table juno_osm_point_pt2_fastmath as (select * from juno_osm_point_unclustered ORDER BY way); SELECT 159843185 Time: 1052055,649 ms (17:32,056)
comment:25 by , 7 years ago
I still find this very hard to wrap my head around. Look at how much work there is in the geohash calculation function — before any sorting can happen, all those expensive keys have to be calculated. So all we have left is the idea that somehow, in your very large sort, the cost of the sort itself washes out the cost of key calculation. So the sortsupport stuff better be the answer or I'm out of ideas; makes no sense.
comment:27 by , 7 years ago
It indeed helps a lot:
[local] gis@gis=# create table juno_osm_point_pt2_fastmath_ordered as (select * from juno_osm_point_pt2_fastmath order by way); SELECT 159843185 Time: 596587,644 ms (09:56,588)
comment:28 by , 7 years ago
The pre-sort ordering is very significant - if it's already ordered, much less work needs to be done.
What's the performance like for ordering for a 2nd time with a geohash order, and with an id order?
comment:29 by , 7 years ago
Bigint sorted sort is the same time as sorted geom and is almost same as unsorted bigint. Geohash time is same as unsorted Geohash.
[local] gis@gis=# create table juno_osm_point_pt1_fastmath_ordered as (select * from juno_osm_point_pt1_fastmath ORDER BY ST_GeoHash(ST_Transform(ST_Envelope(way),4326),10) COLLATE "C"); SELECT 159843185 Time: 935372,463 ms (15:35,372) [local] gis@gis=# create table juno_osm_point_unclustered_ordered as (select * from juno_osm_point_unclustered order by osm_id); SELECT 159843185 Time: 596486,915 ms (09:56,487)
Which version of PostGIS is this?
SELECT postgis_full_version();
You have it marked as 2.3, but just want to make sure it really is 2.3 and not 2.4. Since pramsey changed the sort behavior in 2.4