Opened 9 years ago

Closed 9 years ago

Last modified 9 years ago

#3382 closed enhancement (fixed)

Avoid deserializing small geometries during index operations

Reported by: dbaston Owned by: pramsey
Priority: high Milestone: PostGIS 2.3.0
Component: postgis Version: master
Keywords: Cc:

Description

In doing some work on #75, I decided to profile some simple point-in-polygon queries using Callgrind. I was surprised to see that much of the time was being taken by gserialized_gist_2d.c, around line 565. Since points aren't carrying a box, this code path was doing a lot of work deserializing points to compute a using generic routines.

However, a more efficient approach for these corner cases has already been implemented in the gserialized_peek_gbox_p function. Replacing the current deserialization code with a call to gserialized_get_gbox_p allows gserialized_gist_2d to take advantage of the more efficient routines without adding any complexity.

In my testing this is bringing about a 30-40% performance increase in point-in-polygon joins.

Attachments (1)

callgrind.out.22131.gz (150.9 KB ) - added by dbaston 9 years ago.
callgrind output for geography point-in-poly join

Download all attachments as: .zip

Change History (16)

comment:1 by dbaston, 9 years ago

Pull request at https://github.com/postgis/postgis/pull/76

This contains an unrelated fix for a glitch in gserialized_peek_gbox_p, which caused a garbage box to be computed for MULTIPOINT(EMPTY), making the WKT regression test fail.

comment:2 by robe, 9 years ago

pramsey do you have issue with dbaston just committing this work to 2.3? I'm itching to test out a 30-40% improvement but too lazy to pull another repo to do it. :).

comment:3 by dbaston, 9 years ago

Updated the PR with an some cunit tests and an additional bugfix for gserialized_peek_gbox_p. I can pull that stuff out into a separate ticket/patch if that's preferable.

History in the PR is cluttered by a merge (there has to be a better way), but you can see the actual changes proposed at https://github.com/postgis/postgis/compare/svn-trunk...dbaston:trac-3382

comment:4 by strk, 9 years ago

You can avoid a merge by rebasing your changes on top of trunk:

git rebase github/svn-trunk

Then you'll have to force-push to the branch associated with your PR (not sure what it is)

git push -f # assuming it'll push to the appropriate branch

comment:5 by robe, 9 years ago

I had to rebase my repo for another project working on and found this article.

https://github.com/edx/edx-platform/wiki/How-to-Rebase-a-Pull-Request

In my case it was fairly simple since i just wanted to synch after my pull request was accepted. Seems you are missing a get remote line

So my command looked like

git clone https:/github...your_repo  somefolder
cd somefolder
git remote add somename_base https://github/the_master_url
git fetch somename_base
git rebase somename_base/svn-trunk
git push
Version 1, edited 9 years ago by robe (previous) (next) (diff)

comment:6 by dbaston, 9 years ago

That's a fantastic link, thank you!

Hopefully @pramsey can weigh in on this one before another sync becomes necessary :)

comment:7 by pramsey, 9 years ago

I really cannot see the relevant changeset in the pull request, unfortunately, just the new tests and the fix for the multipoint(empty) case (good catch). Just go for it, you can always pick up the pieces if something breaks (yee haw).

comment:8 by dbaston, 9 years ago

Committed to trunk at r14478.

robe, if you see a decent performance benefit from this, let me know. It seems like it would be pretty simple to extend the "gserialized_peek" function that's behind this to handle geography points as well. (The gains I was seeing were in counting intersections between 20,000 polygons — TIGER faces for FIPS 25001 — and 100,000 random points selected from the extent of that dataset.)

comment:9 by dbaston, 9 years ago

Resolution: fixed
Status: newclosed

comment:10 by pramsey, 9 years ago

Don't try to extend this to geography. The bounding box of a two-point line in geography may not be just the line's extrema.

comment:11 by robe, 9 years ago

dbaston - okay will test later in the week or weekend.

in reply to:  10 comment:12 by dbaston, 9 years ago

Replying to pramsey:

Don't try to extend this to geography. The bounding box of a two-point line in geography may not be just the line's extrema.

Understood, but this should be safe and simple to do for geography points, no?

comment:13 by pramsey, 9 years ago

The point geographic coordinates ahve to be converted to geocentric before they are a true "geography bounding box" but yes(ish). I'm not actually sure if we store boxes on geography points or not.

comment:14 by robe, 9 years ago

I think my commit at #3361 made it so that we never store boxes for geography points. I'd have to verify. Before it wasn't consistent.

by dbaston, 9 years ago

Attachment: callgrind.out.22131.gz added

callgrind output for geography point-in-poly join

comment:15 by dbaston, 9 years ago

I ran a geography version of the same test scenario. It looks like same issue is present (deserializing to compute a box), but it's much less significant, around 10% of query time. It looks like the relevant place to change would be in libpgcommon/gserialized_gist.c:289. Of course, you still have to compute the box, so maybe you could bring it down to 6-8%…doesn't quite seem worth it.

I attached a copy of my callgrind output if anyone's interested. I recently discovered kcachegrind, which is a really nice viewer for these files.

Note: See TracTickets for help on using tickets.