Opened 4 years ago
Closed 3 years ago
#4857 closed defect (worksforme)
Large ST_Union makes PostgreSQL temporarily unrestartable
Reported by: | grassick | Owned by: | pramsey |
---|---|---|---|
Priority: | high | Milestone: | PostGIS 3.1.4 |
Component: | postgis | Version: | 3.1.x |
Keywords: | Cc: |
Description
A query containing an ST_Union with a large number of features becomes un-killable until it completes. You can't even restart PostgreSQL or reboot the server until it's done.
e.g. "select ST_Union(ST_Buffer(location, 10000)) from water_points" where water_points has 400,000 records.
Server log shows:
2021-02-18 16:16:39.939 UTC [57720] LOG: received fast shutdown request 2021-02-18 16:16:39.945 UTC [57720] LOG: aborting any active transactions 2021-02-18 16:16:39.947 UTC [264426] postgres@postgres FATAL: terminating connection due to administrator command 2021-02-18 16:16:39.947 UTC [264427] postgres@mwater_opt FATAL: terminating connection due to administrator command 2021-02-18 16:16:39.950 UTC [57720] LOG: background worker "logical replication launcher" (PID 244337) exited with exit code 1 2021-02-18 16:18:14.865 UTC [264759] postgres@mwater_opt FATAL: terminating connection due to administrator command 2021-02-18 16:18:14.865 UTC [264759] postgres@mwater_opt STATEMENT: select ST_Union(ST_Buffer(location, 10000)) from water_points 2021-02-18 16:18:14.962 UTC [57764] LOG: shutting down 2021-02-18 16:18:16.751 UTC [57720] LOG: database system is shut down 2021-02-18 16:18:17.156 UTC [264881] LOG: starting PostgreSQL 13.2 (Ubuntu 13.2-1.pgdg20.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, 64-bit
Note the 2 minute delay of the database being completely unavailable!
Attachments (1)
Change History (20)
comment:1 by , 4 years ago
comment:2 by , 4 years ago
I'm encountering it on:
POSTGIS="3.1.1 aaf4c79" [EXTENSION] PGSQL="130" GEOS="3.8.0-CAPI-1.13.1 " PROJ="6.3.1" GDAL="GDAL 3.0.4, released 2020/01/28"
and also on
POSTGIS="3.1.1 aaf4c79" [EXTENSION] PGSQL="130" GEOS="3.7.1-CAPI-1.11.1 27a5e771" PROJ="Rel. 4.9.3, 15 August 2016" GDAL="GDAL 2.2.3, released 2017/11/20"
In an odd twist, when I tested it again today, I was able to kill the query after a few seconds in pgAdmin. But I was unable to easily kill it by restarting the database (it took a full 100 seconds to restart).
comment:3 by , 4 years ago
Also, any chance the data set / query are public, or you can replicate with a public data set? Makes it easier to confirm the fix is actually a fix.
comment:4 by , 4 years ago
Hm, so maybe this is not to do with the GEOS interrupt at all, but more to do with the kinds of interrupt we're working with at the PostGIS layer. I *think* that restart sends kind of a "gentle interrupt", perhaps the issue is that we only respond to a harder one (whatever the analogue to a ctrl-C in psql is). I did find interrupt handling code in GEOS in the main noding routines, so it's unlikely to wait very long to hit a handler.
comment:5 by , 4 years ago
Yeah, bingo, the shutdown sequence sends SIGTERM, but PostGIS only handles SIGINT. https://www.postgresql.org/docs/current/server-shutdown.html
comment:6 by , 4 years ago
The query for testing it is:
select ST_Union(ST_Buffer(location, 10000)) from test_union
Thanks for looking into this!
comment:7 by , 4 years ago
comment:8 by , 4 years ago
Yeah, a fast shutdown is a SIGINT, so maybe I'm just totally wrong, and in fact you've just found a case that doesn't interrupt very fast in one place… There's definitely interrupts all through the union and buffer code, but there's also some places (overlay validation) that could maybe run a long time? So whether you got a "fast" interrupt or not would be somewhat random depending on where the process happened to be when you hit the ctrl-c.
comment:9 by , 4 years ago
It's a strange one… but as long as I let it get 15 seconds or so into the query, I can reproduce it every time using "/etc/init.d/postgresql restart"
comment:10 by , 4 years ago
It's… not even a slam dunk what to fix. There's an interrupt right in MCIndexNoder::intersectChains
and sure enough, when I get into that place where a long hang occurs after interrupt, and attach the debugger, that's right where the process is. Basically to fix this we'd need to move the interrupt deeper into the loop, but that would increase the overhead for all the times we're not trying to end the process.
comment:11 by , 4 years ago
I think this is a "won't fix". I've added some extra interrupt points to GEOS 3.9+ that were missing relative to the older versions, but I can't see a way to do anything without potentially impacting normal operations for smaller geometries. You've just got an extra nasty case here that happens to have some very long run-times for some of the stages of the union.
comment:12 by , 4 years ago
Before you "won't fix" it, I can now reproduce every time having the query almost instantly killable by pgAdmin:
ERROR: lwgeom_unaryunion_prec: GEOS Error: InterruptedException: Interrupted! SQL state: XX000
But it's very slow to kill the same query at the same point by restarting the server.
There must be some difference in the behavior between the two types of kill. It creates a real problem in that restarting the server can bring the entire production database down for up to two minutes.
I understand this is a frustrating bug, but if there's a way to cause it to abort in the same way during a database shutdown, it would prevent some major outages.
Thanks again!
comment:13 by , 4 years ago
Hello grassick,
Can you make sure your restart procedure actually calls fast shutdown and not smart shutdown? Most distros use smart shutdown and I believe it's default.
If it is a fast shutdown it has to be indistinguishable for PostGIS from Ctrl-C unless there is an issue in Postgres itself not dispatching it properly.
comment:14 by , 4 years ago
I double checked by using kill directly. With a SIGINT, it still takes at least 45 seconds to abort the query.
The logs say:
2021-02-23 14:18:07.856 UTC [347211] LOG: received fast shutdown request 2021-02-23 14:18:07.859 UTC [347211] LOG: aborting any active transactions 2021-02-23 14:18:51.650 UTC [347272] postgres@mwater_opt FATAL: terminating connection due to administrator command 2021-02-23 14:18:51.650 UTC [347272] postgres@mwater_opt STATEMENT: select ST_Union(ST_Buffer(location, 10000)) from test_union
If I interrupt it using pgAdmin, it stops instantly with:
2021-02-23 14:22:44.305 UTC [347394] postgres@mwater_opt ERROR: lwgeom_unaryunion_prec: GEOS Error: InterruptedException: Interrupted! 2021-02-23 14:22:44.305 UTC [347394] postgres@mwater_opt STATEMENT: select ST_Union(ST_Buffer(location, 10000)) from test_union
The only conclusion seems to me to be that there are two types of aborts coming from PostgreSQL, one that PostGIS handles gracefully and the other that it doesn't. Unfortunately, I don't know enough about the inner workings to diagnose further.
comment:15 by , 4 years ago
What you're seeing is inconsistent with what we know from documentation and from experience.
- You see a swift cancel from PgAdmin, which is causing a SIGINT to be sent to the backend and which PostGIS / GEOS handle with alacrity.
- You are seeing a slow cancel from the shutdown.
- According to the documentation, and ordinary shutdown sends SIGTERM, and PostGIS/GEOS ignore that signal (as the documentation would indicate they *should*, since during a normal shutdown PgSQL is expected to let running queries run to completion).
- You show a log that *says* it is running a "fast shutdown" which the documentation *says* should generate a SIGINT which we know from above generates a swift cancel, and yet you observe takes longer than you'd expect.
The only possibility I can see that matches all the above is that the "fast shutdown" does not in fact immediately send a SIGINT, but first sends a SIGTERM but waits for a bit and then sends SIGINT if things haven't ended on their own.
That still doesn't leave anything for us to fix. We are doing everything *right*. In the event of a SIGINT, we swiftly stop. In the event of a SIGTERM, we just let things run. I don't see how this should remain an open PostGIS ticket.
comment:16 by , 3 years ago
Milestone: | PostGIS 3.1.2 → 3.1.3 |
---|
comment:19 by , 3 years ago
Resolution: | → worksforme |
---|---|
Status: | new → closed |
I don't see how this should remain an open ticket either.
Probably need some extra interrupt points in GEOS, what GEOS version underlies this test?