Opened 3 years ago

Closed 3 years ago

#4857 closed defect (worksforme)

Large ST_Union makes PostgreSQL temporarily unrestartable

Reported by: grassick Owned by: pramsey
Priority: high Milestone: PostGIS 3.1.4
Component: postgis Version: 3.1.x
Keywords: Cc:

Description

A query containing an ST_Union with a large number of features becomes un-killable until it completes. You can't even restart PostgreSQL or reboot the server until it's done.

e.g. "select ST_Union(ST_Buffer(location, 10000)) from water_points" where water_points has 400,000 records.

Server log shows:

2021-02-18 16:16:39.939 UTC [57720] LOG:  received fast shutdown request
2021-02-18 16:16:39.945 UTC [57720] LOG:  aborting any active transactions
2021-02-18 16:16:39.947 UTC [264426] postgres@postgres FATAL:  terminating connection due to administrator command
2021-02-18 16:16:39.947 UTC [264427] postgres@mwater_opt FATAL:  terminating connection due to administrator command
2021-02-18 16:16:39.950 UTC [57720] LOG:  background worker "logical replication launcher" (PID 244337) exited with exit code 1
2021-02-18 16:18:14.865 UTC [264759] postgres@mwater_opt FATAL:  terminating connection due to administrator command
2021-02-18 16:18:14.865 UTC [264759] postgres@mwater_opt STATEMENT:  select ST_Union(ST_Buffer(location, 10000)) from water_points
2021-02-18 16:18:14.962 UTC [57764] LOG:  shutting down
2021-02-18 16:18:16.751 UTC [57720] LOG:  database system is shut down
2021-02-18 16:18:17.156 UTC [264881] LOG:  starting PostgreSQL 13.2 (Ubuntu 13.2-1.pgdg20.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, 64-bit

Note the 2 minute delay of the database being completely unavailable!

Attachments (1)

test_union.pg (4.8 MB ) - added by grassick 3 years ago.
Backup of file for testing it

Change History (20)

comment:1 by pramsey, 3 years ago

Probably need some extra interrupt points in GEOS, what GEOS version underlies this test?

comment:2 by grassick, 3 years ago

I'm encountering it on:

POSTGIS="3.1.1 aaf4c79" [EXTENSION] PGSQL="130" GEOS="3.8.0-CAPI-1.13.1 " PROJ="6.3.1" GDAL="GDAL 3.0.4, released 2020/01/28"

and also on

POSTGIS="3.1.1 aaf4c79" [EXTENSION] PGSQL="130" GEOS="3.7.1-CAPI-1.11.1 27a5e771" PROJ="Rel. 4.9.3, 15 August 2016" GDAL="GDAL 2.2.3, released 2017/11/20"

In an odd twist, when I tested it again today, I was able to kill the query after a few seconds in pgAdmin. But I was unable to easily kill it by restarting the database (it took a full 100 seconds to restart).

comment:3 by pramsey, 3 years ago

Also, any chance the data set / query are public, or you can replicate with a public data set? Makes it easier to confirm the fix is actually a fix.

comment:4 by pramsey, 3 years ago

Hm, so maybe this is not to do with the GEOS interrupt at all, but more to do with the kinds of interrupt we're working with at the PostGIS layer. I *think* that restart sends kind of a "gentle interrupt", perhaps the issue is that we only respond to a harder one (whatever the analogue to a ctrl-C in psql is). I did find interrupt handling code in GEOS in the main noding routines, so it's unlikely to wait very long to hit a handler.

comment:5 by pramsey, 3 years ago

Yeah, bingo, the shutdown sequence sends SIGTERM, but PostGIS only handles SIGINT. https://www.postgresql.org/docs/current/server-shutdown.html

by grassick, 3 years ago

Attachment: test_union.pg added

Backup of file for testing it

comment:6 by grassick, 3 years ago

The query for testing it is:

select ST_Union(ST_Buffer(location, 10000)) from test_union

Thanks for looking into this!

comment:7 by grassick, 3 years ago

The shutdown in the logs says:

2021-02-22 18:13:46.079 UTC [331417] LOG: received fast shutdown request 2021-02-22 18:13:46.084 UTC [331417] LOG: aborting any active transactions

which should be a SIGINT according to my reading of the docs?

comment:8 by pramsey, 3 years ago

Yeah, a fast shutdown is a SIGINT, so maybe I'm just totally wrong, and in fact you've just found a case that doesn't interrupt very fast in one place… There's definitely interrupts all through the union and buffer code, but there's also some places (overlay validation) that could maybe run a long time? So whether you got a "fast" interrupt or not would be somewhat random depending on where the process happened to be when you hit the ctrl-c.

comment:9 by grassick, 3 years ago

It's a strange one… but as long as I let it get 15 seconds or so into the query, I can reproduce it every time using "/etc/init.d/postgresql restart"

comment:10 by pramsey, 3 years ago

It's… not even a slam dunk what to fix. There's an interrupt right in MCIndexNoder::intersectChains and sure enough, when I get into that place where a long hang occurs after interrupt, and attach the debugger, that's right where the process is. Basically to fix this we'd need to move the interrupt deeper into the loop, but that would increase the overhead for all the times we're not trying to end the process.

https://github.com/libgeos/geos/blob/e4fd106f4f66167b554389c7b441d9bc5ee89b1b/src/noding/MCIndexNoder.cpp#L77

comment:11 by pramsey, 3 years ago

I think this is a "won't fix". I've added some extra interrupt points to GEOS 3.9+ that were missing relative to the older versions, but I can't see a way to do anything without potentially impacting normal operations for smaller geometries. You've just got an extra nasty case here that happens to have some very long run-times for some of the stages of the union.

comment:12 by grassick, 3 years ago

Before you "won't fix" it, I can now reproduce every time having the query almost instantly killable by pgAdmin:

ERROR:  lwgeom_unaryunion_prec: GEOS Error: InterruptedException: Interrupted!
SQL state: XX000

But it's very slow to kill the same query at the same point by restarting the server.

There must be some difference in the behavior between the two types of kill. It creates a real problem in that restarting the server can bring the entire production database down for up to two minutes.

I understand this is a frustrating bug, but if there's a way to cause it to abort in the same way during a database shutdown, it would prevent some major outages.

Thanks again!

Last edited 3 years ago by grassick (previous) (diff)

comment:13 by komzpa, 3 years ago

Hello grassick,

Can you make sure your restart procedure actually calls fast shutdown and not smart shutdown? Most distros use smart shutdown and I believe it's default.

If it is a fast shutdown it has to be indistinguishable for PostGIS from Ctrl-C unless there is an issue in Postgres itself not dispatching it properly.

comment:14 by grassick, 3 years ago

I double checked by using kill directly. With a SIGINT, it still takes at least 45 seconds to abort the query.

The logs say:

2021-02-23 14:18:07.856 UTC [347211] LOG:  received fast shutdown request
2021-02-23 14:18:07.859 UTC [347211] LOG:  aborting any active transactions
2021-02-23 14:18:51.650 UTC [347272] postgres@mwater_opt FATAL:  terminating connection due to administrator command
2021-02-23 14:18:51.650 UTC [347272] postgres@mwater_opt STATEMENT:  select ST_Union(ST_Buffer(location, 10000)) from test_union

If I interrupt it using pgAdmin, it stops instantly with:

2021-02-23 14:22:44.305 UTC [347394] postgres@mwater_opt ERROR:  lwgeom_unaryunion_prec: GEOS Error: InterruptedException: Interrupted!
2021-02-23 14:22:44.305 UTC [347394] postgres@mwater_opt STATEMENT:  select ST_Union(ST_Buffer(location, 10000)) from test_union

The only conclusion seems to me to be that there are two types of aborts coming from PostgreSQL, one that PostGIS handles gracefully and the other that it doesn't. Unfortunately, I don't know enough about the inner workings to diagnose further.

comment:15 by pramsey, 3 years ago

What you're seeing is inconsistent with what we know from documentation and from experience.

  • You see a swift cancel from PgAdmin, which is causing a SIGINT to be sent to the backend and which PostGIS / GEOS handle with alacrity.
  • You are seeing a slow cancel from the shutdown.
  • According to the documentation, and ordinary shutdown sends SIGTERM, and PostGIS/GEOS ignore that signal (as the documentation would indicate they *should*, since during a normal shutdown PgSQL is expected to let running queries run to completion).
  • You show a log that *says* it is running a "fast shutdown" which the documentation *says* should generate a SIGINT which we know from above generates a swift cancel, and yet you observe takes longer than you'd expect.

The only possibility I can see that matches all the above is that the "fast shutdown" does not in fact immediately send a SIGINT, but first sends a SIGTERM but waits for a bit and then sends SIGINT if things haven't ended on their own.

That still doesn't leave anything for us to fix. We are doing everything *right*. In the event of a SIGINT, we swiftly stop. In the event of a SIGTERM, we just let things run. I don't see how this should remain an open PostGIS ticket.

comment:16 by pramsey, 3 years ago

Milestone: PostGIS 3.1.23.1.3

comment:17 by pramsey, 3 years ago

Milestone: 3.1.3PostGIS 3.1.3

Milestone renamed

comment:18 by robe, 3 years ago

Milestone: PostGIS 3.1.3PostGIS 3.1.4

In prep for 3.1.3 release

comment:19 by robe, 3 years ago

Resolution: worksforme
Status: newclosed

I don't see how this should remain an open ticket either.

Note: See TracTickets for help on using tickets.