Opened 18 months ago
Closed 17 months ago
#5385 closed defect (fixed)
Postgres malloc assertion fail when using pg_cancel_backend with ST_AsMVT
Reported by: | gbartonowenstl | Owned by: | pramsey |
---|---|---|---|
Priority: | high | Milestone: | PostGIS 3.1.10 |
Component: | postgis | Version: | 3.3.x |
Keywords: | malloc, crash | Cc: | gbartonowenstl |
Description
We recently experienced a very odd behaviour where our postgres database would experience a malloc assertion fail which would force the entire database into recovery mode, dropping all currently running queries for 10-20 seconds while it recovers - which, as you can imagine, caused singificant disruption to our service.
The log is as follows:
Apr 20 16:30:46 wag-prod postgres[12134]: [21] <database: - user: - timestamp:2023-04-20 16:30:46.440 UTC - host: - commandtag: - sqlerrorcode:00000 - pid:12134 - processstarttime:2023-04-19 01:54:51 UTC> LOG: server process (PID 1037998) was terminated by signal 6: Aborted <database: - user: - timestamp:2023-04-20 16:30:46.440 UTC - host: - commandtag: - sqlerrorcode:00000 - pid:12134 - processstarttime:2023-04-19 01:54:51 UTC> DETAIL: Failed process was running: -- <URL> <MVT Query> Apr 20 16:30:46 wag-prod postgres[12134]: [22] <database: - user: - timestamp:2023-04-20 16:30:46.440 UTC - host: - commandtag: - sqlerrorcode:00000 - pid:12134 - processstarttime:2023-04-19 01:54:51 UTC> LOG: terminating any other active server processes
bprod 10.82.122.108(57900) SELECT: malloc.c:2617: sysmalloc: Assertion `(old_top == initial_top (av) && old_size == 0) || ((unsigned long) (old_size) >= MINSIZE && prev_inuse (old_top) && ((unsigned long) old_end & (pagesize - 1)) == 0)' failed.
We traced the root cause to our newly increased use of calling pg_cancel_backend() within the narrow window in which ST_AsMVT or ST_AsMVTGeom queries were in flight.
I've created a recreation that calls a stand alone MVT query, and runs pg_cancel_backend in a loop, until we hit this recovery mode case.
Attachments (1)
Change History (16)
by , 18 months ago
Attachment: | malloc_reproduction.zip added |
---|
comment:1 by , 18 months ago
@gbartonwenstl,
What does
SELECT postgis_full_version(), version();
return for your system. I'm running your tests against our latest master branch on debian 11. Still waiting for it to crash the backend but it's still running.
comment:2 by , 18 months ago
okay it did finally crash
psql:mvt_style_call_minimal_reproduction.sql:53: ERROR: canceling statement due to user request 5036 psql:mvt_style_call_minimal_reproduction.sql:53: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. psql:mvt_style_call_minimal_reproduction.sql:53: error: connection to server was lost 5037 psql: error: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL: the database system is in recovery mode
comment:3 by , 18 months ago
Milestone: | PostGIS 3.3.3 → PostGIS 3.1.10 |
---|
follow-up: 6 comment:4 by , 17 months ago
Can you get it to go boom with just the
select st_buffer(centroids, 100) as geom from (values (ST_Point(-10680061.06558879665, 5050512.8441793214945, 3857)),
subquery?
comment:5 by , 17 months ago
Ah, but we do have independent interrupt handling in wagyu… so a second experiment would be: if you comment out all calls to lwgeom_wagyu_interrupt*
in postgis_module.c does the problem go away?
comment:6 by , 17 months ago
Replying to pramsey:
Can you get it to go boom with just the
select st_buffer(centroids, 100) as geom from (values (ST_Point(-10680061.06558879665, 5050512.8441793214945, 3857)),subquery?
In general,
- does just the buffer do it?
- if not, can you materialize the buffers into a table and call just the asmvtgeom and make it happen?
- if not, can you materialize the result of asmvtgeom and make it happen in asmvt?
After 7000 tmux runs, probably about 10 minutes at this point, it has not crashed. I'll rerun the original to confirm it still crashes and then run it without the wagyu interrupt.
comment:7 by , 17 months ago
For record, just so I don't forget how long to wait the mvt crashed at 6892 tmux run
comment:8 by , 17 months ago
After taking out the wagyu interrupt, it's still running after 10000 tmux runs. I'm going to assume it's not going to crash.
comment:9 by , 17 months ago
Wellll, in the short term we should comment out those callbacks, since crashes are worse than (minimal) missing interruptability. In the medium term, I am going to try to make an ASAN build of everything and see if I can induce some kind of out-of-bounds write by wagyu… my working hypothesis being that it does an out-of-bounds write on every interrupt, but only a very small fraction of them manage to overwrite something critical enough to cause a crash.
A stand alone reproduction of the error, specifically using docker-compose but it should work (with some fiddling) on any ubuntu-style postgres install