#5385 closed defect (fixed)

Postgres malloc assertion fail when using pg_cancel_backend with ST_AsMVT

Reported by: gbartonowenstl Owned by: pramsey
Priority: high Milestone: PostGIS 3.1.10
Component: postgis Version: 3.3.x
Keywords: malloc, crash Cc: gbartonowenstl

Description

We recently experienced a very odd behaviour where our postgres database would experience a malloc assertion fail which would force the entire database into recovery mode, dropping all currently running queries for 10-20 seconds while it recovers - which, as you can imagine, caused singificant disruption to our service.

The log is as follows:

Apr 20 16:30:46 wag-prod postgres[12134]: [21] <database: - user: - timestamp:2023-04-20 16:30:46.440 UTC - host: - commandtag: - sqlerrorcode:00000 - pid:12134 - processstarttime:2023-04-19 01:54:51 UTC> LOG:  server process (PID 1037998) was terminated by signal 6: Aborted
<database: - user: - timestamp:2023-04-20 16:30:46.440 UTC - host: - commandtag: - sqlerrorcode:00000 - pid:12134 - processstarttime:2023-04-19 01:54:51 UTC> DETAIL:  Failed process was running: -- <URL>
<MVT Query>
Apr 20 16:30:46 wag-prod postgres[12134]: [22] <database: - user: - timestamp:2023-04-20 16:30:46.440 UTC - host: - commandtag: - sqlerrorcode:00000 - pid:12134 - processstarttime:2023-04-19 01:54:51 UTC> LOG:  terminating any other active server processes

bprod 10.82.122.108(57900) SELECT: malloc.c:2617: sysmalloc: Assertion `(old_top == initial_top (av) && old_size == 0) || ((unsigned long) (old_size) >= MINSIZE && prev_inuse (old_top) && ((unsigned long) old_end & (pagesize - 1)) == 0)' failed.

We traced the root cause to our newly increased use of calling pg_cancel_backend() within the narrow window in which ST_AsMVT or ST_AsMVTGeom queries were in flight.

I've created a recreation that calls a stand alone MVT query, and runs pg_cancel_backend in a loop, until we hit this recovery mode case.

Attachments (1)

malloc_reproduction.zip (2.2 KB ) - added by gbartonowenstl 18 months ago.
A stand alone reproduction of the error, specifically using docker-compose but it should work (with some fiddling) on any ubuntu-style postgres install

Download all attachments as: .zip

Change History (16)

by gbartonowenstl, 18 months ago

Attachment: malloc_reproduction.zip added

A stand alone reproduction of the error, specifically using docker-compose but it should work (with some fiddling) on any ubuntu-style postgres install

comment:1 by robe, 18 months ago

@gbartonwenstl,

What does

SELECT postgis_full_version(), version();

return for your system. I'm running your tests against our latest master branch on debian 11. Still waiting for it to crash the backend but it's still running.

comment:2 by robe, 18 months ago

okay it did finally crash

psql:mvt_style_call_minimal_reproduction.sql:53: ERROR:  canceling statement due to user request
5036
psql:mvt_style_call_minimal_reproduction.sql:53: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
psql:mvt_style_call_minimal_reproduction.sql:53: error: connection to server was lost
5037
psql: error: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL:  the database system is in recovery mode

comment:3 by robe, 18 months ago

Milestone: PostGIS 3.3.3PostGIS 3.1.10

comment:4 by pramsey, 17 months ago

Can you get it to go boom with just the

select st_buffer(centroids, 100) as geom from (values
            (ST_Point(-10680061.06558879665, 5050512.8441793214945, 3857)),

subquery?

Version 0, edited 17 months ago by pramsey (next)

comment:5 by pramsey, 17 months ago

Ah, but we do have independent interrupt handling in wagyu… so a second experiment would be: if you comment out all calls to lwgeom_wagyu_interrupt* in postgis_module.c does the problem go away?

in reply to:  4 comment:6 by robe, 17 months ago

Replying to pramsey:

Can you get it to go boom with just the

select st_buffer(centroids, 100) as geom from (values
            (ST_Point(-10680061.06558879665, 5050512.8441793214945, 3857)),

subquery?

In general,

  • does just the buffer do it?
  • if not, can you materialize the buffers into a table and call just the asmvtgeom and make it happen?
  • if not, can you materialize the result of asmvtgeom and make it happen in asmvt?

After 7000 tmux runs, probably about 10 minutes at this point, it has not crashed. I'll rerun the original to confirm it still crashes and then run it without the wagyu interrupt.

comment:7 by robe, 17 months ago

For record, just so I don't forget how long to wait the mvt crashed at 6892 tmux run

comment:8 by robe, 17 months ago

After taking out the wagyu interrupt, it's still running after 10000 tmux runs. I'm going to assume it's not going to crash.

comment:9 by pramsey, 17 months ago

Wellll, in the short term we should comment out those callbacks, since crashes are worse than (minimal) missing interruptability. In the medium term, I am going to try to make an ASAN build of everything and see if I can induce some kind of out-of-bounds write by wagyu… my working hypothesis being that it does an out-of-bounds write on every interrupt, but only a very small fraction of them manage to overwrite something critical enough to cause a crash.

comment:10 by robe, 17 months ago

sounds good to me.

comment:11 by Regina Obe <lr@…>, 17 months ago

In c4d01ce3/git:

Prevent crash in mvt
References #5385 for PostGIS 3.4.0

comment:12 by Regina Obe <lr@…>, 17 months ago

In 57d14f0/git:

Prevent crash in mvt
References #5385 for PostGIS 3.3.4

comment:13 by Regina Obe <lr@…>, 17 months ago

In ccb42df/git:

Prevent crash in mvt
References #5385 for PostGIS 3.2.6

comment:14 by Regina Obe <lr@…>, 17 months ago

In 47ccf6ab/git:

Fix previous commit. References #5385 for PostGIS 3.2.6

comment:15 by Regina Obe <lr@…>, 17 months ago

Resolution: fixed
Status: newclosed

In d3b589fc/git:

Prevent crash in mvt
Closes #5385 for PostGIS 3.1.10

Note: See TracTickets for help on using tickets.