Opened 17 months ago

Closed 3 weeks ago

#5564 closed defect (fixed)

BRIN indexes cause server crash after continuous autovacuum

Reported by: robe Owned by: pramsey
Priority: medium Milestone: PostGIS 3.4.5
Component: postgis Version: 3.1.x
Keywords: Cc:

Description (last modified by robe)

Under intense conditions of autovacuuming of a geometry BRIN indexed tables, crashes have been found to occur.

Fixes suggested by Giuseppe Broccolo suggested

How to fix:

  1. Define a MERGE support function and add it in the OpClass
  2. Define a new UNION support function, similar to the ADD_VALUE one, that override the Postgresql core one in the OpClass (note that the MERGE support function is only used by ADD_VALUE and UNION)

Solution 2. is maybe the most consistent with the current implementation in PostGIS due to the peculiarity of geospatial data type in order to be indexed, that originally brought us to redefine (and simplify compared to Postgresql core version) the ADD_VALUE function. But I would like to hear about other opinions regarding this.

The gdb backtrace of this issue:

#0 FunctionCall2Coll(fcinfo=0x0 collation-collationcentry=0x0 arg =232/28/1391480| arg2=232728/1252344) at tmgr.c:1306\n
#1 0x0000000000603533 in brin_inclusion_union (fcinfo=<optimized out>) at brin_inclusion.c:522\n
#2 0x0000000000a9be34 in FunctionCal13Coll (flinfo=0×152d885cf98| collation=optimizedout>|
arg1=argi@entry=23285305342184| arg2=arg2@entry=23272870259120| arg3=arg3@entry=23272871440776) at
fmgr.c:1331\n
#3 0x0000000000601ea6 in union_tuples (b=0x152aa33c9278| a=<optimized out>| bdesc=<optimized
out>) at brin.c:1651 \n
#4 summarize_range CheapNumBlks=53390|heapBIk=43520| heapRel=0x152d88748a98|
state=<optimized out>| indexInfo=0x152a33c9518) at brin.c:1460\n
#5 brinsummarize index=0x152d88749b48|heapRel=heapRel@entry=0x152d88748a98| pageRange=pageRange@entry=4294967295|
include_partial=include_partial@entry-false| numSummarized-numSummarized@entry=0×152aa32a3308|
numExisting=numExisting@entry=0x152a32a3308) at brin.c:1542\n
#6 0x000000000060219a in brinvacuumcleanup(info=0x7ffd2a9740b0| stats=<optimized out>) at brin.c:957\n
#7 0x0000000000652280 in lazy_cleanup_one_index
(indrel=0x152088749648| 1stat=0x152aa32@3300| reltuples=reltuples@entry=176649|
estimated_count=estimated_count@entry=true| vacrel=vacrel@entry=0x152a33c9628) at vacuumlazy.c:3404\n
#8 0x0000000000654f74 in lazy_cleanup_all_indexes (vacrel=8x152aa33c9628) at vacuumlazy.c:3295\n
#9 lazy_scan_heap (aggressive=false| params=0x152d88785ac| vacrel=<optimized out>) at vacuumlazy.c: 1851\n
#10 heap_vacuum_rel (rel=0x152088748298| params=0x152d8878f5ac| bstrategy=<optimized out>) at
vacuumlazy.c:746\n/
#11 0x00000000007bflba in table_relation_vacuum (bstrategy-<optimized out>|
params=0x152d8878f5ac| rel=0x152d88748a98) at ../../../src/include/access/tableam.h:1682\n
#12 vacuum_rel(relid=1830312108| relation=<optimized out>| params=params@entry=0x152d8878f5ac) at vacuum. c: 2122\n
#13 0x00000000007c065c in vacuum (relations=0×152aa33a188| params=params@entry=0×152d8878f5ac|
bstrategy=<optimized out>| bstrategy@entry=0×152aa33dba38| isTopLevel=isTopLevel@entry=true) at
vacuum. c:494\n
#14 0x00000000005d3ba7 in autovacuum_do_vac_analyze (bstrategy=0×152aa33d0038|
tab=0x152d8878f5a8 at autovacuum.c:3267\n
#15 do_autovacuum ( at autovacuum.c: 2507\n
#16 0x00000000005041b7 in AutoVackorkerMain(arg=0x0| a r g = 0 )a t autovacuum.c:1728\n
#17 0x00000000008bbob in StartAutoVacWorker( at autovacuum.c:1512\n
#18 0x00000000008c3fc in StartAutovacuumWorker ( at postmaster.c:6885\n
#19 sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:6313\


Change History (15)

comment:1 by robe, 17 months ago

Description: modified (diff)

comment:2 by robe, 17 months ago

Even though this has been an issue since BRIN was introduced presumably, so ideally we'd want to backport to 3.1 or 3.0 I think since both options would require change in the opclass, we should probably only have it in PostGIS 3.5+

comment:3 by giubro, 16 months ago

Hi,

I may help here sine I originally worked on the introduction of BRINs in PostGIS, in 2016.

The attached gdb back trace shows the source of the issue: the crash happens in this line

https://github.com/postgres/postgres/blob/master/src/backend/access/brin/brin_inclusion.c#L522

The same back trace shows what called the UNION support function above: it's the resummarization of the BRIN triggered by an autovacuum (here: https://github.com/postgres/postgres/blob/master/src/backend/access/brin/brin.c#L1542, line number does not perfectly match, but I recognize the called method).

Looks like the problem here is the MERGE support function not properly defined for the OpClass in PostGIS.

I guess in 2016 we never were able to trigger so often autovacuums (and so, many calls of the UNION support function during the resummarization) to reproduce this issue in our tests.

The fix per se is not difficult: the MERGE support function takes only care of updating the bboxes of the merged geometries in the UNION, when the page summary is recalculated. This logic is already implemented in the ADD_VALUE method we defined ad-hoc in PostGIS.

How to fix:

  1. Define a MERGE support function and add it in the OpClass
  2. Define a new UNION support function, as similarly done for the ADD_VALUE, that overrides the Postgresql core one in the OpClass (note that the MERGE support function is only used by ADD_VALUE and UNION)

Solution 2. is maybe the most consistent with the current implementation in PostGIS due to the peculiarity of geospatial data type in order to be indexed (handling bboxes etc.), that originally brought us to redefine (and simplify, compared to Postgresql core version) the ADD_VALUE function. But I would like to hear about other opinions regarding this.

comment:4 by robe, 16 months ago

@pramsey any opinion on which approach you think is best?

@giubro,

So you are proposing solution 2 right?

comment:5 by robe, 16 months ago

@giubro,

Can you confirm you want to work on this, or @pramsey should take the reins on this.

He offered to work on it if you were too busy with other things to work on this. If you don't respond by the time @pramsey gets back from his conference talks and workshop, which I believe will be 2 days from now, then he will start working on it.

comment:6 by robe, 5 months ago

Milestone: PostGIS 3.5.0PostGIS Fund Me

comment:7 by robe, 5 weeks ago

Not sure if this might be another way of crashing, easier to include if we ever decided to fix this issue. This one reported by @velix on on postgis:osgeo.org matrix/IRC

DROP TABLE IF EXISTS random_points;
CREATE TABLE random_points AS
SELECT ST_MakePoint(0, 0) AS geom FROM generate_series(1, 130_561);
CREATE INDEX ON random_points USING brin(geom);

Takes a second to crash for me.

comment:8 by robe, 5 weeks ago

FWIW, changing the number to 130_560 makes it not crash.

comment:9 by robe, 5 weeks ago

Tested

DROP TABLE IF EXISTS random_points;
CREATE TABLE random_points AS
SELECT ST_MakePoint(0, 0) AS geom FROM generate_series(1, 130_561);
CREATE INDEX ON random_points USING brin(geom);

on:

  • PG13, 3.0.12dev
  • PG14 3.1.12, 3.6.0dev
  • PG15 3.5.0
  • PG16 3.5.0

and doesn't crash on those, but 130_561 also gives a syntax error on PG13-PG15 above. So had to change that to 130561. 130561 crashes on 17 3.5.0 too.

We know this issue only triggers with parallelism enabled, e.g doing this:

SET max_parallel_workers TO 0;

will prevent it from crashing on PG17. So I guess the trivial issue case only arises on PG17 and higher with parallelism enabled. I presuming because PG17 is the first version to support parallel builds of BRIN as noted in the feature matrix - https://www.postgresql.org/about/featurematrix/#indexing-constraints "Parallelized CREATE INDEX for BRIN indexes"

comment:11 by robe, 4 weeks ago

Milestone: PostGIS Fund MePostGIS 3.4.5

comment:12 by Paul Ramsey <pramsey@…>, 4 weeks ago

In 35706c2/git:

Add merge AM support function (11) for all BRIN operator class
families. References #5564. Supports parallel BRIN build on PG17+
and in general should fix some longstanding low-probability crash
cases related to BRIN.

comment:13 by Regina Obe <lr@…>, 4 weeks ago

In 5a03c5e/git:

BRIN Merge Support

  • Add merge AM support function (11) for all BRIN operator class families. Supports parallel BRIN build on PG17+ and in general should fix some longstanding low-probability crash cases related to BRIN.
  • Add upgrade logic to shimmy in the brin_inclusion_merge functions
  • Add crash tests

References #5564 for PostGIS 3.5.2

comment:14 by Regina Obe <lr@…>, 4 weeks ago

In f88f26e/git:

NEWS update, References #5564 for PostGIS 3.5.2

comment:15 by Regina Obe <lr@…>, 3 weeks ago

Resolution: fixed
Status: newclosed

In a02435c/git:

BRIN Merge Support

  • Add merge AM support function (11) for all BRIN operator class families. Supports parallel BRIN build on PG17+ and in general should fix some longstanding low-probability crash cases related to BRIN.
  • Add upgrade logic to shimmy in the brin_inclusion_merge functions
  • Add crash tests

Closes #5564 for PostGIS 3.4.5

Note: See TracTickets for help on using tickets.