Opened 9 months ago

Last modified 9 months ago

#5564 new defect

BRIN indexes cause server crash after continuous autovacuum

Reported by: robe Owned by: pramsey
Priority: medium Milestone: PostGIS 3.5.0
Component: postgis Version: 3.1.x
Keywords: Cc:

Description (last modified by robe)

Under intense conditions of autovacuuming of a geometry BRIN indexed tables, crashes have been found to occur.

Fixes suggested by Giuseppe Broccolo suggested

How to fix:

  1. Define a MERGE support function and add it in the OpClass
  2. Define a new UNION support function, similar to the ADD_VALUE one, that override the Postgresql core one in the OpClass (note that the MERGE support function is only used by ADD_VALUE and UNION)

Solution 2. is maybe the most consistent with the current implementation in PostGIS due to the peculiarity of geospatial data type in order to be indexed, that originally brought us to redefine (and simplify compared to Postgresql core version) the ADD_VALUE function. But I would like to hear about other opinions regarding this.

The gdb backtrace of this issue:

#0 FunctionCall2Coll(fcinfo=0x0 collation-collationcentry=0x0 arg =232/28/1391480| arg2=232728/1252344) at tmgr.c:1306\n
#1 0x0000000000603533 in brin_inclusion_union (fcinfo=<optimized out>) at brin_inclusion.c:522\n
#2 0x0000000000a9be34 in FunctionCal13Coll (flinfo=0×152d885cf98| collation=optimizedout>|
arg1=argi@entry=23285305342184| arg2=arg2@entry=23272870259120| arg3=arg3@entry=23272871440776) at
fmgr.c:1331\n
#3 0x0000000000601ea6 in union_tuples (b=0x152aa33c9278| a=<optimized out>| bdesc=<optimized
out>) at brin.c:1651 \n
#4 summarize_range CheapNumBlks=53390|heapBIk=43520| heapRel=0x152d88748a98|
state=<optimized out>| indexInfo=0x152a33c9518) at brin.c:1460\n
#5 brinsummarize index=0x152d88749b48|heapRel=heapRel@entry=0x152d88748a98| pageRange=pageRange@entry=4294967295|
include_partial=include_partial@entry-false| numSummarized-numSummarized@entry=0×152aa32a3308|
numExisting=numExisting@entry=0x152a32a3308) at brin.c:1542\n
#6 0x000000000060219a in brinvacuumcleanup(info=0x7ffd2a9740b0| stats=<optimized out>) at brin.c:957\n
#7 0x0000000000652280 in lazy_cleanup_one_index
(indrel=0x152088749648| 1stat=0x152aa32@3300| reltuples=reltuples@entry=176649|
estimated_count=estimated_count@entry=true| vacrel=vacrel@entry=0x152a33c9628) at vacuumlazy.c:3404\n
#8 0x0000000000654f74 in lazy_cleanup_all_indexes (vacrel=8x152aa33c9628) at vacuumlazy.c:3295\n
#9 lazy_scan_heap (aggressive=false| params=0x152d88785ac| vacrel=<optimized out>) at vacuumlazy.c: 1851\n
#10 heap_vacuum_rel (rel=0x152088748298| params=0x152d8878f5ac| bstrategy=<optimized out>) at
vacuumlazy.c:746\n/
#11 0x00000000007bflba in table_relation_vacuum (bstrategy-<optimized out>|
params=0x152d8878f5ac| rel=0x152d88748a98) at ../../../src/include/access/tableam.h:1682\n
#12 vacuum_rel(relid=1830312108| relation=<optimized out>| params=params@entry=0x152d8878f5ac) at vacuum. c: 2122\n
#13 0x00000000007c065c in vacuum (relations=0×152aa33a188| params=params@entry=0×152d8878f5ac|
bstrategy=<optimized out>| bstrategy@entry=0×152aa33dba38| isTopLevel=isTopLevel@entry=true) at
vacuum. c:494\n
#14 0x00000000005d3ba7 in autovacuum_do_vac_analyze (bstrategy=0×152aa33d0038|
tab=0x152d8878f5a8 at autovacuum.c:3267\n
#15 do_autovacuum ( at autovacuum.c: 2507\n
#16 0x00000000005041b7 in AutoVackorkerMain(arg=0x0| a r g = 0 )a t autovacuum.c:1728\n
#17 0x00000000008bbob in StartAutoVacWorker( at autovacuum.c:1512\n
#18 0x00000000008c3fc in StartAutovacuumWorker ( at postmaster.c:6885\n
#19 sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:6313\


Change History (5)

comment:1 by robe, 9 months ago

Description: modified (diff)

comment:2 by robe, 9 months ago

Even though this has been an issue since BRIN was introduced presumably, so ideally we'd want to backport to 3.1 or 3.0 I think since both options would require change in the opclass, we should probably only have it in PostGIS 3.5+

comment:3 by giubro, 9 months ago

Hi,

I may help here sine I originally worked on the introduction of BRINs in PostGIS, in 2016.

The attached gdb back trace shows the source of the issue: the crash happens in this line

https://github.com/postgres/postgres/blob/master/src/backend/access/brin/brin_inclusion.c#L522

The same back trace shows what called the UNION support function above: it's the resummarization of the BRIN triggered by an autovacuum (here: https://github.com/postgres/postgres/blob/master/src/backend/access/brin/brin.c#L1542, line number does not perfectly match, but I recognize the called method).

Looks like the problem here is the MERGE support function not properly defined for the OpClass in PostGIS.

I guess in 2016 we never were able to trigger so often autovacuums (and so, many calls of the UNION support function during the resummarization) to reproduce this issue in our tests.

The fix per se is not difficult: the MERGE support function takes only care of updating the bboxes of the merged geometries in the UNION, when the page summary is recalculated. This logic is already implemented in the ADD_VALUE method we defined ad-hoc in PostGIS.

How to fix:

  1. Define a MERGE support function and add it in the OpClass
  2. Define a new UNION support function, as similarly done for the ADD_VALUE, that overrides the Postgresql core one in the OpClass (note that the MERGE support function is only used by ADD_VALUE and UNION)

Solution 2. is maybe the most consistent with the current implementation in PostGIS due to the peculiarity of geospatial data type in order to be indexed (handling bboxes etc.), that originally brought us to redefine (and simplify, compared to Postgresql core version) the ADD_VALUE function. But I would like to hear about other opinions regarding this.

comment:4 by robe, 9 months ago

@pramsey any opinion on which approach you think is best?

@giubro,

So you are proposing solution 2 right?

comment:5 by robe, 9 months ago

@giubro,

Can you confirm you want to work on this, or @pramsey should take the reins on this.

He offered to work on it if you were too busy with other things to work on this. If you don't respond by the time @pramsey gets back from his conference talks and workshop, which I believe will be 2 days from now, then he will start working on it.

Note: See TracTickets for help on using tickets.