Opened 17 months ago
Closed 3 weeks ago
#5564 closed defect (fixed)
BRIN indexes cause server crash after continuous autovacuum
Reported by: | robe | Owned by: | pramsey |
---|---|---|---|
Priority: | medium | Milestone: | PostGIS 3.4.5 |
Component: | postgis | Version: | 3.1.x |
Keywords: | Cc: |
Description (last modified by )
Under intense conditions of autovacuuming of a geometry BRIN indexed tables, crashes have been found to occur.
Fixes suggested by Giuseppe Broccolo suggested
How to fix:
- Define a MERGE support function and add it in the OpClass
- Define a new UNION support function, similar to the ADD_VALUE one, that override the Postgresql core one in the OpClass (note that the MERGE support function is only used by ADD_VALUE and UNION)
Solution 2. is maybe the most consistent with the current implementation in PostGIS due to the peculiarity of geospatial data type in order to be indexed, that originally brought us to redefine (and simplify compared to Postgresql core version) the ADD_VALUE function. But I would like to hear about other opinions regarding this.
The gdb backtrace of this issue:
#0 FunctionCall2Coll(fcinfo=0x0 collation-collationcentry=0x0 arg =232/28/1391480| arg2=232728/1252344) at tmgr.c:1306\n #1 0x0000000000603533 in brin_inclusion_union (fcinfo=<optimized out>) at brin_inclusion.c:522\n #2 0x0000000000a9be34 in FunctionCal13Coll (flinfo=0×152d885cf98| collation=optimizedout>| arg1=argi@entry=23285305342184| arg2=arg2@entry=23272870259120| arg3=arg3@entry=23272871440776) at fmgr.c:1331\n #3 0x0000000000601ea6 in union_tuples (b=0x152aa33c9278| a=<optimized out>| bdesc=<optimized out>) at brin.c:1651 \n #4 summarize_range CheapNumBlks=53390|heapBIk=43520| heapRel=0x152d88748a98| state=<optimized out>| indexInfo=0x152a33c9518) at brin.c:1460\n #5 brinsummarize index=0x152d88749b48|heapRel=heapRel@entry=0x152d88748a98| pageRange=pageRange@entry=4294967295| include_partial=include_partial@entry-false| numSummarized-numSummarized@entry=0×152aa32a3308| numExisting=numExisting@entry=0x152a32a3308) at brin.c:1542\n #6 0x000000000060219a in brinvacuumcleanup(info=0x7ffd2a9740b0| stats=<optimized out>) at brin.c:957\n #7 0x0000000000652280 in lazy_cleanup_one_index (indrel=0x152088749648| 1stat=0x152aa32@3300| reltuples=reltuples@entry=176649| estimated_count=estimated_count@entry=true| vacrel=vacrel@entry=0x152a33c9628) at vacuumlazy.c:3404\n #8 0x0000000000654f74 in lazy_cleanup_all_indexes (vacrel=8x152aa33c9628) at vacuumlazy.c:3295\n #9 lazy_scan_heap (aggressive=false| params=0x152d88785ac| vacrel=<optimized out>) at vacuumlazy.c: 1851\n #10 heap_vacuum_rel (rel=0x152088748298| params=0x152d8878f5ac| bstrategy=<optimized out>) at vacuumlazy.c:746\n/ #11 0x00000000007bflba in table_relation_vacuum (bstrategy-<optimized out>| params=0x152d8878f5ac| rel=0x152d88748a98) at ../../../src/include/access/tableam.h:1682\n #12 vacuum_rel(relid=1830312108| relation=<optimized out>| params=params@entry=0x152d8878f5ac) at vacuum. c: 2122\n #13 0x00000000007c065c in vacuum (relations=0×152aa33a188| params=params@entry=0×152d8878f5ac| bstrategy=<optimized out>| bstrategy@entry=0×152aa33dba38| isTopLevel=isTopLevel@entry=true) at vacuum. c:494\n #14 0x00000000005d3ba7 in autovacuum_do_vac_analyze (bstrategy=0×152aa33d0038| tab=0x152d8878f5a8 at autovacuum.c:3267\n #15 do_autovacuum ( at autovacuum.c: 2507\n #16 0x00000000005041b7 in AutoVackorkerMain(arg=0x0| a r g = 0 )a t autovacuum.c:1728\n #17 0x00000000008bbob in StartAutoVacWorker( at autovacuum.c:1512\n #18 0x00000000008c3fc in StartAutovacuumWorker ( at postmaster.c:6885\n #19 sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:6313\
Change History (15)
comment:1 by , 17 months ago
Description: | modified (diff) |
---|
comment:2 by , 17 months ago
comment:3 by , 16 months ago
Hi,
I may help here sine I originally worked on the introduction of BRINs in PostGIS, in 2016.
The attached gdb back trace shows the source of the issue: the crash happens in this line
https://github.com/postgres/postgres/blob/master/src/backend/access/brin/brin_inclusion.c#L522
The same back trace shows what called the UNION support function above: it's the resummarization of the BRIN triggered by an autovacuum (here: https://github.com/postgres/postgres/blob/master/src/backend/access/brin/brin.c#L1542, line number does not perfectly match, but I recognize the called method).
Looks like the problem here is the MERGE support function not properly defined for the OpClass in PostGIS.
I guess in 2016 we never were able to trigger so often autovacuums (and so, many calls of the UNION support function during the resummarization) to reproduce this issue in our tests.
The fix per se is not difficult: the MERGE support function takes only care of updating the bboxes of the merged geometries in the UNION, when the page summary is recalculated. This logic is already implemented in the ADD_VALUE method we defined ad-hoc in PostGIS.
How to fix:
- Define a MERGE support function and add it in the OpClass
- Define a new UNION support function, as similarly done for the ADD_VALUE, that overrides the Postgresql core one in the OpClass (note that the MERGE support function is only used by ADD_VALUE and UNION)
Solution 2. is maybe the most consistent with the current implementation in PostGIS due to the peculiarity of geospatial data type in order to be indexed (handling bboxes etc.), that originally brought us to redefine (and simplify, compared to Postgresql core version) the ADD_VALUE function. But I would like to hear about other opinions regarding this.
comment:4 by , 16 months ago
@pramsey any opinion on which approach you think is best?
@giubro,
So you are proposing solution 2 right?
comment:5 by , 16 months ago
@giubro,
Can you confirm you want to work on this, or @pramsey should take the reins on this.
He offered to work on it if you were too busy with other things to work on this. If you don't respond by the time @pramsey gets back from his conference talks and workshop, which I believe will be 2 days from now, then he will start working on it.
comment:6 by , 5 months ago
Milestone: | PostGIS 3.5.0 → PostGIS Fund Me |
---|
comment:7 by , 5 weeks ago
Not sure if this might be another way of crashing, easier to include if we ever decided to fix this issue. This one reported by @velix on on postgis:osgeo.org matrix/IRC
DROP TABLE IF EXISTS random_points; CREATE TABLE random_points AS SELECT ST_MakePoint(0, 0) AS geom FROM generate_series(1, 130_561); CREATE INDEX ON random_points USING brin(geom);
Takes a second to crash for me.
comment:9 by , 5 weeks ago
Tested
DROP TABLE IF EXISTS random_points; CREATE TABLE random_points AS SELECT ST_MakePoint(0, 0) AS geom FROM generate_series(1, 130_561); CREATE INDEX ON random_points USING brin(geom);
on:
- PG13, 3.0.12dev
- PG14 3.1.12, 3.6.0dev
- PG15 3.5.0
- PG16 3.5.0
and doesn't crash on those, but 130_561 also gives a syntax error on PG13-PG15 above. So had to change that to 130561. 130561 crashes on 17 3.5.0 too.
We know this issue only triggers with parallelism enabled, e.g doing this:
SET max_parallel_workers TO 0;
will prevent it from crashing on PG17. So I guess the trivial issue case only arises on PG17 and higher with parallelism enabled. I presuming because PG17 is the first version to support parallel builds of BRIN as noted in the feature matrix - https://www.postgresql.org/about/featurematrix/#indexing-constraints "Parallelized CREATE INDEX for BRIN indexes"
comment:11 by , 4 weeks ago
Milestone: | PostGIS Fund Me → PostGIS 3.4.5 |
---|
Even though this has been an issue since BRIN was introduced presumably, so ideally we'd want to backport to 3.1 or 3.0 I think since both options would require change in the opclass, we should probably only have it in PostGIS 3.5+