Opened 21 months ago

Closed 20 months ago

Last modified 20 months ago

#3429 closed defect (fixed)

upgrade from 2.1.3 to 2.2+ enters infinite loop

Reported by: strk Owned by: dustymugs
Priority: blocker Milestone: PostGIS 2.2.2
Component: build/upgrade/install Version: 2.2.x
Keywords: Cc:

Description

I still have to try at reproducing this but an

ALTER EXTENSION postgis UPDATE

resulted into an infinite loop for me.

Attaching the postgres process with gdb this is the stacktrace:

#0  init_rt_allocator (size=3488) at rt_api.c:880
#1  0x00007f2cf751d5aa in rt_raster_gdal_drivers (drv_count=drv_count@entry=0x7fffbbdd273c, cancc=cancc@entry=0 '\000') at rt_api.c:8975
#2  0x00007f2cf74f2233 in rtpg_assignHookGDALEnabledDrivers () at rt_pg.c:172
#3  _PG_init () at rt_pg.c:291
#4  0x000000000075b8a2 in internal_load_library (libname=libname@entry=0x25b7f68 "/home/postgresql-9.3.4/lib/rtpostgis-2.1.so") at dfmgr.c:284
#5  0x000000000075c2b3 in load_external_function (filename=filename@entry=0x25b7f30 "$libdir/rtpostgis-2.1",
    funcname=funcname@entry=0x25b7ef8 "RASTER_lib_version", signalNotFound=signalNotFound@entry=1 '\001',
    filehandle=filehandle@entry=0x7fffbbdd28d8) at dfmgr.c:113

The query I was running:

psql -XAtc "ALTER EXTENSION postgis UPDATE; SELECT postgis_full_version();"

The database was initialized with postgis '2.1.3'. The target library is "2.3.0dev r14582"

Attachments (1)

analisys_proof.patch (3.9 KB) - added by strk 20 months ago.
proof of analisys

Download all attachments as: .zip

Change History (40)

comment:1 Changed 21 months ago by strk

GDAL="GDAL 1.11.1, released 2014/09/24"

comment:2 Changed 21 months ago by strk

This is fully reproducible

comment:3 Changed 21 months ago by strk

Same bug when upgrading to 2.2.1 !

comment:4 Changed 21 months ago by strk

Milestone: PostGIS 2.3.0PostGIS 2.2.2

comment:5 Changed 21 months ago by strk

Full script to reproduce:

createdb tt
psql -c "create extension postgis version '2.1.3'" tt
psql -c "alter extension postgis update to '2.2.1'" tt

Tested on PostgreSQL 9.3.6 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2, 64-bit

comment:6 Changed 21 months ago by pramsey

OK, so far have not been able to reproduce.

  • On OSX 10.11, built with clang/llvm
  • Tried both PgSQL 9.3 and 9.4.
  • Tried 2.1.3 -> 2.2.0, 2.1.3 -> 2.2.1, 2.1.8 -> 2.2.0, 2.1.3 -> 2.2.1
  • Using GDAL 2.1.0 (svn head) in all cases

comment:7 Changed 21 months ago by strk

I suspect it may have to do with conflicting symbols of the internal raster library statically linked in the two raster modules (from old and new extension) -- I will try tomorrow by building with --no-export or whatever it was that fixed the liblwgeom occurrence of similar case (symptoms do look like those ones: big dragons)

comment:8 Changed 21 months ago by strk

One additional information: the 2.1.3 install is not fully clean, because the scripts refer to postgis library with "2.1" and the installed "2.1" is actually a 2.1.9dev:

 POSTGIS="2.1.9dev r14472" GEOS="3.6.0dev-CAPI-1.10.0 r4138" PROJ="Rel. 4.8.0, 6 March 2012" GDAL="GDAL 1.11.1, released 2014/09/24" LIBXML="2.9.1" LIBJSON="UNKNOWN" (core procs from "2.1.3 r12547" need upgrade) RASTER (raster procs from "2.1.3 r12547" need upgrade)

Anyway, I can also reproduce upgrading from 2.1.9dev, so it's not related.

comment:9 Changed 21 months ago by strk

Paul which 9.3 version exactly did you test ?

comment:10 Changed 21 months ago by strk

Commenting out the rtpg_assignHookGDALEnabledDrivers(); line in the 2.1 shared object (raster/rt_pg/rt_pg.c:291) fixes the update for me

comment:11 Changed 21 months ago by strk

Can be also reproduced with a script-based upgrade, which makes it simpler to debug:

dropdb tt                                                                       
createdb tt

psql -f /usr/src/postgis/postgis-2.1/postgis/postgis.sql tt                     
psql -f /usr/src/postgis/postgis-2.1/raster/rt_pg/rtpostgis.sql tt              
                                                                                
psql -f /usr/src/postgis/postgis-2.2/postgis/postgis_upgrade.sql tt             
psql -f /usr/src/postgis/postgis-2.2/raster/rt_pg/rtpostgis_upgrade.sql tt 

comment:12 Changed 21 months ago by strk

Interesting enough, of these two lines:

  lwnotice("XXXXX lwnotice 2.1");                                               
  elog(NOTICE, "XXXX elog notice 2.1");  

I only get the "elog" one, which means "lwnotice" is failing to be effective. Maybe --exclude-libs=ALL is not being effective

comment:13 Changed 21 months ago by strk

Commenting out the following line in 2.1 lib fixes the issue for me:

    rt_gdaldriver drv_set = rt_raster_gdal_drivers(&drv_count, 0);              

comment:14 Changed 21 months ago by strk

I'm giving up for today, but I still suspect libraries conflicts. Using gcc (Ubuntu 4.8.4-2ubuntu1~14.04) 4.8.4

It'd be nice to get somebody else to reproduce this

comment:15 Changed 21 months ago by Talkless

We had same problem on Debian Wheezy with PostgreSQL 9.4 and Postgis 2.1 from pgdg repository.

Package updates where installed on which new postgresql-9.4-postgis-2.2 package was installed automatically.

Our DBA tried to update Postgis extension, but some strange errors occurred, like something about changing "postgis.backed" and this infinite loop mentioned in bug report.

perf top showed that postgres process is executing init_rt_allocator, rt_something_allocator functions from both postgis-2.1.so and postgis-2.2.so. Had to sigkill (sigterm, hup, int did not work). Only after removing Postgis 2.1 packages, extension update was successfull.

Problem is that I do not how to reproduces it, because if I install postgresql-9.4-postigs-2.1, 2.2 packages is also automatically installed, so I get Postgis 2.2 extension by default, if I understand correctly.

Last edited 21 months ago by Talkless (previous) (diff)

comment:16 Changed 21 months ago by strk

My suspect is with different implementations of the _same_ signature being loaded in memory at the same time (due to both old and new postgis module loaded within the same process) and runtime linker picking one for the other, triggering Really Bad Things like this.

The last time we fixed this by specifying --exlcude-libs (#3281) but I'm not sure it applies here, especially as old packages might still be exposing the methods outside.

I guess one fix could be disabling the major-upgrade check (responsible of loading the old library), at least during EXTENSION upgrade (which should not allow upgrading between major releases, anyway).

comment:17 Changed 21 months ago by strk

Another easy way to test:

cd regress
./run_test.pl --extension --upgrade --upgrade-path 2.1.3--2.3.0dev tickets.sql 

comment:18 Changed 21 months ago by strk

Same infinite loop happens with:

cd regress
./run_test.pl --extension --upgrade --upgrade-path 2.1.3--2.2.2dev tickets.sql

comment:19 Changed 21 months ago by strk

Priority: criticalblocker
Summary: upgrade from 2.1.3 to 2.3.0dev enters infinite loopupgrade from 2.1.3 to 2.2+ enters infinite loop

comment:20 Changed 20 months ago by dbaston

I hit the same scenario, trying to upgrade from 2.1.8 to 2.2.1, both compiled from source, in ubuntu 14.04.

comment:21 Changed 20 months ago by strk

I've just got the infinite loop symptom on _create_extension_ (after _drop_extension_, so possibly the library was still loaded as there's no "UNLOAD" command in postgresql).

comment:22 Changed 20 months ago by pramsey

Stripping it down to the minimum amount of SQL, no upgrade required, assuming you have done make install on both postgis 2.1.8 and svn trunk:

DROP DATABASE IF EXISTS pg21;
CREATE DATABASE pg21;
\c pg21
CREATE EXTENSION postgis VERSION '2.1.8';
SELECT postgis_raster_lib_version();
SELECT pg_backend_pid();
CREATE OR REPLACE FUNCTION bytea(raster)
    RETURNS bytea
    AS '$libdir/rtpostgis-2.3', 'RASTER_to_bytea'
    LANGUAGE 'c' IMMUTABLE STRICT;

Basically get the old library loaded, then bring in the new one, which causes the _PG_init() hook to run. Then things fall apart on Ubuntu (but not on OSX)

Last edited 20 months ago by pramsey (previous) (diff)

comment:23 Changed 20 months ago by pramsey

At r14694 I have changed svn/2.3 to do right-away setting of pgsql memory management/messaging, rather than the old complex lazy loading scheme. The ./postgis side has been doing this since forever, so moving raster to do it too is fine, IMO. This was because we were seeing the update process hang at the first call to an allocator. The 2.2/2.1 versions also still have this pattern, so potentially removing it from them as well might be required to be 100% certain that it is not the source of our pain.

comment:24 Changed 20 months ago by pramsey

The current problem seems to be strongly associated with the rtpg_assignHookGDALEnabledDrivers function. In 2.0 to 2.1 it was called in _PG_init() directly to read the environment variables and set up GDAL drivers the user wants.

If you remove calls to rtpg_assignHookGDALEnabledDrivers in 2.3/svn, you can get a clean upgrade from 2.1, no hang. If you change the call to rtpg_assignHookGDALEnabledDrivers in 2.3/svn from a callback in the GUC setting function to a direct call in _PG_init() you can also get a clean upgrade from 2.1, no hang.

comment:25 Changed 20 months ago by pramsey

rtpg_assignHookGDALEnabledDrivers is an interesting function in that it is in the rt_pg side of the code base, but it also calls rt_raster_gdal_drivers which lives in the rt_core side of the code base. So it uses both direct calls to palloc and calls to allocators via the ctx_t side route. rtpg_assignHookGDALEnabledDrivers is not called until after the rt_init_allocators are set though, so in theory that shouldn't matter.

comment:26 Changed 20 months ago by strk

As of r14698 I can still enter the infinite loop. Attaching the looping process with gdb shows this backtrace:

Program received signal SIGINT, Interrupt.
rt_set_handlers (allocator=0x7fa43f5315f0 <rt_pg_alloc>, reallocator=0x7fa43f531610 <rt_pg_realloc>, deallocator=0x7fa43f531600 <rt_pg_free>,
    error_handler=0x7fa43f531c00 <rt_pg_error>, info_handler=0x7fa43f531b60 <rt_pg_debug>, warning_handler=0x7fa43f531ac0 <rt_pg_notice>)
    at rt_context.c:161
161         ctx_t.info = info_handler;
(gdb) bt
#0  init_rt_allocator (size=5856) at rt_api.c:880
#1  0x00007fa43822f5fc in rt_raster_gdal_drivers (drv_count=drv_count@entry=0x7fff8425088c, cancc=cancc@entry=0 '\000') at rt_api.c:8975
#2  0x00007fa438204283 in rtpg_assignHookGDALEnabledDrivers () at rt_pg.c:172
#3  _PG_init () at rt_pg.c:291
#4  0x000000000075b8a2 in internal_load_library (libname=libname@entry=0x2324f78 "/home/postgresql-9.3.4/lib/rtpostgis-2.1.so") at dfmgr.c:284
#5  0x000000000075c2b3 in load_external_function (filename=filename@entry=0x2324f40 "$libdir/rtpostgis-2.1", 
    funcname=funcname@entry=0x2324f08 "RASTER_lib_version", signalNotFound=signalNotFound@entry=1 '\001', 
    filehandle=filehandle@entry=0x7fff84250a28) at dfmgr.c:113

That shows 2.1 library being loaded and its _PG_init being called. The 2.1 _PG_init in turn calls rtpg_assignHookGDALEnabledDrivers which calls rt_raster_gdal_drivers in the 2.1 rt_core library which calls init_rt_allocator in the same (2.1) library.

BUT, the interrupt is intercepted by "rt_set_handlers" coming from the _2.3_ library, because "rt_context.c" was not available in 2.1.

Now, ""init_rt_allocator" (in 2.1 pg_core) calls "rt_init_allocators".

The "rt_init_allocators" is in the "rt_pg" library so cannot be resolved at compile time because is just NOT DEFINED when the rt_core library is built. It is thus resolved at run time and ends up finding the one from "2.3" library, generating the mess.

The --exclude-libs switch doesn't help because the symbol being looked for just CANNOT BE internal.

This is my late-evening analisys.

comment:27 Changed 20 months ago by strk

So I think one possible fix here would be to NOT rely on an external function for inizializing allocators. This model (currently in use) was long abandoned in liblwgeom, substituted by an initialization function explicitly registering callbacks into the statically linked library.

But I'm afraid this would need change in the old library, which isn't feasible.

Still, if the new library moves to the new model it can stop exporting the 'rt_init_allocators' symbol, and thus the resolver should find the correct one in the old library.

comment:28 Changed 20 months ago by strk

Owner: changed from strk to dustymugs

As a proof of analisys validity, I tried renaming rt_init_allocators to rt_init_allocators23 and I confirm it does fix the issue.

But I'd rather do this right rather than hack a workaround. Doing it right would mean NOT relying on the exposure of client of an "rt_init_allocators" function, but rather take a function pointer from a function. It could be a pointer to an "init_allocators" function for simplificy or even better a proper context context initialization, which would also help playing some with the upcoming multi-threaded PostgreSQL.

I'm reassigning this to Bborie as the raster subsystem maintainer (feel free to unassign).

Changed 20 months ago by strk

Attachment: analisys_proof.patch added

proof of analisys

comment:29 Changed 20 months ago by strk

I had attached my proof-of-analisys patch, if dbaston wants to test it. I had _not_ tried reverting r14697 but I believe it should be fine to do. It should be also fine to revert r14694

comment:30 Changed 20 months ago by pramsey

I think renaming rt_init_allocators is completely valid, as that function name is actually a very arbitrary wrapper of the function actually doing the work, rt_set_handlers. In fact, you'll find other rt_init_allocators in places like raster2pgsql, it's just a convenience, so why not rename it?

Also, I see no reason why we cannot retrofit the older 2.2 and 2.1 releases to not use the lazy-loading approach to initializing allocators and do it explicitly, it's a relatively small change and everything regresses fine. Since rtpostgis is not a shared library, it's not like we have any external dependency issues on changing the way it works, even in a patch release.

comment:31 Changed 20 months ago by strk

Reverting Paul commits still fixes the upgrade for me. See proof in https://github.com/postgis/postgis/pull/98

If we pick the rename it should be:

1) Automated 2) Fully versioned

But I still think it would be better to expose a function taking a callback. As long as we cut calls from core to module, the bug would be fixed. It would be much saner to have the pgmodule (rather than the library) drive operations.

comment:32 Changed 20 months ago by strk

I agree it would be OK to backport a fix to the older releases, being the core and pg library always shipped togheter.

comment:33 Changed 20 months ago by strk

As of r14704 the upgrade is fixed for me. Tested 2.1.3--2.3.0dev, 2.2.0--2.3.0dev. Yet to test 2.1--2.2.

comment:34 Changed 20 months ago by strk

As of r14703 (2.2 branch) upgrade is fixed for me. Tested 2.1.3--2.2.2dev

Green light here to put in NEWS and call it a victory

comment:35 Changed 20 months ago by strk

Did anyone test 2.0--2.1, btw ?

comment:36 Changed 20 months ago by strk

I just tried 2.0.6--2.1.9dev and and it works fine.

comment:37 Changed 20 months ago by pramsey

Resolution: fixed
Status: newclosed

News entries added for 2.1, 2.2.

comment:38 Changed 20 months ago by strk

Uhm, I thought nothing was changed in 2.1 -- My upgrade tests from 2.0 to 2.1 were _without_ the changes. Not sure why there was no problem there.

comment:39 Changed 20 months ago by pramsey

There was an order-of-operations issue in 2.1 that made cleaning up the memory handler init there seem a good idea too. I'm glad upgrade works without the changes, but having 2.1 cleaned up too makes me feel better. Hate those lazy-loaded allocators.

Note: See TracTickets for help on using tickets.