#3429 closed defect (fixed)
upgrade from 2.1.3 to 2.2+ enters infinite loop
Reported by: | strk | Owned by: | dustymugs |
---|---|---|---|
Priority: | blocker | Milestone: | PostGIS 2.2.2 |
Component: | build | Version: | 2.2.x |
Keywords: | Cc: |
Description
I still have to try at reproducing this but an
ALTER EXTENSION postgis UPDATE
resulted into an infinite loop for me.
Attaching the postgres process with gdb this is the stacktrace:
#0 init_rt_allocator (size=3488) at rt_api.c:880 #1 0x00007f2cf751d5aa in rt_raster_gdal_drivers (drv_count=drv_count@entry=0x7fffbbdd273c, cancc=cancc@entry=0 '\000') at rt_api.c:8975 #2 0x00007f2cf74f2233 in rtpg_assignHookGDALEnabledDrivers () at rt_pg.c:172 #3 _PG_init () at rt_pg.c:291 #4 0x000000000075b8a2 in internal_load_library (libname=libname@entry=0x25b7f68 "/home/postgresql-9.3.4/lib/rtpostgis-2.1.so") at dfmgr.c:284 #5 0x000000000075c2b3 in load_external_function (filename=filename@entry=0x25b7f30 "$libdir/rtpostgis-2.1", funcname=funcname@entry=0x25b7ef8 "RASTER_lib_version", signalNotFound=signalNotFound@entry=1 '\001', filehandle=filehandle@entry=0x7fffbbdd28d8) at dfmgr.c:113
The query I was running:
psql -XAtc "ALTER EXTENSION postgis UPDATE; SELECT postgis_full_version();"
The database was initialized with postgis '2.1.3'. The target library is "2.3.0dev r14582"
Attachments (1)
Change History (40)
comment:1 by , 9 years ago
comment:4 by , 9 years ago
Milestone: | PostGIS 2.3.0 → PostGIS 2.2.2 |
---|
comment:5 by , 9 years ago
Full script to reproduce:
createdb tt psql -c "create extension postgis version '2.1.3'" tt psql -c "alter extension postgis update to '2.2.1'" tt
Tested on PostgreSQL 9.3.6 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2, 64-bit
comment:6 by , 9 years ago
OK, so far have not been able to reproduce.
- On OSX 10.11, built with clang/llvm
- Tried both PgSQL 9.3 and 9.4.
- Tried 2.1.3 → 2.2.0, 2.1.3 → 2.2.1, 2.1.8 → 2.2.0, 2.1.3 → 2.2.1
- Using GDAL 2.1.0 (svn head) in all cases
comment:7 by , 9 years ago
I suspect it may have to do with conflicting symbols of the internal raster library statically linked in the two raster modules (from old and new extension) — I will try tomorrow by building with —no-export or whatever it was that fixed the liblwgeom occurrence of similar case (symptoms do look like those ones: big dragons)
comment:8 by , 9 years ago
One additional information: the 2.1.3 install is not fully clean, because the scripts refer to postgis library with "2.1" and the installed "2.1" is actually a 2.1.9dev:
POSTGIS="2.1.9dev r14472" GEOS="3.6.0dev-CAPI-1.10.0 r4138" PROJ="Rel. 4.8.0, 6 March 2012" GDAL="GDAL 1.11.1, released 2014/09/24" LIBXML="2.9.1" LIBJSON="UNKNOWN" (core procs from "2.1.3 r12547" need upgrade) RASTER (raster procs from "2.1.3 r12547" need upgrade)
Anyway, I can also reproduce upgrading from 2.1.9dev, so it's not related.
comment:10 by , 9 years ago
Commenting out the rtpg_assignHookGDALEnabledDrivers();
line in the 2.1 shared object (raster/rt_pg/rt_pg.c:291) fixes the update for me
comment:11 by , 9 years ago
Can be also reproduced with a script-based upgrade, which makes it simpler to debug:
dropdb tt createdb tt psql -f /usr/src/postgis/postgis-2.1/postgis/postgis.sql tt psql -f /usr/src/postgis/postgis-2.1/raster/rt_pg/rtpostgis.sql tt psql -f /usr/src/postgis/postgis-2.2/postgis/postgis_upgrade.sql tt psql -f /usr/src/postgis/postgis-2.2/raster/rt_pg/rtpostgis_upgrade.sql tt
comment:12 by , 9 years ago
Interesting enough, of these two lines:
lwnotice("XXXXX lwnotice 2.1"); elog(NOTICE, "XXXX elog notice 2.1");
I only get the "elog" one, which means "lwnotice" is failing to be effective. Maybe —exclude-libs=ALL is not being effective
comment:13 by , 9 years ago
Commenting out the following line in 2.1 lib fixes the issue for me:
rt_gdaldriver drv_set = rt_raster_gdal_drivers(&drv_count, 0);
comment:14 by , 9 years ago
I'm giving up for today, but I still suspect libraries conflicts. Using gcc (Ubuntu 4.8.4-2ubuntu1~14.04) 4.8.4
It'd be nice to get somebody else to reproduce this
comment:15 by , 9 years ago
We had same problem on Debian Wheezy with PostgreSQL 9.4 and Postgis 2.1 from pgdg repository.
Package updates where installed on which new postgresql-9.4-postgis-2.2 package was installed automatically.
Our DBA tried to update Postgis extension, but some strange errors occurred, like something about changing "postgis.backed" and this infinite loop mentioned in bug report.
perf top showed that postgres process is executing init_rt_allocator, rt_something_allocator functions from both postgis-2.1.so and postgis-2.2.so. Had to sigkill (sigterm, hup, int did not work). Only after removing Postgis 2.1 packages, extension update was successfull.
Problem is that I do not how to reproduces it, because if I install postgresql-9.4-postigs-2.1, 2.2 packages is also automatically installed, so I get Postgis 2.2 extension by default, if I understand correctly.
comment:16 by , 9 years ago
My suspect is with different implementations of the _same_ signature being loaded in memory at the same time (due to both old and new postgis module loaded within the same process) and runtime linker picking one for the other, triggering Really Bad Things like this.
The last time we fixed this by specifying —exlcude-libs (#3281) but I'm not sure it applies here, especially as old packages might still be exposing the methods outside.
I guess one fix could be disabling the major-upgrade check (responsible of loading the old library), at least during EXTENSION upgrade (which should not allow upgrading between major releases, anyway).
comment:17 by , 9 years ago
Another easy way to test:
cd regress ./run_test.pl --extension --upgrade --upgrade-path 2.1.3--2.3.0dev tickets.sql
comment:18 by , 9 years ago
Same infinite loop happens with:
cd regress ./run_test.pl --extension --upgrade --upgrade-path 2.1.3--2.2.2dev tickets.sql
comment:19 by , 9 years ago
Priority: | critical → blocker |
---|---|
Summary: | upgrade from 2.1.3 to 2.3.0dev enters infinite loop → upgrade from 2.1.3 to 2.2+ enters infinite loop |
comment:20 by , 9 years ago
I hit the same scenario, trying to upgrade from 2.1.8 to 2.2.1, both compiled from source, in ubuntu 14.04.
comment:21 by , 9 years ago
I've just got the infinite loop symptom on _create_extension_ (after _drop_extension_, so possibly the library was still loaded as there's no "UNLOAD" command in postgresql).
comment:22 by , 9 years ago
Stripping it down to the minimum amount of SQL, no upgrade required, assuming you have done make install
on both postgis 2.1.8 and svn trunk:
DROP DATABASE IF EXISTS pg21; CREATE DATABASE pg21; \c pg21 SELECT postgis_raster_lib_version(); CREATE EXTENSION postgis VERSION '2.1.8'; CREATE OR REPLACE FUNCTION bytea(raster) RETURNS bytea AS '$libdir/rtpostgis-2.3', 'RASTER_to_bytea' LANGUAGE 'c' IMMUTABLE STRICT;
Basically get the old library loaded, then bring in the new one, which causes the _PG_init() hook to run. Then things fall apart on Ubuntu (but not on OSX)
comment:23 by , 9 years ago
At r14694 I have changed svn/2.3 to do right-away setting of pgsql memory management/messaging, rather than the old complex lazy loading scheme. The ./postgis side has been doing this since forever, so moving raster to do it too is fine, IMO. This was because we were seeing the update process hang at the first call to an allocator. The 2.2/2.1 versions also still have this pattern, so potentially removing it from them as well might be required to be 100% certain that it is not the source of our pain.
comment:24 by , 9 years ago
The current problem seems to be strongly associated with the rtpg_assignHookGDALEnabledDrivers
function. In 2.0 to 2.1 it was called in _PG_init()
directly to read the environment variables and set up GDAL drivers the user wants.
If you remove calls to rtpg_assignHookGDALEnabledDrivers
in 2.3/svn, you can get a clean upgrade from 2.1, no hang. If you change the call to rtpg_assignHookGDALEnabledDrivers
in 2.3/svn from a callback in the GUC setting function to a direct call in _PG_init()
you can also get a clean upgrade from 2.1, no hang.
comment:25 by , 9 years ago
rtpg_assignHookGDALEnabledDrivers
is an interesting function in that it is in the rt_pg
side of the code base, but it also calls rt_raster_gdal_drivers
which lives in the rt_core
side of the code base. So it uses both direct calls to palloc
and calls to allocators via the ctx_t
side route. rtpg_assignHookGDALEnabledDrivers
is not called until after the rt_init_allocators
are set though, so in theory that shouldn't matter.
comment:26 by , 9 years ago
As of r14698 I can still enter the infinite loop. Attaching the looping process with gdb shows this backtrace:
Program received signal SIGINT, Interrupt. rt_set_handlers (allocator=0x7fa43f5315f0 <rt_pg_alloc>, reallocator=0x7fa43f531610 <rt_pg_realloc>, deallocator=0x7fa43f531600 <rt_pg_free>, error_handler=0x7fa43f531c00 <rt_pg_error>, info_handler=0x7fa43f531b60 <rt_pg_debug>, warning_handler=0x7fa43f531ac0 <rt_pg_notice>) at rt_context.c:161 161 ctx_t.info = info_handler; (gdb) bt #0 init_rt_allocator (size=5856) at rt_api.c:880 #1 0x00007fa43822f5fc in rt_raster_gdal_drivers (drv_count=drv_count@entry=0x7fff8425088c, cancc=cancc@entry=0 '\000') at rt_api.c:8975 #2 0x00007fa438204283 in rtpg_assignHookGDALEnabledDrivers () at rt_pg.c:172 #3 _PG_init () at rt_pg.c:291 #4 0x000000000075b8a2 in internal_load_library (libname=libname@entry=0x2324f78 "/home/postgresql-9.3.4/lib/rtpostgis-2.1.so") at dfmgr.c:284 #5 0x000000000075c2b3 in load_external_function (filename=filename@entry=0x2324f40 "$libdir/rtpostgis-2.1", funcname=funcname@entry=0x2324f08 "RASTER_lib_version", signalNotFound=signalNotFound@entry=1 '\001', filehandle=filehandle@entry=0x7fff84250a28) at dfmgr.c:113
That shows 2.1 library being loaded and its _PG_init being called. The 2.1 _PG_init in turn calls rtpg_assignHookGDALEnabledDrivers which calls rt_raster_gdal_drivers in the 2.1 rt_core library which calls init_rt_allocator in the same (2.1) library.
BUT, the interrupt is intercepted by "rt_set_handlers" coming from the _2.3_ library, because "rt_context.c" was not available in 2.1.
Now, ""init_rt_allocator" (in 2.1 pg_core) calls "rt_init_allocators".
The "rt_init_allocators" is in the "rt_pg" library so cannot be resolved at compile time because is just NOT DEFINED when the rt_core library is built. It is thus resolved at run time and ends up finding the one from "2.3" library, generating the mess.
The —exclude-libs switch doesn't help because the symbol being looked for just CANNOT BE internal.
This is my late-evening analisys.
comment:27 by , 9 years ago
So I think one possible fix here would be to NOT rely on an external function for inizializing allocators. This model (currently in use) was long abandoned in liblwgeom, substituted by an initialization function explicitly registering callbacks into the statically linked library.
But I'm afraid this would need change in the old library, which isn't feasible.
Still, if the new library moves to the new model it can stop exporting the 'rt_init_allocators' symbol, and thus the resolver should find the correct one in the old library.
comment:28 by , 9 years ago
Owner: | changed from | to
---|
As a proof of analisys validity, I tried renaming rt_init_allocators to rt_init_allocators23 and I confirm it does fix the issue.
But I'd rather do this right rather than hack a workaround. Doing it right would mean NOT relying on the exposure of client of an "rt_init_allocators" function, but rather take a function pointer from a function. It could be a pointer to an "init_allocators" function for simplificy or even better a proper context context initialization, which would also help playing some with the upcoming multi-threaded PostgreSQL.
I'm reassigning this to Bborie as the raster subsystem maintainer (feel free to unassign).
comment:29 by , 9 years ago
comment:30 by , 9 years ago
I think renaming rt_init_allocators
is completely valid, as that function name is actually a very arbitrary wrapper of the function actually doing the work, rt_set_handlers
. In fact, you'll find other rt_init_allocators
in places like raster2pgsql, it's just a convenience, so why not rename it?
Also, I see no reason why we cannot retrofit the older 2.2 and 2.1 releases to not use the lazy-loading approach to initializing allocators and do it explicitly, it's a relatively small change and everything regresses fine. Since rtpostgis is not a shared library, it's not like we have any external dependency issues on changing the way it works, even in a patch release.
comment:31 by , 9 years ago
Reverting Paul commits still fixes the upgrade for me. See proof in https://github.com/postgis/postgis/pull/98
If we pick the rename it should be:
1) Automated 2) Fully versioned
But I still think it would be better to expose a function taking a callback. As long as we cut calls from core to module, the bug would be fixed. It would be much saner to have the pgmodule (rather than the library) drive operations.
comment:32 by , 9 years ago
I agree it would be OK to backport a fix to the older releases, being the core and pg library always shipped togheter.
comment:33 by , 9 years ago
As of r14704 the upgrade is fixed for me. Tested 2.1.3—2.3.0dev, 2.2.0—2.3.0dev. Yet to test 2.1—2.2.
comment:34 by , 9 years ago
As of r14703 (2.2 branch) upgrade is fixed for me. Tested 2.1.3—2.2.2dev
Green light here to put in NEWS and call it a victory
comment:37 by , 9 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
News entries added for 2.1, 2.2.
comment:38 by , 9 years ago
Uhm, I thought nothing was changed in 2.1 — My upgrade tests from 2.0 to 2.1 were _without_ the changes. Not sure why there was no problem there.
comment:39 by , 9 years ago
There was an order-of-operations issue in 2.1 that made cleaning up the memory handler init there seem a good idea too. I'm glad upgrade works without the changes, but having 2.1 cleaned up too makes me feel better. Hate those lazy-loaded allocators.
GDAL="GDAL 1.11.1, released 2014/09/24"