#5509 closed defect (fixed)
segfault on dlclose/static deinitialization
Reported by: | springmeyer | Owned by: | warmerdam |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | default | Version: | unspecified |
Severity: | normal | Keywords: | |
Cc: | Robert Coup |
Description
I've encountered a problem where a segfault happens at program exit after gdal has been unloaded.
Here is a testcase: https://github.com/springmeyer/gdal-atexit-crash
Generally the conditions required to trigger the problem are:
- gdal built --with-threads
- gdal is linked to a library opened with dlopen
- pthread_{get,set}specific() are used within the process by an application that is not GDAL
- dlclose is called and the library linking gdal is unloaded
The crash manifests in pthread_tsd_cleanup when threads are cleaned up.
This feels like the same root issue as: http://trac.osgeo.org/gdal/ticket/4476
Attachments (3)
Change History (21)
comment:1 by , 9 years ago
Cc: | added |
---|
comment:2 by , 9 years ago
comment:3 by , 9 years ago
I'm able to replicate a segfault at exit on Ubuntu Precise with the stock libgdal-dev on Travis: https://travis-ci.org/springmeyer/gdal-atexit-crash/jobs/26726782, so that is, I presume, with a lot of drivers. For testing on OS X I've whittled the drivers down to a very reduced set to see if that would avoid the crash, but it does not. The drivers I've reduced to on OS X are GDAL: vrt gtiff hfa mem raw and OGR: generic geojson mem kml csv gpx shape vrt openfilegdb mitab, using this patch to gdal 1.11.0: https://github.com/mapnik/mapnik-packaging/blob/master/osx/patches/gdal-1.11.0-minimal.diff.
comment:4 by , 9 years ago
I am finding that gdal-1.10.1 does not trigger this problem, at least on OS X. I've updated the testcase readme accordingly: https://github.com/springmeyer/gdal-atexit-crash/commit/417a0ab520aab4f45ad83361a8518d176a88a791
comment:5 by , 9 years ago
Hi Even,
Can you think of anything more I can or should test? Given the problem occurs with 1.11 and not 1.10, are there specific threading changes made between the releases that I could try to narrow down to? Do you have any hunch about what is wrong?
comment:6 by , 9 years ago
Dane, I spent a few hours last night examining this and managed to reduce the issue a bit more, to a point where I do not even link to GDAL, but just invoke the few TLS routines it uses. So there's no reason to think that 1.11 would be worse than 1.10. I'll keep you informed of my findings.
comment:7 by , 9 years ago
Hum I think I finally solved the issue. At least I solved one issue...
Attached are 2 files that are the simplest way I found to reproduce the issue, with no GDAL or libuv dependencies, and with one mode that crashes and one that does not. See the instructions at the top of main.cc. Also included is the corresponding patch for GDAL that should fix the issue (based on 1.10; it should hopefully apply reasonably well to more recent versions). I'll let you check whether that fixes your issue.
The issue is really complex to explain. It occurs when a thread has caused a TLS variable to be set to non-NULL, and that thread is joined by the destructor of the main program after the destructor of the plugin has dlclose()'d the library that contains the code of the callback associated with pthread_key_create(). This callback then apparently gets called after the library has been unloaded by dlclose(). The fix is to delete the pthread_key in a destructor of the library, so that no attempt is made to call the callback. OK, I know this is completely incomprehensible. Hopefully the reduced test case will be clearer, if anyone cares to understand ;-)
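The mechanism described above can be sketched in a single file, without dlopen(). This is a minimal illustration and not the attached test case or GDAL's code: `tls_cleanup`, `WorkerArgs`, `worker` and `run_worker` are hypothetical names. It relies only on the POSIX rule that a destructor registered with pthread_key_create() is no longer called at thread exit once pthread_key_delete() has been called on the key.

```cpp
#include <pthread.h>
#include <atomic>
#include <cassert>

// Hypothetical names for illustration only; this is not GDAL's code.
static std::atomic<int> g_cleanup_calls{0};

// Stand-in for the TLS cleanup callback whose code lives in the plugin.
// In the real crash this code has been unmapped by dlclose() when it is called.
static void tls_cleanup(void* value) {
    delete static_cast<int*>(value);
    g_cleanup_calls.fetch_add(1);
}

struct WorkerArgs {
    pthread_key_t key;
    bool delete_key_first;
};

static void* worker(void* p) {
    WorkerArgs* args = static_cast<WorkerArgs*>(p);
    int* value = new int(42);
    pthread_setspecific(args->key, value);  // TLS slot is now non-NULL
    if (args->delete_key_first) {
        // What a library destructor would do before its code goes away:
        // release the data itself, then delete the key so that thread
        // exit never tries to call tls_cleanup.
        delete value;
        pthread_key_delete(args->key);
    }
    return nullptr;  // thread exit runs destructors of keys that still exist
}

// Returns how many times the cleanup callback ran for one worker thread.
int run_worker(bool delete_key_first) {
    g_cleanup_calls = 0;
    pthread_key_t key;
    int rc = pthread_key_create(&key, tls_cleanup);
    assert(rc == 0);
    WorkerArgs args{key, delete_key_first};
    pthread_t thread;
    rc = pthread_create(&thread, nullptr, worker, &args);
    assert(rc == 0);
    (void)rc;
    pthread_join(thread, nullptr);
    if (!delete_key_first) pthread_key_delete(key);
    return g_cleanup_calls.load();
}
```

Calling `run_worker(false)` lets the callback run once at thread exit, which is the call that would land in unloaded code in the dlclose() scenario. Calling `run_worker(true)` deletes the key first, so the callback is never invoked.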
by , 9 years ago
Attachment: | temptative_fix_for_5509.patch added |
---|
comment:8 by , 9 years ago
WOW, incredible work reducing this so much further. I think your explanation is clear, but yes, I'm going to need to read it several more times and play with the code to learn more :)
comment:9 by , 9 years ago
Okay, traveling this weekend but had a moment to dig into this more. Nice work! I've confirmed that 1) your testcase works the same on OS X, 2) your patch, applied to gdal 1.11.0, fixes the libuv+gdal testcase I posted (gdal-atexit-crash), and 3) with modifications I can also avoid the crash in the real-world case of using Mapnik with Mapnik's gdal.input plugin from Node.js
The modification I had to make was to expose CPLFinalizeTLS in the cpl_multiproc.h header and then do this from Mapnik's gdal plugin:
`
#include "cpl_multiproc.h"
...
const int result_1 = std::atexit(CPLFinalizeTLS);
`
The reason I think this is needed is that I am statically linking libgdal.a into Mapnik's gdal.input (a shared module that is dlopen'ed by libmapnik). Due to the use of static linking, the GDALDestroy in gcore/gdaldllmain.cpp is never called.
What do you think about exposing either GDALDestroy or CPLFinalizeTLS so I can reach them from outside?
comment:10 by , 9 years ago
I definitely don't want to expose CPLFinalizeTLS. GDALDestroy() would be better.
comment:11 by , 9 years ago
comment:12 by , 9 years ago
Tried building against trunk but I'm now seeing (in the full Mapnik/Node.js testcase):
GDAL: In GDALDestroy - unloading GDAL shared library.
FATAL: CPLGetTLSList(): pthread_setspecific() failed!
Abort trap: 6
This is after adapting to trunk and doing this in Mapnik's gdal.input:
const int r = std::atexit(GDALDestroy);
Curiously the gdal-atexit-crash testcase is passing without a crash against GDAL trunk.
The backtrace for the new abort is
Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0 libsystem_kernel.dylib 0x00007fff8e421866 __pthread_kill + 10
1 libsystem_pthread.dylib 0x00007fff8316235c pthread_kill + 92
2 libsystem_c.dylib 0x00007fff8602cb1a abort + 125
3 gdal.input 0x000000010535737f CPLEmergencyError + 127
4 gdal.input 0x0000000105361aff CPLGetTLSList() + 127
5 gdal.input 0x0000000105361a6d CPLGetTLS + 13
6 gdal.input 0x0000000105357324 CPLEmergencyError + 36
7 gdal.input 0x0000000105361aff CPLGetTLSList() + 127
8 gdal.input 0x0000000105361a6d CPLGetTLS + 13
9 gdal.input 0x000000010535442d CPLGetConfigOption + 29
10 gdal.input 0x000000010532dbc7 GDALDestructor() + 23
11 dyld 0x00007fff6e094ef5 ImageLoaderMachO::doTermination(ImageLoader::LinkContext const&) + 215
12 dyld 0x00007fff6e08506d dyld::runImageTerminators(ImageLoader*) + 38
13 dyld 0x00007fff6e08798c dyld::garbageCollectImages() + 497
14 dyld 0x00007fff6e08e9d7 dlclose + 134
15 libdyld.dylib 0x00007fff8fbe179f dlclose + 61
16 libmapnik.dylib 0x00000001041c19d7 mapnik::PluginInfo::~PluginInfo() + 39
17 libmapnik.dylib 0x0000000104121fd8 boost::detail::sp_counted_impl_pd<mapnik::PluginInfo*, boost::detail::sp_ms_deleter<mapnik::PluginInfo> >::dispose() + 24
18 libmapnik.dylib 0x0000000104123167 std::_Rb_tree<std::string, std::pair<std::string const, boost::shared_ptr<mapnik::PluginInfo> >, std::_Select1st<std::pair<std::string const, boost::shared_ptr<mapnik::PluginInfo> > >, std::less<std::string>, std::allocator<std::pair<std::string const, boost::shared_ptr<mapnik::PluginInfo> > > >::_M_erase(std::_Rb_tree_node<std::pair<std::string const, boost::shared_ptr<mapnik::PluginInfo> > >*) + 135
19 libmapnik.dylib 0x000000010411fd58 mapnik::datasource_cache::~datasource_cache() + 40
20 mapnik.node 0x0000000103205775 mapnik::singleton<mapnik::datasource_cache, mapnik::CreateStatic>::DestroySingleton() + 21
21 libsystem_c.dylib 0x00007fff8602d794 __cxa_finalize + 164
22 libsystem_c.dylib 0x00007fff8602da4c exit + 22
23 node 0x000000010000745e node::Exit(v8::Arguments const&) + 53
24 node 0x00000001001757f2 v8::internal::Builtin_HandleApiCall(v8::internal::(anonymous namespace)::BuiltinArguments<(v8::internal::BuiltinExtraArguments)1>, v8::internal::Isolate*) + 482
comment:13 by , 9 years ago
Dane, I believe there's something wrong in your theory about GDAL's __attribute__((destructor)) not being used with static linking. The stack trace shows it is used... My hypothesis is that GDALDestroy() has already run via std::atexit(), and then GDALDestructor() is called; since the TLS has already been finalized, CPLGetConfigOption(), which uses it, crashes.
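If the hypothesis above is right, the teardown runs twice: once via std::atexit() and once via the library destructor. One common way to make that harmless is a run-once guard. The following is only a sketch of that general pattern, assuming GCC or Clang for `__attribute__((destructor))`. `library_destroy`, `on_unload` and `teardown_runs` are hypothetical names, and nothing here is taken from GDAL's actual implementation.

```cpp
#include <atomic>

// Hypothetical names for illustration only; this is not GDAL's code.
static std::atomic<bool> g_destroyed{false};
static int g_teardown_runs = 0;

// Stand-in for a library-wide teardown function in the style of GDALDestroy().
void library_destroy() {
    // exchange() returns the previous value, so only the first caller
    // gets past this line; later callers return immediately.
    if (g_destroyed.exchange(true)) return;
    ++g_teardown_runs;  // real code would release TLS, mutexes, etc. here
}

// Runs when the containing image is unloaded or the process exits
// (GCC/Clang extension). Safe even if library_destroy() already ran.
__attribute__((destructor)) static void on_unload() {
    library_destroy();
}

// Reports how many times the teardown body actually executed.
int teardown_runs() { return g_teardown_runs; }
```

With this guard, an application could register `library_destroy` through `std::atexit` and still let the destructor fire at unload time; the second call returns without touching state that has already been finalized.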
comment:14 by , 9 years ago
comment:15 by , 9 years ago
Even,
I'm finding that switching to a shared libgdal.dylib/so works around the FATAL: CPLGetTLSList(): pthread_setspecific() failed! problem and also fixes the overall issue completely and most cleanly, because then I don't need to make any changes in Mapnik at all. I can simply upgrade to libgdal master and the atexit crash goes away. Things work with a libgdal.dylib/so such that I can remove my manual call to std::atexit(GDALDestroy) in Mapnik's gdal.input. So I am working on this approach now and will then circle back to understanding the FATAL: CPLGetTLSList(): pthread_setspecific() failed! issue later. One small comment for now, however: I don't think the problem is that the GDALDestructor is getting called twice, because even with r27443 I still see FATAL: CPLGetTLSList(): pthread_setspecific() failed!. My hunch is that the problem is more about the order of things being called after dlclose. But let me take some time to test a bit more and I'll share more details.
comment:16 by , 8 years ago
springmeyer, before you and Even forget the heavy work that you've done, could you say if things are OK now for closing this ticket?
comment:17 by , 8 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
I don't recall the outcome, or whether it is still of interest. Closing, as keeping it open is useless.
comment:18 by , 8 years ago
Okay to close. I'll recap my understanding:
- http://trac.osgeo.org/gdal/changeset/27438 fixed the original and most important issue. Now I am able to use libuv and libgdal together without crashes at exit as long as libgdal is built as a dynamic library.
- The remaining issue was that when libgdal is built as a static library I still saw FATAL: CPLGetTLSList(): pthread_setspecific() failed! when running the node-mapnik tests. But 1) I've not investigated this deeply, 2) since it can be worked around by using a shared libgdal it is not a blocker for me, and 3) since I could not replicate this problem with Even's reduced testcase it may be hard to fix. I'll create a new issue to track this if I even think it is worth more attention.
Thanks!
For the sake of completeness, which GDAL and OGR drivers are built?