Opened 10 years ago

Closed 9 years ago

Last modified 9 years ago

#5509 closed defect (fixed)

segfault on dlclose/static deinitialization

Reported by: springmeyer Owned by: warmerdam
Priority: normal Milestone:
Component: default Version: unspecified
Severity: normal Keywords:
Cc: Robert Coup

Description

I've encountered a problem where a segfault happens at program exit after gdal has been unloaded.

Here is a testcase: https://github.com/springmeyer/gdal-atexit-crash

Generally the conditions required to trigger the problem are:

  • gdal built --with-threads
  • gdal is linked to a library opened with dlopen
  • pthread_{get,set}specific() are used within the process by an application that is not GDAL
  • dlclose is called and the library linking gdal is unloaded

The crash manifests in pthread_tsd_cleanup when threads are cleaned up.

This feels like the same root issue as: http://trac.osgeo.org/gdal/ticket/4476

Attachments (3)

temptative_fix_for_5509.patch (2.9 KB ) - added by Even Rouault 10 years ago.
shared_object.cc (700 bytes ) - added by Even Rouault 10 years ago.
Simple library to reproduce the issue
main.cc (1.5 KB ) - added by Even Rouault 10 years ago.
Simple main to reproduce the issue

Change History (21)

comment:1 by Robert Coup, 10 years ago

Cc: Robert Coup added

comment:2 by Even Rouault, 10 years ago

For the sake of completeness, which GDAL and OGR drivers are built?

comment:3 by springmeyer, 10 years ago

I'm able to replicate a segfault at exit on Ubuntu Precise with the stock libgdal-dev on Travis: https://travis-ci.org/springmeyer/gdal-atexit-crash/jobs/26726782, so that is, I presume, with a lot of drivers. For testing on OS X I've whittled the drivers down to a very reduced set to see if that would avoid the crash, but it does not. The drivers I've reduced to on OS X are GDAL: vrt gtiff hfa mem raw and OGR: generic geojson mem kml csv gpx shape vrt openfilegdb mitab, using this patch to gdal 1.11.0: https://github.com/mapnik/mapnik-packaging/blob/master/osx/patches/gdal-1.11.0-minimal.diff.

comment:4 by springmeyer, 10 years ago

I am finding that gdal-1.10.1 does not trigger this problem, at least on OS X. I've updated the testcase readme accordingly: https://github.com/springmeyer/gdal-atexit-crash/commit/417a0ab520aab4f45ad83361a8518d176a88a791

comment:5 by springmeyer, 10 years ago

Hi Even,

Can you think of anything more I can or should test? Given that the problem occurs with 1.11 and not 1.10, are there specific threading changes made between the releases that I could try to narrow down? Do you have any hunch about what is wrong?

comment:6 by Even Rouault, 10 years ago

Dane, I spent a few hours last night examining this and managed to reduce the issue a bit more, to the point where I don't even link to GDAL but just invoke the few TLS routines it uses. So there's no reason to think that 1.11 would be worse than 1.10. I'll keep you informed of my findings.

comment:7 by Even Rouault, 10 years ago

Hum I think I finally solved the issue. At least I solved one issue...

Attached are 2 files that are the simplest way I found to reproduce the issue, with no GDAL or libuv dependencies, and with a crashing and a non-crashing mode. See the instructions at the top of main.cc. Also attached is the corresponding patch for GDAL that should fix the issue (based on 1.10; hopefully it applies reasonably well to more recent versions). I'll let you check whether it fixes your issue.

The issue is really complex to explain. It occurs when a thread has set a TLS variable to non-NULL and that thread is joined by the destructor of the main program after the destructor of the plugin has dlclose()'d the library containing the code of the callback registered with pthread_key_create(). Presumably the callback then gets called after the library has been unloaded by dlclose(). The fix is to delete the pthread key in a destructor of the library, so that no attempt is made to call the callback. OK, I know this is completely incomprehensible. Hopefully the reduced test case will be clearer, if anyone cares to understand ;-)
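Editorial sketch of the mechanism described in this comment. It is not the attached shared_object.cc/main.cc nor the GDAL patch, and every name in it (lib_init, lib_finalize, run_demo, ...) is hypothetical. It only illustrates the POSIX behaviour the fix relies on: once pthread_key_delete() has been called, the destructor callback registered with pthread_key_create() is no longer invoked when a thread that still holds a non-NULL value exits. For simplicity everything lives in one translation unit instead of a dlopen'ed library.

```cpp
#include <pthread.h>
#include <sched.h>
#include <atomic>
#include <cassert>
#include <cstdlib>

// All names below are hypothetical; this is not GDAL code.
static pthread_key_t g_key;
static bool g_key_valid = false;
static int g_callback_runs = 0;        // how often the TLS callback ran
static std::atomic<int> g_stage{0};    // handshake between main and worker

// Callback registered with pthread_key_create(). In the ticket's scenario
// its code lives in the dlopen'ed library, so it must never run after
// dlclose() has unmapped that library.
static void tls_value_destructor(void* value) {
    ++g_callback_runs;
    std::free(value);
}

void lib_init() {
    if (!g_key_valid) {
        const int rc = pthread_key_create(&g_key, tls_value_destructor);
        assert(rc == 0);
        (void)rc;
        g_key_valid = true;
    }
}

// The fix in a nutshell: delete the key so the callback can no longer fire.
void lib_finalize() {
    if (g_key_valid) {
        pthread_key_delete(g_key);
        g_key_valid = false;
    }
}

// In a real shared library this would run on dlclose() or at process exit.
__attribute__((destructor))
static void lib_destructor() { lib_finalize(); }

static void* worker(void*) {
    pthread_setspecific(g_key, std::malloc(16));  // TLS value is now non-NULL
    g_stage.store(1);
    while (g_stage.load() != 2) sched_yield();    // wait for main's decision
    return nullptr;                               // thread exit: TLS cleanup
}

// Returns how many times the callback ran when the worker thread exited.
int run_demo(bool delete_key_before_thread_exit) {
    g_callback_runs = 0;
    g_stage.store(0);
    lib_init();
    pthread_t thread;
    pthread_create(&thread, nullptr, worker, nullptr);
    while (g_stage.load() != 1) sched_yield();
    if (delete_key_before_thread_exit)
        lib_finalize();   // note: the worker's 16-byte value is then leaked
    g_stage.store(2);
    pthread_join(thread, nullptr);
    lib_finalize();
    return g_callback_runs;
}
```

With these assumptions, run_demo(false) lets the callback run at thread exit, which is the call that would land in unmapped code after dlclose(), while run_demo(true) suppresses it at the cost of leaking that thread's value.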

by Even Rouault, 10 years ago

by Even Rouault, 10 years ago

Attachment: shared_object.cc added

Simple library to reproduce the issue

by Even Rouault, 10 years ago

Attachment: main.cc added

Simple main to reproduce the issue

comment:8 by springmeyer, 10 years ago

WOW, incredible work reducing this so much further. I think your explanation is clear, but yes, I'm going to need to read it several more times and play with the code to learn more :)

comment:9 by springmeyer, 10 years ago

Okay, traveling this weekend but had a moment to dig into this more. Nice work! I've confirmed that 1) your testcase works the same on OS X, 2) your patch, applied to gdal 1.11.0, fixes the libuv+gdal testcase I posted (gdal-atexit-crash), and 3) with modifications I can also avoid the crash in the real-world case of using Mapnik with Mapnik's gdal.input plugin from Node.js

The modification I had to make was to expose CPLFinalizeTLS in the cpl_multiproc.h header and then do this from Mapnik's gdal plugin:

#include "cpl_multiproc.h"
...
const int result_1 = std::atexit(CPLFinalizeTLS);

The reason I think this is needed is that I am statically linking libgdal.a into Mapnik's gdal.input (a shared module that is dlopen'ed by libmapnik). Due to the use of static linking the GDALDestroy in gcore/gdaldllmain.cpp is never called.

What do you think about exposing either GDALDestroy or CPLFinalizeTLS so I can reach them from outside?

comment:10 by Even Rouault, 10 years ago

I definitely don't want to expose CPLFinalizeTLS. GDALDestroy() would be better.

comment:11 by Even Rouault, 10 years ago

trunk r27438 "Fix crashing issue with TLS finalization on Unix (#5509)"

I'm a bit hesitant to backport to 1.11 for now. Let's check if it doesn't cause issue to Windows builds

comment:12 by springmeyer, 10 years ago

Tried building against trunk, but I'm now seeing this (in the full Mapnik/Node.js testcase):

GDAL: In GDALDestroy - unloading GDAL shared library.
FATAL: CPLGetTLSList(): pthread_setspecific() failed!
Abort trap: 6

This is after adapting to trunk and doing this in Mapnik's gdal.input:

const int r = std::atexit(GDALDestroy);

Curiously the gdal-atexit-crash testcase is passing without a crash against GDAL trunk.

The backtrace for the new abort is

Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0   libsystem_kernel.dylib        	0x00007fff8e421866 __pthread_kill + 10
1   libsystem_pthread.dylib       	0x00007fff8316235c pthread_kill + 92
2   libsystem_c.dylib             	0x00007fff8602cb1a abort + 125
3   gdal.input                    	0x000000010535737f CPLEmergencyError + 127
4   gdal.input                    	0x0000000105361aff CPLGetTLSList() + 127
5   gdal.input                    	0x0000000105361a6d CPLGetTLS + 13
6   gdal.input                    	0x0000000105357324 CPLEmergencyError + 36
7   gdal.input                    	0x0000000105361aff CPLGetTLSList() + 127
8   gdal.input                    	0x0000000105361a6d CPLGetTLS + 13
9   gdal.input                    	0x000000010535442d CPLGetConfigOption + 29
10  gdal.input                    	0x000000010532dbc7 GDALDestructor() + 23
11  dyld                          	0x00007fff6e094ef5 ImageLoaderMachO::doTermination(ImageLoader::LinkContext const&) + 215
12  dyld                          	0x00007fff6e08506d dyld::runImageTerminators(ImageLoader*) + 38
13  dyld                          	0x00007fff6e08798c dyld::garbageCollectImages() + 497
14  dyld                          	0x00007fff6e08e9d7 dlclose + 134
15  libdyld.dylib                 	0x00007fff8fbe179f dlclose + 61
16  libmapnik.dylib               	0x00000001041c19d7 mapnik::PluginInfo::~PluginInfo() + 39
17  libmapnik.dylib               	0x0000000104121fd8 boost::detail::sp_counted_impl_pd<mapnik::PluginInfo*, boost::detail::sp_ms_deleter<mapnik::PluginInfo> >::dispose() + 24
18  libmapnik.dylib               	0x0000000104123167 std::_Rb_tree<std::string, std::pair<std::string const, boost::shared_ptr<mapnik::PluginInfo> >, std::_Select1st<std::pair<std::string const, boost::shared_ptr<mapnik::PluginInfo> > >, std::less<std::string>, std::allocator<std::pair<std::string const, boost::shared_ptr<mapnik::PluginInfo> > > >::_M_erase(std::_Rb_tree_node<std::pair<std::string const, boost::shared_ptr<mapnik::PluginInfo> > >*) + 135
19  libmapnik.dylib               	0x000000010411fd58 mapnik::datasource_cache::~datasource_cache() + 40
20  mapnik.node                   	0x0000000103205775 mapnik::singleton<mapnik::datasource_cache, mapnik::CreateStatic>::DestroySingleton() + 21
21  libsystem_c.dylib             	0x00007fff8602d794 __cxa_finalize + 164
22  libsystem_c.dylib             	0x00007fff8602da4c exit + 22
23  node                          	0x000000010000745e node::Exit(v8::Arguments const&) + 53
24  node                          	0x00000001001757f2 v8::internal::Builtin_HandleApiCall(v8::internal::(anonymous namespace)::BuiltinArguments<(v8::internal::BuiltinExtraArguments)1>, v8::internal::Isolate*) + 482

comment:13 by Even Rouault, 10 years ago

Dane, I believe there's something wrong in your theory that GDAL's __attribute__((destructor)) is not used with static linking. The stack trace shows it is used... My hypothesis is that GDALDestroy() has already run via the std::atexit() registration, and then GDALDestructor() is called; since the TLS has been finalized by then, CPLGetConfigOption(), which uses it, crashes.

comment:14 by Even Rouault, 10 years ago

The following commit cannot hurt... : trunk r27443 "Avoid crash in GDALDestructor() if it is called implicitely by OS mechanisms whereas GDALDestroy() has already been called (#5509)"
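Editorial sketch of the general idea behind such a guard. This is an assumption about the technique, not the content of r27443, and all names are hypothetical stand-ins: library_destroy() plays the role of an explicit call such as GDALDestroy(), and library_destructor() the role of the implicit unload hook.

```cpp
// Hypothetical stand-ins; not the actual GDAL functions or the r27443 diff.
static bool g_destroyed = false;   // set once teardown has happened
static int g_teardown_runs = 0;    // how often the real teardown ran

// Placeholder for work that must run at most once (e.g. finalizing TLS).
static void finalize_resources() {
    ++g_teardown_runs;
}

// Explicit teardown, callable by the application (e.g. via std::atexit).
void library_destroy() {
    if (g_destroyed)
        return;                    // second call is a no-op
    g_destroyed = true;
    finalize_resources();
}

// Implicit teardown, run by the loader on dlclose() or at process exit.
__attribute__((destructor))
static void library_destructor() {
    if (g_destroyed)
        return;                    // already torn down: do not touch TLS again
    library_destroy();
}

// Accessor so callers can observe how often teardown actually ran.
int teardown_runs() {
    return g_teardown_runs;
}
```

The point of the flag is that the implicit hook returns before using anything the explicit call has already finalized, so the order in which the two run no longer matters.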

comment:15 by springmeyer, 10 years ago

Even,

I'm finding that switching to a shared libgdal.dylib/so works around the FATAL: CPLGetTLSList(): pthread_setspecific() failed! problem and also fixes the overall issue completely and most cleanly because then I don't need to make any changes in Mapnik at all. I can simply upgrade to libgdal master and the atexit crash goes away. Things work with a libgdal.dylib/so such that I can remove my manual calling of std::atexit(GDALDestroy) in Mapnik gdal.input. So, I am working on this approach now and will then circle back on understanding the FATAL: CPLGetTLSList(): pthread_setspecific() failed! issue later. One small comment however now is that I don't think the problem is that the GDALDestructor is getting called twice because even with r27443 I still see FATAL: CPLGetTLSList(): pthread_setspecific() failed!. My hunch is that the problem is more about the order of things being called after dlclose. But, let me take some time to test a bit more and I'll share more details.

comment:16 by Jukka Rahkonen, 9 years ago

springmeyer, before you and Even forget the heavy work that you've done, could you say if things are OK now for closing this ticket?

comment:17 by Even Rouault, 9 years ago

Resolution: fixed
Status: new → closed

I don't recall the outcome, and if it is still of interest. Closing as keeping it open is useless.

comment:18 by springmeyer, 9 years ago

Okay to close. I'll recap my understanding:

  • http://trac.osgeo.org/gdal/changeset/27438 fixed the original and most important issue. Now I am able to use libuv and libgdal together without crashes at exit as long as libgdal is built as a dynamic library.
  • The remaining issue was that when libgdal is built as a static library I still saw FATAL: CPLGetTLSList(): pthread_setspecific() failed! when running the node-mapnik tests. But 1) I've not investigated this deeply, 2) since it can be worked around by using a shared libgdal it is not a blocker for me, and 3) since I could not replicate this problem with Even's reduced testcase it may be hard to fix. I'll create a new issue to track this if I even think it is worth more attention.

Thanks!
