Opened 3 years ago

Closed 2 years ago

#5366 closed defect (duplicate)

Access violation (Crash) with OGRCreateCoordinateTransformation

Reported by: vindoctor Owned by: warmerdam
Priority: highest Milestone:
Component: GDAL_Raster Version: svn-trunk
Severity: blocker Keywords: OGRCreateCoordinateTransformation crash mulit threaded



Thanks for fixing/backing out the changes with the raster warping! Just tested it and it works great. This new issue is a bit harder to narrowing down the fix. Note, my code worked for years, so its a recent change within gdal. Our app can open(shared) and close a vector datasource at the same time with multiple threads. If the layer in the datasource is not in the projection needed, its then projected within these threads. This crash is in an area that kind of looks like resource management/nodes to keep down memory fragmentation or the overall memory footprint? I'm hoping you guys will know what has changed recently and where to start looking, or provide me some guidance on how to get more detail to what leads up to this. This below is a copy/paste out of the vs2012's callstack window. I do notice my other threads also have this datasource open. Maybe I add some critical sections around something within my app to try to narrow down the cause (not sure what I would protected just yet to debug this). Note, its always crashing with this same callstack.

ntdll.dll!77ba2685() Unknown ntdll.dll!77ba2685() Unknown

[Frames below may be incorrect and/or missing, no symbols loaded for ntdll.dll] ntdll.dll!77ba25df() Unknown msvcr110.dll!malloc(unsigned int size=434571600) Line 91 msvcr110.dll!realloc(void * pBlock=0x19e70950, unsigned int newsize=434571600) Line 85 gdal110.dll!VSIRealloc(void * pData=0x19e70950, unsigned int nNewSize=8) Line 622 gdal110.dll!CPLRealloc(void * pData=0x19e70950, unsigned int nNewSize=8) Line 199 gdal110.dll!OGR_SRSNode::InsertChild?(OGR_SRSNode * poNew=0x25a12958, int iChild=1) Line 241 gdal110.dll!OGR_SRSNode::AddChild?(OGR_SRSNode * poNew=0x25a12958) Line 214 gdal110.dll!OGR_SRSNode::Clone() Line 357 gdal110.dll!OGR_SRSNode::Clone() Line 356 gdal110.dll!OGR_SRSNode::Clone() Line 356 gdal110.dll!OGR_SRSNode::Clone() Line 356 gdal110.dll!OGRSpatialReference::Clone() Line 532 gdal110.dll!OGRProj4CT::InitializeNoLock?(OGRSpatialReference * poSourceIn=0x25918408, OGRSpatialReference * poTargetIn=0x114ef9d0) Line 562 gdal110.dll!OGRProj4CT::Initialize(OGRSpatialReference * poSourceIn=0x25918408, OGRSpatialReference * poTargetIn=0x114ef9d0) Line 544 gdal110.dll!OGRCreateCoordinateTransformation(OGRSpatialReference * poSource=0x25918408, OGRSpatialReference * poTarget=0x114ef9d0) Line 410

Attachments (3) (5.1 KB) - added by vindoctor 3 years ago.
the example app to show how to create the issue
5366_setlocale_wrapper.patch (6.4 KB) - added by tolon 3 years ago.
patch_5366_modified_by_evenr.patch (6.9 KB) - added by Even Rouault 3 years ago.

Download all attachments as: .zip

Change History (31)

comment:1 Changed 3 years ago by vindoctor

Component: defaultOGR_SRS

comment:2 Changed 3 years ago by Even Rouault

Sorry but we cannot help you with such a report. You would need to provide a self-contained source code and a data sample that can reproduce your issue. Generally, you cannot access the same OGRDataSource object from different threads. Anything can happen if you do that. From the stack trace, it would look as if one of the OGRSpatialReference* object used in OGRCreateCoordinateTransformation() has already been destroyed.

comment:3 Changed 3 years ago by vindoctor

Thank you for the response.

Maybe things are being misunderstood. When I said each thread calls open(shared), that means each thread calls

poDS = (OGRDataSource *) OGROpenShared(fileName.c_str(),FALSE, NULL );

as well as each thread calls. OGRReleaseDataSource( poDS );

If its being said GDAL is no longer (or never has been) threadsafe where each thread opens its own shared datasource, that would be the issue. It never was an issue in past releases of GDAL and I'm pretty sure I read this is safe to do. (I'm rather seasoned with GDAL and have been coding against it for years).

I'll try to expose the issue in a test app, I was just hoping you guys would know of your recent change that might be the cause.

"it would look as if one of the OGRSpatialReference* object used in OGRCreateCoordinateTransformation() has already been destroyed. "

yeah.. I get the OGRSpatialReference from the layer and I did validate the layer (its a shapefile) does indeed have a projection( we have used this shapefile internally for years with our app). If I cannot recreate the issue with an example app I'll go back to the last trunk version of gdal we tested, and skip fwd till I can narrow down the change.. I was hoping you might know what has changed over the last few months that would effect this.

more to follow....

Cheers, V

comment:4 Changed 3 years ago by Even Rouault

Most popular drivers (like shapefile) should be indeed safe with the pattern you describe. I just tried the torturing test application (in apps subdirectory in GDAL sources, not compiled by default), test_ogrsf, in multi-threaded mode on a shapefile and it works fine. But details might matter a lot...

Changed 3 years ago by vindoctor

Attachment: added

the example app to show how to create the issue

comment:5 Changed 3 years ago by vindoctor

Example app uploaded! It runs without issue in 1 thread. 2 or more threads it crashes within a second. Not sure if it matters, but we do use an external proj4. It won't take me more than a couple minutes if you want me to use the internal proj4, but I think this is unrelated...

comment:6 Changed 3 years ago by Even Rouault

I've adapted your example app to compile and run under Linux and it works just great. I don't see anything wrong in your code. I'm not sure what you mean by internal or external proj4: there's no PROJ.4 code inside GDAL, so it is necessarily external. One thing that could made a difference is if you use proj < 4.8 or proj >= 4.8. If proj 4.8 is used, GDAL will avoid taking mutexes since proj 4.8 can be used in a thread-safe way. There was a version of proj on Windows that had an issue with mutexes too (but the symptom was a deadlock, not a crash).

comment:7 Changed 3 years ago by vindoctor

Yeah, I'm on the prep for 4.9.0 beta release, svn revision 2411 of proj4. I'll try version 4.8 stable this evening and hope this is the cause. BTW, did you notice the WMS settings commented out? I always have to modify the code as in this ticket to keep those wms cashes from getting randomly placed around my os. Its a 20 second if and IMO its a bug. I also make it so you can set the cache level. If you don't use an XML file to interface with WMS, you have no way in setting these properly. (so enough of talking about another ticket on this one)

I'll report back my findings and thanks for helping me look into this.

comment:8 Changed 3 years ago by vindoctor

Going back to proj 4.8 didn't remove the issue. Just to document, if I place a critical sections around OGRCreateCoordinateTransformation and OGRCoordinateTransformation::DestroyCT we don't hit the threading issue/crash. Not sure if that's any good info.

I also have a vs2005 project that uses gdal the same way and it was last tested with gdal's trunk revision 26053 (current trunk revision is 26901). I ran the same test, and no crash(but built with vs2005)! What I'll do is pull that revision of gdal and build with vs2012, which will validate its something that has changed(or not) between revisions 26053 and 26901.

comment:9 Changed 3 years ago by vindoctor

Just do document, the crash does not happen with a vs2012 build using gdal's trunk revision 26053. So this is good news in that its not a compiler specific issue. I'm not sure where to start looking, so I'll start with the svn log for something that looks like an area of interest.

comment:10 Changed 3 years ago by vindoctor

Hi Rouault, you still with me? I just did a build of the released version of gdal 1.10.1 with no extra extensions, and the issue exists. I'm hoping you can do an extended run on your Linux machine. This afternoon when I ran the test, it took about 10 minutes before it crashed (typical to be so random). I think Linux's OS thread model is a little more deterministic so maybe even if you add a random millisecond sleep once in a while. I also have two other threads that are not opening and closing the dataset but just creating and destroying the transforms.

What has me scratching my head is what between VSIMalloc and malloc to make the size get corrupted.

msvcr110.dll!malloc(unsigned int size=7003400) Line 91 C gdal110.dll!VSIMalloc(unsigned int nSize=10) Line 504 C++ gdal110.dll!CPLMalloc(unsigned int nSize=10) Line 132 C++ gdal110.dll!CPLCalloc(unsigned int nCount=10, unsigned int nSize=1) Line 90 C++ gdal110.dll!CSLTokenizeString2(const char * pszString=0x1074eb28, const char * pszDelimiters=0x107402c8, int nCSLTFlags=1) Line 802 C++ gdal110.dll!CSLTokenizeStringComplex(const char * pszString=0x1074eb28, const char * pszDelimiters=0x107402c8, int bHonourStrings=1, int bAllowEmptyTokens=0) Line 729 C++ gdal110.dll!OGRSpatialReference::GetAttrNode?(const char * pszNodePath=0x1074eb28) Line 428 C++ gdal110.dll!OGRSpatialReference::GetAttrNode?(const char * pszNodePath=0x1074eb28) Line 453 C++ gdal110.dll!OGRSpatialReference::GetAttrValue?(const char * pszNodeName=0x1074eb28, int iAttr=0) Line 484 C++ gdal110.dll!OGRSpatialReference::exportToProj4(char * * ppszProj4=0x02e4f650) Line 1401 C++ gdal110.dll!OGRProj4CT::InitializeNoLock?(OGRSpatialReference * poSourceIn=0x02e4f7d0, OGRSpatialReference * poTargetIn=0x00765a90) Line 656 C++ gdal110.dll!OGRProj4CT::Initialize(OGRSpatialReference * poSourceIn=0x02e4f7d0, OGRSpatialReference * poTargetIn=0x00765a90) Line 543 C++ gdal110.dll!OGRCreateCoordinateTransformation(OGRSpatialReference * poSource=0x02e4f7d0, OGRSpatialReference * poTarget=0x00765a90) Line 410 C++

comment:11 Changed 3 years ago by Even Rouault

I've run the test for longer, but still no issue. Regarding the inconsistency between the size in VSIMalloc and malloc that might be due to optimizations. To help you tracking down the problem, I would suggest that you simplify your code to the minimum possible. For example just a OGRCreateCoordinateTransformation() and OGRCoordinateTransformation::DestroyCT() in a endless loop if it is sufficient to get the crash. Then you add the critical section around OGRCreateCoordinateTransformation(). If it avoids the crash, then go to the implementation of OGRCreateCoordinateTransformation() and move the critical section around poCT->Initialize(), and so on and so on until you identify the code that has the race. I admit this might be extremely tedious, but I don't imagine a better solution.

comment:12 Changed 3 years ago by tolon

I've used V's code with my own data to generate an access violation and have investigated it a bit. The stack I get is different from the reported one, but I am assuming that is due to using different data.

It looks to me like loading the data in multiple threads this way is not a good idea at the moment. Each thread ends up modifying the locale multiple times, possibly through the use of the CPLLocaleC class (implemented in port/cpl_conv.cpp, although it appears to be checking the locale first...); these uses are described as 'forcing' the C locale and typically occur before logic that uses string data loaded from your data. There are other cases where the locale might be implicity or temporarily set in the implementation of some runtime methods, although I'm only looking at the MSVC runtime here. In addition to this, there are obviously lots of other string handling functions that read the locale, such as sprintf(). The risk here is that one thread changes the global locale whilst another is using it; this is a potential data race - see section, paragraph 5 (page 225) of the C standard - Whether it is a genuine problem of not will depend on the implementation of your compiler's runtime library.

Despite this, MSVC seems to have coped with it quite well prior to 2012. With 2012, the runtime for MSVC has a new implementation of setlocale() that doesn't appear to behave quite so well with this issue. I suspect there is a data race between the MS implementation of setlocale() and the function it delegates to, _wsetlocale(), although the standard warns us of such problems. If you want to look into this a bit more yourself, compare a 2010 version of setlocal.c with a 2012 version of setlocal.c and wsetloca.c. I've tried using _configthreadlocale(_ENABLE_PER_THREAD_LOCALE) to give each thread its own locale, which has had some mixed results. MSDN appears to claim that you can set this on the main thread and then all subsequent calls to _beginthread() and _beginthreadex() will use the same setting, but my tests and their runtime source code disagrees with this. Calling it on each thread however does improve things, although I still get some Application Verifier issues being reported and an occassional crash, usually in sprintf.

V - can you give _configthreadlocale(_ENABLE_PER_THREAD_LOCALE) a try in your threads and see if that helps?

HTH! FYI, I tested this on Visual Studio 2012, x86 on Windows 7 64-bit.

For further info on multiple threads and setlocale, see:

BTW V, in your test app, you should be calling a C runtime method to launch your thread and not CreateThread?(). I changed the code in my copy, but still got the access violations.

comment:13 Changed 3 years ago by Even Rouault


to confirm your hypothesis, what happens if you force the locale to C at the beginning of your main ? You shouldn't have crashes then since CPLLocaleC will be a no operation. An alternative is to define the GDAL_DISABLE_CPLLOCALEC configuration option/environmenet variable to YES.

comment:14 Changed 3 years ago by tolon

Hi rouault, setlocale("C") before the threads are created or at the start of the threads themselves makes no difference. GDAL_DISABLE_CPLLOCALEC=YES makes no difference. This suggests that simultaneous setlocale() calls isn't necessarily the problem here, however...

The access violations appear to correlate with corruption I think is being caused by runtime library code to delete thread locale information when no longer in use (detected via reference counting). A stack trace of one of these cases is below - note the calls to snprintf and the subsequent _updatetlocinfoEx_nolock(). Smells like a possible reference counting issue...

I managed to reproduce a very similar crash without GDAL using just setlocale() on multiple threads. I'll try doing this for some other locale-related methods too, e.g. snprintf and will report back later.

msvcr110.dll!__removelocaleref(threadlocaleinfostruct * ptloci=0x15596f48) Line 100
msvcr110.dll!_updatetlocinfoEx_nolock(threadlocaleinfostruct * * pptlocid=0x0ed8ccac, threadlocaleinfostruct * ptlocis=0x14ccff48) Line 249
msvcr110.dll!__updatetlocinfo() Line 291
msvcr110.dll!_LocaleUpdate::_LocaleUpdate(localeinfo_struct * plocinfo)
msvcr110.dll!_output_l(_iobuf * stream=0x0e9df86c, const char * format=0x54321168, localeinfo_struct * plocinfo=0x00000000, char * argptr=0x0e9df8a0) Line 1028
msvcr110.dll!_snprintf(char * string=0x0e9df8c8, unsigned int count=24, const char * format=0x54321168, ...) Line 130
gdal110.dll!OSRGetEllipsoidInfo(int nCode=7019, char * * ppszName=0x0e9df95c, double * pdfSemiMajor=0x0e9df908, double * pdfInvFlattening=0x0e9df900) Line 701
gdal110.dll!SetEPSGGeogCS(OGRSpatialReference * poSRS=0x0e9dfbd4, int nGeogCS=4619) Line 1252
gdal110.dll!SetEPSGProjCS(OGRSpatialReference * poSRS=0x0e9dfbd4, int nPCSCode=3006) Line 1423
gdal110.dll!OGRSpatialReference::importFromEPSGA(int nCode=3006) Line 2191
gdal110.dll!OGRSpatialReference::importFromEPSG(int nCode=3006) Line 2094
gdal110.dll!OGRSpatialReference::SetWellKnownGeogCS(const char * pszName=0x0029d064) Line 1696
GdalTheadIssueTest.exe!CheckThread(void * stuff=0x00000000) Line 73

comment:15 Changed 3 years ago by tolon

It looks like any simultaneous calls to setlocale are a problem, even if the code is just querying the locale, e.g. pszOldLocale(CPLStrdup(setlocale(LC_NUMERIC,NULL))) in cpl_conv.cpp. The following program crashes with the following stack trace.

#include <locale.h>
#include <process.h>
#include <Windows.h>

void setlocale_loop(void*)
        setlocale(LC_NUMERIC, NULL);

int main()
    for (int i = 0; i < 2; i++)
        _beginthread(&setlocale_loop, 0, NULL);

    // Sleep the main thread
    Sleep(60 * 1000);
msvcr110.dll!free(void * pBlock=0x00769650) Line 51
msvcr110.dll!__freetlocinfo(threadlocaleinfostruct * ptloci=0x00740000) Line 202
msvcr110.dll!_updatetlocinfoEx_nolock(threadlocaleinfostruct * * pptlocid, threadlocaleinfostruct * ptlocis) Line 250
msvcr110.dll!_wsetlocale(int _category=4, const wchar_t * _wlocale=0x00000000) Line 569
msvcr110.dll!setlocale(int _category=4, const char * _locale=0x00000004) Line 50
SetLocaleThreadSafety.exe!setlocale_loop(void * __formal=0x00000000) Line 9
msvcr110.dll!_callthreadstart() Line 255
msvcr110.dll!_threadstart(void * ptd) Line 237

comment:16 Changed 3 years ago by Even Rouault

Interseting... Could you replace the CPLLocaleC constructor and destructor implementations by really empty ones ?

comment:17 Changed 3 years ago by tolon

Ok, I removed the setlocale() calls in CPLLocaleC and no longer get the problem. I've tried a couple of 10 minute runs on both debug and release builds.

So it looks like we have an explanation, but I don't think there is an easy way to fix this. I'm guessing we can't just remove the setlocale() calls like I did... :)

I suspect that calling the wide character version of the function, _wsetlocale(), would solve the problem for MSVC 2012 (version 1700) onwards. It would mean wrapping setlocale() and doing some wide character conversion in cpl_conv.cpp, which sounds a bit messy. I don't like it , but does that sound like an option? (i.e. if it works, would it be ok to add a compiler-specific workaround?)

I've added some information to one of the existing Microsoft Connect bug reports in this area, so we'll see how they respond:

comment:18 Changed 3 years ago by Even Rouault

I'm afraid the right solution would be to have locale unaware printf()-like and use it extensively throughtout the code base. But that's a bit more complicated than that since proj.4 for example use printf()-like, so it is not a GDAL-only problem...

Changed 3 years ago by tolon

comment:19 Changed 3 years ago by tolon

I think I have a solution. The latest C standard states, "A call to the setlocale function may introduce a data race with other calls to the setlocale function or with calls to functions that are affected by the current locale. The implementation shall behave as if no library function calls the setlocale function."

The two important facts I read from this are:

1) Any call to setlocale() may introduce a data race.

2) The library should behave 'as if' setlocale() is not called by any library methods.

I changed my test application so that one thread repeatedly called setlocale() whilst many others called sprintf() and it did not crash (on my platform).

Based on this, I've produced a patch that wraps the setlocale() calls in cpl_conv.h in a critical section, which appears to solve the crash I was getting with V's test application.

As well as changing setlocale() to the new wrapper CPLsetlocle() in cpl_conv.cpp, I've also made the same change in a couple of OGR files and the geojson driver.

At the very least, this patch should be an improvement as it should prevent the simultaneous calls of setlocale(), which we are warned about in the standard.

comment:20 Changed 3 years ago by Even Rouault

Hum, are you really sure that you've tested the effects of the patch properly. I have strong doubts since HAVE_SETLOCALE is not defined neither for Linux or Windows (I guess you took inspiration from code above in the file, that, yes seems to be dead code actually...) so your implementation just return NULL without calling setlocale() at all... So yes, you "solve" the setlocale issue... And actually when I remove the #ifdef, there's a compilation error since CPLMutexHolder(&hSetLocaleMutex); is not correct C++, and should be CPLMutexHolder oHolder(&hSetLocaleMutex); instead. So I'm attaching the fixed version so that you can test it (I've reverted the change in libjson since it is external code that is imported into OGR, and we don't want to fork it. Plus the fact the code path you patched was actually not compiled fro the same reason as above). And an interesting test would be that you define at the beginning of the main() the locale to be a non C locale.

Changed 3 years ago by Even Rouault

comment:21 Changed 3 years ago by tolon

Sorry, I didn't realise that HAVE_SETLOCALE was broken. The concept of my original fix was tested before I 'tidied' it up and added the preprocessor stuff. Thanks for spotting the missing variable name; the effect was that the object was being created and immediately destroyed, I guess it was still providing some level of synchronization...

I've successfully tested the updated patch with default and non-default locales in both release and debug mode on Windows.

comment:22 Changed 3 years ago by Even Rouault

Milestone: 1.11.0
Resolution: fixed
Status: newclosed

trunk r27121 "Protect concurrent calls to setlocale() by a mutex (derived from patch by tolon, #5366)"

comment:23 Changed 3 years ago by vindoctor

Resolution: fixed
Status: closedreopened

I'm not sure if this should be reopened or a new ticket. I'm working on getting the callstack now (build with symbols). But we are still crashing on setlocale in the code block below.

if (!(ptd->_ownlocale & _PER_THREAD_LOCALE_BIT) &&

!(globallocalestatus & _GLOBAL_LOCALE_BIT)) {

if (ptloci->lc_category[_category].refcount != NULL &&

InterlockedDecrement?(ptloci->lc_category[_category].refcount) == 0) { _free_crt(ptloci->lc_category[_category].refcount);



comment:24 Changed 3 years ago by vindoctor

Just to note, I guess one big thing is that I'm on the latest trunk, 2.0, not the 1x version. So maybe we will find the fix was accidentally removed.. TBD

comment:25 Changed 2 years ago by vindoctor

Component: OGR_SRSGDAL_Raster

I've added to places we crash. with the locale setting. As a heads up in regards to how bad this is, my app is pretty much unusable right now... so hopefully we can figure this out. Note that I updated to he latest proj4 as well and I don't see where GDAL protects calls into proj4 when it comes to the locale

msvcr110.dll!malloc(unsigned int size=8781824) Line 91 C gdal200.dll!CPLMalloc(unsigned int nSize=2) Line 136 C++ gdal200.dll!CPLStrdup(const char * pszString=0x1db8d38c) Line 255 C++ gdal200.dll!CPLLocaleC::CPLLocaleC() Line 2429 C++ gdal200.dll!GDALClose(void * hDS=0x3ee5a318) Line 2603 C++


msvcr110.dll!_unlock(int locknum) Line 366 C

msvcr110.dll!setlocale(int _category=39714816, const char * _locale=0x2f7e8a23) Line 119 C msvcr110.dll!malloc(unsigned int size=39714816) Line 91 C 00000003() Unknown

msvcr110.dll!malloc(unsigned int size=51451010) Line 91 C


() C

gdal200.dll!OGRProj4CT::InitializeNoLock?(OGRSpatialReference * poSourceIn, OGRSpatialReference * poTargetIn) Line 668 C++ msvcr110.dll!getenv(const char * option=0x00000000) Line 75 C 1182f184() Unknown ntdll.dll!771efba1() Unknown KernelBase?.dll!768d0fb0() Unknown gdal200.dll!OGRProj4CT::OGRProj4CT() Line 482 C++ gdal200.dll!_whirlpool_block_mmx

() C++

kernel32.dll!764a14ad() Unknown gdal200.dll!CPLString::~CPLString() C++

So I'm not sure if anyone gets notified with this ticket being reopened. I guess if I don't see any response I'll close this ticket and open a new one. It seems to be in regards to the same issue, but with a different location on where the problem can happen. My guess is proj4 is not protected which is why the protected gdal set locale has issues...

comment:27 Changed 2 years ago by Even Rouault

vindoctor, I agree with your analysis.

As a workaround, you can perhaps try setting USE_PROJ_480_FEATURES=NO as environment variable. GDAL in that case will go back to using proj under a lock.

Ultimately we should perhaps use locale unaware formatting/scaning functions in both libraries.

comment:28 Changed 2 years ago by Even Rouault

Milestone: 2.0
Resolution: duplicate
Status: reopenedclosed

Closing that ticket. It seems this issue is now mostly alive in #5747

Note: See TracTickets for help on using tickets.