Opened 13 years ago

Closed 10 years ago

#2648 closed defect (fixed)

GDAL reference counting is thread unsafe

Reported by: tamas Owned by: warmerdam
Priority: high Milestone:
Component: OGR_SF Version: unspecified
Severity: normal Keywords: threads refcount
Cc: hobu, tamas

Description

Currently the reference counting implementation doesn't use atomic increment and decrement operations, and therefore it may cause memory corruption issues with the garbage collected runtimes like for C#. Attached a C# example which can produce the problem quite systematically.

I think we could utilize something like the InterlockedIncrement/InterlockedDecrement? API on Windows to make these operations atomic.

Attachments (1)

OGRTest.cs (3.6 KB) - added by tamas 13 years ago.

Download all attachments as: .zip

Change History (12)

Changed 13 years ago by tamas

Attachment: OGRTest.cs added

comment:1 Changed 13 years ago by tamas

I've tried to utilize CPLMutexHolder in Reference and Dereference and it could solve the problem, however it seems to be a sub-optimal solution. Instead, we should rely on the atomic increment/decrement operations supported by the processor/architecture.

comment:2 Changed 13 years ago by Even Rouault

Tamas,

I didn't manage to make compile your example, but if I understand well what you mean, the problem would lie in the loop that creates the 5000 features ? I'm not familiar with C#, but it looks like your example doesn't make explicit use of multithreading and I can't think of a legitimate use of multithreading where 2 threads would use the same OGRFeatureDefn. So if a crash linked to non atomic operations occurs, it would mean that the C# garbage collector tries to call the destructor of the Feature() object in another thread ? Is that how it works ? I couldn't find any reference on how it works

Anyway, I've adapted your test program into the following :

using System;
using OSGeo.OGR;

public class OGRTest
{
    public static void Main(string[] args)
    {
        // Register OGR
        Ogr.RegisterAll();

        // Get ESRI Shapefile Driver
        Driver drv = Ogr.GetDriverByName("ESRI Shapefile");

        // Create datasource
        DataSource ds = drv.CreateDataSource("foo", new string[] { });

        // Create layer
        Layer layer = ds.CreateLayer("foo", null, wkbGeometryType.wkbLineString, new string[] { });

        // Add features
        for (int i = 0; i < 50000; i++)
        {
            Feature feature = new Feature(layer.GetLayerDefn());
        }
    }
}

And this works fine on Linux with mono 1.2.6. Perhaps because my system is single core, so race conditions are less likely to occur.

If the problem is specific to how C# works, I think the best would be to add the mutex mechanism in them rather than in OGR itself.

comment:3 in reply to:  2 Changed 13 years ago by tamas

Even,

Yes, the refcount decrement most likely occurs within a different thread handled by the garbage collector, ant that causes the problem. I'm afraid I cannot workaround this issue in the SWIG level itself, since many of the Reference/Dereference? operations are hadled in the core stuff implicitly which is not visible from the C# interface.

The most trivial option would be to handle this inside the core where the refcount increment/decrement is taking place. Something like the AddRef? and Release is handled by the ATL COM implementations.

Tamas

Replying to rouault:

Tamas,

I didn't manage to make compile your example, but if I understand well what you mean, the problem would lie in the loop that creates the 5000 features ? I'm not familiar with C#, but it looks like your example doesn't make explicit use of multithreading and I can't think of a legitimate use of multithreading where 2 threads would use the same OGRFeatureDefn. So if a crash linked to non atomic operations occurs, it would mean that the C# garbage collector tries to call the destructor of the Feature() object in another thread ? Is that how it works ? I couldn't find any reference on how it works

Anyway, I've adapted your test program into the following :

using System;
using OSGeo.OGR;

public class OGRTest
{
    public static void Main(string[] args)
    {
        // Register OGR
        Ogr.RegisterAll();

        // Get ESRI Shapefile Driver
        Driver drv = Ogr.GetDriverByName("ESRI Shapefile");

        // Create datasource
        DataSource ds = drv.CreateDataSource("foo", new string[] { });

        // Create layer
        Layer layer = ds.CreateLayer("foo", null, wkbGeometryType.wkbLineString, new string[] { });

        // Add features
        for (int i = 0; i < 50000; i++)
        {
            Feature feature = new Feature(layer.GetLayerDefn());
        }
    }
}

And this works fine on Linux with mono 1.2.6. Perhaps because my system is single core, so race conditions are less likely to occur.

If the problem is specific to how C# works, I think the best would be to add the mutex mechanism in them rather than in OGR itself.

comment:4 Changed 13 years ago by warmerdam

Component: defaultOGR_SF
Keywords: threads refcount added
Priority: normalhigh

I'm astonished that a race condition on a two processor system on the reference count increment/decrement operations would actually occur with any regularity with such a small number of features (5000).

I am wondering if the problem might also be related to delayed writing-back from a register copy of the reference count to the actual memory. The solution for this is normally to define the variable as volatile so the compiler won't register cache it.

comment:5 in reply to:  4 Changed 13 years ago by tamas

Replying to warmerdam:

I'm astonished that a race condition on a two processor system on the reference count increment/decrement operations would actually occur with any regularity with such a small number of features (5000).

I've tried with using locks around the refcount operations and it solved the problem.

I am wondering if the problem might also be related to delayed writing-back from a register copy of the reference count to the actual memory. The solution for this is normally to define the variable as volatile so the compiler won't register cache it.

In my understanding by declaring the refcount as 'volatile int' on it's own isn't actually safe at all. This just ensures the 2 CPU's see the same data at the same time, but it doesn't stop them at all from interleaving their reads and write operations which is the problem we are trying to avoid.

However using heavy locks around a single increment/decrement is not to convenient, instead, the atomic operations supported by the processor should be used. On Windows platform it would require to use the InterlockedIncrement/Decrement? API but I'm not sure about the GNU/Linux/OSX counterparts of that.

comment:6 Changed 13 years ago by tamas

I've uploaded my complete VS2005 test project here http://vbkto.dyndns.org:1280/tests/OGRTest2005.rar

I've included 2 versions of gdal16dev.dll here. The lock protected version provides to run the tests without crashing.

comment:7 Changed 13 years ago by Even Rouault

I've experienced the same kind of issues with Java too where garbage collecting can occur in a separate thread. The consequence was not a crash but that the reference counter of the feature defn was > 0 even when all peered objects had been finalized. I also concur with Tamas' conclusion that volatile is not sufficient in the SMP case.

I've introduced in trunk the API and implementation for thread and SMP safe atomic increments (r16027) and used it in ogr_feature.h (r16028).

As the implementation is platform specific and I could only test on Linux i386, I'd appreciate testing, particularly on Windows and MacOSX where I've used the OS services but couldn't test by myself.

comment:8 Changed 13 years ago by Even Rouault

Cc: hobu tamas added

comment:9 in reply to:  7 Changed 13 years ago by tamas

Replying to rouault:

I've introduced in trunk the API and implementation for thread and SMP safe atomic increments (r16027) and used it in ogr_feature.h (r16028).

As the implementation is platform specific and I could only test on Linux i386, I'd appreciate testing, particularly on Windows and MacOSX where I've used the OS services but couldn't test by myself.

I've tested the implementation and it seems to be working OK on Windows with a minor fix applied in r16035. Thank you for taking care of this problem, it's quite challenging to make it equally well on each supported platforms :-)

I'd support to fix this issue in the stable branch as well.

comment:10 Changed 13 years ago by Even Rouault

In r16378 I've used atomic inc/dec for spatial references associated to a geometry.

comment:11 Changed 10 years ago by Even Rouault

Milestone: 1.6.4
Resolution: fixed
Status: newclosed

I believe this is fixed now. Closing

Note: See TracTickets for help on using tickets.