Opened 14 years ago

Closed 11 years ago

#2648 closed defect (fixed)

GDAL reference counting is thread unsafe

Reported by: tamas Owned by: warmerdam
Priority: high Milestone:
Component: OGR_SF Version: unspecified
Severity: normal Keywords: threads refcount
Cc: hobu, tamas

Description

Currently the reference counting implementation doesn't use atomic increment and decrement operations, and therefore it may cause memory corruption issues with the garbage collected runtimes like for C#. Attached a C# example which can produce the problem quite systematically.

I think we could utilize something like the InterlockedIncrement/InterlockedDecrement API on Windows to make these operations atomic.

Attachments (1)

OGRTest.cs (3.6 KB ) - added by tamas 14 years ago.

Download all attachments as: .zip

Change History (12)

by tamas, 14 years ago

Attachment: OGRTest.cs added

comment:1 by tamas, 14 years ago

I've tried to utilize CPLMutexHolder in Reference and Dereference and it could solve the problem, however it seems to be a sub-optimal solution. Instead, we should rely on the atomic increment/decrement operations supported by the processor/architecture.

comment:2 by Even Rouault, 14 years ago

Tamas,

I didn't manage to make compile your example, but if I understand well what you mean, the problem would lie in the loop that creates the 5000 features ? I'm not familiar with C#, but it looks like your example doesn't make explicit use of multithreading and I can't think of a legitimate use of multithreading where 2 threads would use the same OGRFeatureDefn. So if a crash linked to non atomic operations occurs, it would mean that the C# garbage collector tries to call the destructor of the Feature() object in another thread ? Is that how it works ? I couldn't find any reference on how it works

Anyway, I've adapted your test program into the following :

using System;
using OSGeo.OGR;

public class OGRTest
{
    public static void Main(string[] args)
    {
        // Register OGR
        Ogr.RegisterAll();

        // Get ESRI Shapefile Driver
        Driver drv = Ogr.GetDriverByName("ESRI Shapefile");

        // Create datasource
        DataSource ds = drv.CreateDataSource("foo", new string[] { });

        // Create layer
        Layer layer = ds.CreateLayer("foo", null, wkbGeometryType.wkbLineString, new string[] { });

        // Add features
        for (int i = 0; i < 50000; i++)
        {
            Feature feature = new Feature(layer.GetLayerDefn());
        }
    }
}

And this works fine on Linux with mono 1.2.6. Perhaps because my system is single core, so race conditions are less likely to occur.

If the problem is specific to how C# works, I think the best would be to add the mutex mechanism in them rather than in OGR itself.

in reply to:  2 comment:3 by tamas, 14 years ago

Even,

Yes, the refcount decrement most likely occurs within a different thread handled by the garbage collector, ant that causes the problem. I'm afraid I cannot workaround this issue in the SWIG level itself, since many of the Reference/Dereference operations are hadled in the core stuff implicitly which is not visible from the C# interface.

The most trivial option would be to handle this inside the core where the refcount increment/decrement is taking place. Something like the AddRef and Release is handled by the ATL COM implementations.

Tamas

Replying to rouault:

Tamas,

I didn't manage to make compile your example, but if I understand well what you mean, the problem would lie in the loop that creates the 5000 features ? I'm not familiar with C#, but it looks like your example doesn't make explicit use of multithreading and I can't think of a legitimate use of multithreading where 2 threads would use the same OGRFeatureDefn. So if a crash linked to non atomic operations occurs, it would mean that the C# garbage collector tries to call the destructor of the Feature() object in another thread ? Is that how it works ? I couldn't find any reference on how it works

Anyway, I've adapted your test program into the following :

using System;
using OSGeo.OGR;

public class OGRTest
{
    public static void Main(string[] args)
    {
        // Register OGR
        Ogr.RegisterAll();

        // Get ESRI Shapefile Driver
        Driver drv = Ogr.GetDriverByName("ESRI Shapefile");

        // Create datasource
        DataSource ds = drv.CreateDataSource("foo", new string[] { });

        // Create layer
        Layer layer = ds.CreateLayer("foo", null, wkbGeometryType.wkbLineString, new string[] { });

        // Add features
        for (int i = 0; i < 50000; i++)
        {
            Feature feature = new Feature(layer.GetLayerDefn());
        }
    }
}

And this works fine on Linux with mono 1.2.6. Perhaps because my system is single core, so race conditions are less likely to occur.

If the problem is specific to how C# works, I think the best would be to add the mutex mechanism in them rather than in OGR itself.

comment:4 by warmerdam, 14 years ago

Component: defaultOGR_SF
Keywords: threads refcount added
Priority: normalhigh

I'm astonished that a race condition on a two processor system on the reference count increment/decrement operations would actually occur with any regularity with such a small number of features (5000).

I am wondering if the problem might also be related to delayed writing-back from a register copy of the reference count to the actual memory. The solution for this is normally to define the variable as volatile so the compiler won't register cache it.

in reply to:  4 comment:5 by tamas, 14 years ago

Replying to warmerdam:

I'm astonished that a race condition on a two processor system on the reference count increment/decrement operations would actually occur with any regularity with such a small number of features (5000).

I've tried with using locks around the refcount operations and it solved the problem.

I am wondering if the problem might also be related to delayed writing-back from a register copy of the reference count to the actual memory. The solution for this is normally to define the variable as volatile so the compiler won't register cache it.

In my understanding by declaring the refcount as 'volatile int' on it's own isn't actually safe at all. This just ensures the 2 CPU's see the same data at the same time, but it doesn't stop them at all from interleaving their reads and write operations which is the problem we are trying to avoid.

However using heavy locks around a single increment/decrement is not to convenient, instead, the atomic operations supported by the processor should be used. On Windows platform it would require to use the InterlockedIncrement/Decrement API but I'm not sure about the GNU/Linux/OSX counterparts of that.

comment:6 by tamas, 14 years ago

I've uploaded my complete VS2005 test project here http://vbkto.dyndns.org:1280/tests/OGRTest2005.rar

I've included 2 versions of gdal16dev.dll here. The lock protected version provides to run the tests without crashing.

comment:7 by Even Rouault, 14 years ago

I've experienced the same kind of issues with Java too where garbage collecting can occur in a separate thread. The consequence was not a crash but that the reference counter of the feature defn was > 0 even when all peered objects had been finalized. I also concur with Tamas' conclusion that volatile is not sufficient in the SMP case.

I've introduced in trunk the API and implementation for thread and SMP safe atomic increments (r16027) and used it in ogr_feature.h (r16028).

As the implementation is platform specific and I could only test on Linux i386, I'd appreciate testing, particularly on Windows and MacOSX where I've used the OS services but couldn't test by myself.

comment:8 by Even Rouault, 14 years ago

Cc: hobu tamas added

in reply to:  7 comment:9 by tamas, 14 years ago

Replying to rouault:

I've introduced in trunk the API and implementation for thread and SMP safe atomic increments (r16027) and used it in ogr_feature.h (r16028).

As the implementation is platform specific and I could only test on Linux i386, I'd appreciate testing, particularly on Windows and MacOSX where I've used the OS services but couldn't test by myself.

I've tested the implementation and it seems to be working OK on Windows with a minor fix applied in r16035. Thank you for taking care of this problem, it's quite challenging to make it equally well on each supported platforms :-)

I'd support to fix this issue in the stable branch as well.

comment:10 by Even Rouault, 14 years ago

In r16378 I've used atomic inc/dec for spatial references associated to a geometry.

comment:11 by Even Rouault, 11 years ago

Milestone: 1.6.4
Resolution: fixed
Status: newclosed

I believe this is fixed now. Closing

Note: See TracTickets for help on using tickets.