Opened 14 years ago

Closed 14 years ago

Last modified 14 years ago

#3245 closed defect (fixed)

Raster layer with ecw data in fastCGI mode can fill up memory quickly

Reported by: aboudreault Owned by: warmerdam
Priority: normal Milestone:
Component: GDAL Support Version: svn-trunk (development)
Severity: normal Keywords: raster, ecw, fastcgi
Cc: dmorissette, jmckenna

Description

It seems that MapServer (or GDAL) doesn't close properly a raster data source in fastCGI mode. The close connection mode used is NORMAL. I'm using a very simple mapfile with only 1 raster layer (ECW file of ~1.2G) and MapServer memory usage grow by 25-30MB every request.

I tested GDAL 1.5.4 and 1.6.3 and currently compiling 1.7.0b1. (Will report later)

Attachments (1)

test.map (614 bytes ) - added by aboudreault 14 years ago.
The mapfile used to reproduce the bug

Download all attachments as: .zip

Change History (15)

by aboudreault, 14 years ago

Attachment: test.map added

The mapfile used to reproduce the bug

comment:1 by aboudreault, 14 years ago

Version: unspecifiedsvn-trunk (development)

I can also reproduce the bug using GDAL 1.7.0b1.

comment:2 by dmorissette, 14 years ago

Milestone: 5.6.1 release

comment:3 by warmerdam, 14 years ago

Cc: warmerdam removed
Owner: changed from aboudreault to warmerdam

This should likely be a GDAL ticket.

I have tried running my build like:

valgrind --tool=memcheck --track-fds=yes --num-callers=16 --show-reachable=yes \
         --leak-check=yes --suppressions=/home/warmerda/bin/valgrind.sup --leak-resolution=high \
   gdalinfo -mm downtown.ecw

and a leak summary was produced like:

==7953== LEAK SUMMARY:
==7953==    definitely lost: 8,193 bytes in 1 blocks.
==7953==      possibly lost: 860 bytes in 20 blocks.
==7953==    still reachable: 2,916 bytes in 20 blocks.
==7953==         suppressed: 1,520 bytes in 15 blocks.

So, in this situation where I did actually read over the whole 4K x 4K image I did not see a memory leak anything like you are seeing.

Could you specify what ECW SDK you are using? Do you see problems with a single shp2img or mapserv or gdalinfo -mm run with valgrind?

comment:4 by dmorissette, 14 years ago

Actually, we forgot to mention that the leak doesn't show up in Valgrind. We saw that small 2kB leak as well but that's nothing compared to the MB's which are piling up on every request. We use top to watch the fcgi process and see it growing with every request.

I'm not sure which version of the ECW SDK is used, but I would guess it must be libecw_3.3-2.tar.gz since that's what I have and I got it from Alan. Alan is out of the office today so he may reply only in a few days.

comment:5 by aboudreault, 14 years ago

I confirm that the version of libecw used is 3.3-2. As Daniel said, I can see a small leak (2-4k) if I run a single mapserv request with valgrind. Are you able to reproduce the bug in fastcgi mode ? (If you don't have an ready to test fastcgi environnement, I have a virtual machine, on which I could create you an account)

comment:6 by warmerdam, 14 years ago

Status: newassigned

I have confirmed the problem with PROCESSING "CLOSE_CONNECTION=NORMAL" using "shp2img -c 150 -m downtown.map -o out.png" so the problem is not specifically related to fastcgi.

Digging in...

comment:7 by warmerdam, 14 years ago

Problem also observed with this script:

from osgeo import gdal

for i in range(100):
    ds = gdal.Open('downtown.ecw')
    ds.ReadRaster(0,0,1000,1000)
    ds = None

So the problem is not MapServer, it is GDAL or the ECW SDK. The same problem does not occur with a TIFF file as input.

comment:8 by warmerdam, 14 years ago

Component: MapServer C LibraryGDAL Support
Milestone: 5.6.1 release

comment:9 by warmerdam, 14 years ago

Findings so far.

1) the increase in memory as reported by top appears to only be in the virtual memory column and does not appear to have any real effect on the running process on my 64bit system. I let it run up to about 32GB of virtual memory with no apparent ill effect. I can imagine on a 32bit system things might fail at the point 2 or 4GB of virtual memory were mapped.

2) The problem seems to be the following mmap() from within pthread_create() which has no corresponding munmap(). The ECW SDK creates a thread for each file opened.

 mmap(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|0x40, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 8392704, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaccaf4000
mprotect(0x2aaaccaf4000, 4096, PROT_NONE) = 0
clone(child_stack=0x2aaacd2f4260, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x2aaacd2f49e0, tls=0x2aaacd2f4950, child_tidptr=0x2aaacd2f49e0) = 13993

3) I have confirmed that the thread is being destroyed in a timely fashion with a call to pthread_exit(). I is unclear to me why this does not result in recovery of all the resources though I do observe that NCSThreadFreeInfo() is never called, nor is pthread_join().

4) Calling NCSecwShutdown(); after each file is closed does appear to recover the memory. I have not established why. However, this function can take several seconds due to something funky about how it waits for pending threads so it is currently not ever called by the GDAL ECWDataset implementation.

I shall continue to review what NCSecwShutdown() does and see if there is a practical way of getting the thread resources cleaned up.

comment:10 by jmckenna, 14 years ago

Cc: jmckenna added

comment:11 by warmerdam, 14 years ago

Resolution: fixed
Status: assignedclosed

It appears the problem was that CNCSJP2FileView::sm_Views which derived from CNCSThread was calling Spawn() each time a new fileview was created, but the threadinfo corresponding threadinfo structure was never being freed until final application shutdown when the whole threadinfo table was cleared.

There were several possible ways to deal with this, but I tried to take a fairly harmless approach as seen in this patch:

*** NCSThread.cpp.orig  2010-01-06 19:54:47.000000000 -0500
--- NCSThread.cpp       2010-01-06 19:44:10.000000000 -0500
***************
*** 47,52 ****
--- 47,58 ----
        //void **pStartData = (void **)NCSMalloc(2 * sizeof(void*), TRUE);
        //pStartData[0] = (void*)this;
        //pStartData[1] = pData;
+         if( m_Thread != NCS_NULL_THREAD_ID )
+         {
+             //fprintf( stderr, "Cleanup old threadinfo in spawn()\n" );
+             FreeInfo();
+             m_Thread = NCS_NULL_THREAD_ID;
+         }
  
        m_pData = pData;
        m_bRun = true;

NCSThread.cpp is in libecw2-3.3/Source/C/NCSUtil.

The above basically ensures that FreeInfo() is called if a thread is initialized and active before Spawn() will hide it by generating a new m_Thread value.

This appears to resolve the issue in limited testing.

comment:12 by warmerdam, 14 years ago

I have referenced this change in http://trac.osgeo.org/gdal/wiki/ECW and GDAL ticket 3161.

comment:13 by aboudreault, 14 years ago

Great! I confirm that your patch fixed the bug.

However, these was still a small leak of 4K (that we talked a few time ago). I applied the following patch to libecw to fix it:

--- libecwj2-3.3/Source/C/tinyxml/tinyxml.cpp   2006-07-02 21:15:12.000000000 -0400
+++ libecwj2-3.3_patched/Source/C/tinyxml/tinyxml.cpp      2010-01-07 08:26:26.000000000 -0500
@@ -1020,6 +1020,8 @@

                if ( fread( buf, length, 1, file ) != 1 ) {
                //if ( fread( buf, 1, length, file ) != (size_t)length ) {
+                       delete [] buf;
+                       buf = 0;
                        SetError( TIXML_ERROR_OPENING_FILE, 0, 0, TIXML_ENCODING_UNKNOWN );
                        fclose( file );
                        return false;

Thanks a lot warmerdam for your work!

comment:14 by dmorissette, 14 years ago

Thanks a lot for your amazing work on this one Frank!

For the record, the GDAL ticket referencing this issue (and more) is 3162, not 3161 http://trac.osgeo.org/gdal/ticket/3162

Are you planning on sending a note to the libecw devs about this? I believe they were made aware of ticket 3162 back on 2007, but it would not hurt to refresh their memory. I could take care of it if you have some contact info you can send me, but I would think the reminder would have more chances of getting noticed if it comes from you than from me.

Thanks again!

Note: See TracTickets for help on using tickets.