#3245 closed defect (fixed)
Raster layer with ecw data in fastCGI mode can fill up memory quickly
Reported by: | aboudreault | Owned by: | warmerdam |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | GDAL Support | Version: | svn-trunk (development) |
Severity: | normal | Keywords: | raster, ecw, fastcgi |
Cc: | dmorissette, jmckenna |
Description
It seems that MapServer (or GDAL) doesn't close properly a raster data source in fastCGI mode. The close connection mode used is NORMAL. I'm using a very simple mapfile with only 1 raster layer (ECW file of ~1.2G) and MapServer memory usage grow by 25-30MB every request.
I tested GDAL 1.5.4 and 1.6.3 and currently compiling 1.7.0b1. (Will report later)
Attachments (1)
Change History (15)
by , 13 years ago
comment:1 by , 13 years ago
Version: | unspecified → svn-trunk (development) |
---|
I can also reproduce the bug using GDAL 1.7.0b1.
comment:2 by , 13 years ago
Milestone: | → 5.6.1 release |
---|
comment:3 by , 13 years ago
Cc: | removed |
---|---|
Owner: | changed from | to
This should likely be a GDAL ticket.
I have tried running my build like:
valgrind --tool=memcheck --track-fds=yes --num-callers=16 --show-reachable=yes \ --leak-check=yes --suppressions=/home/warmerda/bin/valgrind.sup --leak-resolution=high \ gdalinfo -mm downtown.ecw
and a leak summary was produced like:
==7953== LEAK SUMMARY: ==7953== definitely lost: 8,193 bytes in 1 blocks. ==7953== possibly lost: 860 bytes in 20 blocks. ==7953== still reachable: 2,916 bytes in 20 blocks. ==7953== suppressed: 1,520 bytes in 15 blocks.
So, in this situation where I did actually read over the whole 4K x 4K image I did not see a memory leak anything like you are seeing.
Could you specify what ECW SDK you are using? Do you see problems with a single shp2img or mapserv or gdalinfo -mm run with valgrind?
comment:4 by , 13 years ago
Actually, we forgot to mention that the leak doesn't show up in Valgrind. We saw that small 2kB leak as well but that's nothing compared to the MB's which are piling up on every request. We use top to watch the fcgi process and see it growing with every request.
I'm not sure which version of the ECW SDK is used, but I would guess it must be libecw_3.3-2.tar.gz since that's what I have and I got it from Alan. Alan is out of the office today so he may reply only in a few days.
comment:5 by , 13 years ago
I confirm that the version of libecw used is 3.3-2. As Daniel said, I can see a small leak (2-4k) if I run a single mapserv request with valgrind. Are you able to reproduce the bug in fastcgi mode ? (If you don't have an ready to test fastcgi environnement, I have a virtual machine, on which I could create you an account)
comment:6 by , 13 years ago
Status: | new → assigned |
---|
I have confirmed the problem with PROCESSING "CLOSE_CONNECTION=NORMAL" using "shp2img -c 150 -m downtown.map -o out.png" so the problem is not specifically related to fastcgi.
Digging in...
comment:7 by , 13 years ago
Problem also observed with this script:
from osgeo import gdal for i in range(100): ds = gdal.Open('downtown.ecw') ds.ReadRaster(0,0,1000,1000) ds = None
So the problem is not MapServer, it is GDAL or the ECW SDK. The same problem does not occur with a TIFF file as input.
comment:8 by , 13 years ago
Component: | MapServer C Library → GDAL Support |
---|---|
Milestone: | 5.6.1 release |
comment:9 by , 13 years ago
Findings so far.
1) the increase in memory as reported by top appears to only be in the virtual memory column and does not appear to have any real effect on the running process on my 64bit system. I let it run up to about 32GB of virtual memory with no apparent ill effect. I can imagine on a 32bit system things might fail at the point 2 or 4GB of virtual memory were mapped.
2) The problem seems to be the following mmap() from within pthread_create() which has no corresponding munmap(). The ECW SDK creates a thread for each file opened.
mmap(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|0x40, -1, 0) = -1 ENOMEM (Cannot allocate memory) mmap(NULL, 8392704, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaccaf4000 mprotect(0x2aaaccaf4000, 4096, PROT_NONE) = 0 clone(child_stack=0x2aaacd2f4260, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x2aaacd2f49e0, tls=0x2aaacd2f4950, child_tidptr=0x2aaacd2f49e0) = 13993
3) I have confirmed that the thread is being destroyed in a timely fashion with a call to pthread_exit(). I is unclear to me why this does not result in recovery of all the resources though I do observe that NCSThreadFreeInfo() is never called, nor is pthread_join().
4) Calling NCSecwShutdown(); after each file is closed does appear to recover the memory. I have not established why. However, this function can take several seconds due to something funky about how it waits for pending threads so it is currently not ever called by the GDAL ECWDataset implementation.
I shall continue to review what NCSecwShutdown() does and see if there is a practical way of getting the thread resources cleaned up.
comment:10 by , 13 years ago
Cc: | added |
---|
comment:11 by , 13 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
It appears the problem was that CNCSJP2FileView::sm_Views which derived from CNCSThread was calling Spawn() each time a new fileview was created, but the threadinfo corresponding threadinfo structure was never being freed until final application shutdown when the whole threadinfo table was cleared.
There were several possible ways to deal with this, but I tried to take a fairly harmless approach as seen in this patch:
*** NCSThread.cpp.orig 2010-01-06 19:54:47.000000000 -0500 --- NCSThread.cpp 2010-01-06 19:44:10.000000000 -0500 *************** *** 47,52 **** --- 47,58 ---- //void **pStartData = (void **)NCSMalloc(2 * sizeof(void*), TRUE); //pStartData[0] = (void*)this; //pStartData[1] = pData; + if( m_Thread != NCS_NULL_THREAD_ID ) + { + //fprintf( stderr, "Cleanup old threadinfo in spawn()\n" ); + FreeInfo(); + m_Thread = NCS_NULL_THREAD_ID; + } m_pData = pData; m_bRun = true;
NCSThread.cpp is in libecw2-3.3/Source/C/NCSUtil.
The above basically ensures that FreeInfo() is called if a thread is initialized and active before Spawn() will hide it by generating a new m_Thread value.
This appears to resolve the issue in limited testing.
comment:12 by , 13 years ago
I have referenced this change in http://trac.osgeo.org/gdal/wiki/ECW and GDAL ticket 3161.
comment:13 by , 13 years ago
Great! I confirm that your patch fixed the bug.
However, these was still a small leak of 4K (that we talked a few time ago). I applied the following patch to libecw to fix it:
--- libecwj2-3.3/Source/C/tinyxml/tinyxml.cpp 2006-07-02 21:15:12.000000000 -0400 +++ libecwj2-3.3_patched/Source/C/tinyxml/tinyxml.cpp 2010-01-07 08:26:26.000000000 -0500 @@ -1020,6 +1020,8 @@ if ( fread( buf, length, 1, file ) != 1 ) { //if ( fread( buf, 1, length, file ) != (size_t)length ) { + delete [] buf; + buf = 0; SetError( TIXML_ERROR_OPENING_FILE, 0, 0, TIXML_ENCODING_UNKNOWN ); fclose( file ); return false;
Thanks a lot warmerdam for your work!
comment:14 by , 13 years ago
Thanks a lot for your amazing work on this one Frank!
For the record, the GDAL ticket referencing this issue (and more) is 3162, not 3161 http://trac.osgeo.org/gdal/ticket/3162
Are you planning on sending a note to the libecw devs about this? I believe they were made aware of ticket 3162 back on 2007, but it would not hurt to refresh their memory. I could take care of it if you have some contact info you can send me, but I would think the reminder would have more chances of getting noticed if it comes from you than from me.
Thanks again!
The mapfile used to reproduce the bug