Opened 13 years ago
Closed 9 years ago
#4276 closed defect (invalid)
Memory leak when importing raster with AIG driver
| Reported by: | gdalnovice | Owned by: | warmerdam |
|---|---|---|---|
| Priority: | high | Milestone: | |
| Component: | GDAL_Raster | Version: | 1.8.0 |
| Severity: | critical | Keywords: | MacOS |
| Cc: | hobu, kyngchaos | | |
Description
Hi,
I am using GDAL from Python to do some computations. I iterate over multiple ArcGIS rasters and read them in so that I can do some very simple calculations with NumPy. I have noticed a huge memory leak while doing this. My rasters are 2-3 GB each, and at each iteration I lose 2-7 GB of memory. Eventually this crashes the computer.
I am using GDAL 1.8 and Python 2.7.1 on a Mac Pro with 64 GB of memory and 12 cores. I had to resort to MacOS's "purge" command to release the inactive memory left behind by the leak, but it would be nice not to have to free memory manually at each iteration.
I am a novice, so I have not been able to determine the exact step at which the memory is lost, but if I iterate over the same two files over and over, the leak diminishes or even disappears. So I am guessing it is related to the import step. Here is the code I am using:
```python
# Import tools to be used
import os, time
import numpy
from osgeo import gdal
from osgeo.gdalconst import GA_ReadOnly

startTime = time.time()  # missing from the original script, but used below

# Summary file shared across all countries; the original script wrote to
# `filediff` without ever opening it, so the filename here is a guess
filediff = open('diff.txt', 'w')

codes = open('countrycap.txt', 'r')
for line in codes:
    # Set up the driver used to import the files
    driver = gdal.GetDriverByName('AIG')
    driver.Register()
    pais = line.rstrip()
    filepais = open(pais + '.txt', 'w')  # original had open('+pais+'.txt','w')
    filepais.write('code,DiffWeekMean,DiffMax,DiffMin,DiffSum \n')

    # Define the rasters for which to do the computation
    file1 = '/Paths/' + pais + 'dist'
    file2 = '/paths/' + pais + 'dist'
    iso1 = gdal.Open(file1, GA_ReadOnly)
    iso2 = gdal.Open(file2, GA_ReadOnly)
    print('Opened...')
    print(file1)
    print(file2)

    # Get data about the rasters
    cols1 = iso1.RasterXSize
    rows1 = iso1.RasterYSize
    band1 = iso1.GetRasterBand(1)
    nodataval = band1.GetNoDataValue()
    proj = iso1.GetProjection()
    band2 = iso2.GetRasterBand(1)

    # Read both bands entirely into memory
    data1 = band1.ReadAsArray(0, 0, cols1, rows1)
    data2 = band2.ReadAsArray(0, 0, cols1, rows1)
    mask1 = numpy.greater_equal(data1, 0)
    data3 = numpy.choose(mask1, (nodataval, (data1 - data2) / (7 * 24)))
    mask2 = numpy.less(data1, 0)
    data4 = numpy.ma.masked_array(data3, mask=mask2)
    media = data4.mean()
    minimo = data4.min()
    maximo = data4.max()
    suma = data4.sum()

    row = str(pais) + ',' + str(media) + ',' + str(maximo) + ',' + str(minimo) + ',' + str(suma) + ' \n'
    print(row)
    filediff.write(row)
    filepais.write(row)
    filepais.close()

    # Figure out how long the script has taken so far
    endTime = time.time()
    print('The script has taken ' + str((endTime - startTime) / 60) + ' minutes')

    # Drop every reference so GDAL closes the datasets and frees its buffers
    del iso1, iso2, data1, data2, data3, data4, mask1, mask2
    del band1, band2, rows1, cols1, nodataval, proj
    del media, minimo, maximo, suma, driver
    os.system("purge")

filediff.close()
endTime = time.time()
print('The script took ' + str((endTime - startTime) / 60) + ' minutes')
```
Change History (8)
comment:1 by , 13 years ago
comment:2 by , 13 years ago
Cc: hobu added
I've tried your script (slightly edited) on Linux with a small AIG dataset, which can be found at http://trac.osgeo.org/gdal/browser/trunk/autotest/gdrivers/data/abc3x1 , and observed no leak.
So either there's something particular about your AIG dataset that is not triggered by the one I tried, or there's something very particular about the way MacOS handles memory.
One hypothesis is that MacOS caches freed memory a bit too aggressively, which is problematic if you read huge 2 GB buffers, and that this "purge" command forces it to release them. If GDAL really leaked the memory, the purge command wouldn't be able to recover it, from my understanding of what I've read about the purge command.
So perhaps you could try not to read the whole raster in a single call, but proceed line by line, for example, and see if it makes any difference.
CC'ing Howard, who has more experience with MacOS than me (mine is 0 ;-)).
comment:3 by , 13 years ago
Thanks hobu for looking at this so promptly.
Some notes on what you did.
1.) I had previously tried the same code iterating 120 times on only one file and it didn't crash. For some reason it seemed to leak memory only once and did not accumulate the leak over iterations. So it doesn't surprise me that you did not see any leak in your exercise, especially since your file is so much smaller.
2.) By deleting some of the structures I was able to decrease the amount of memory leaked, but I never got it to stop leaking when I do the ReadAsArray command. (I wonder if it is caused by the lack of a close() command in Python?)
3.) Interestingly, if I stopped the process after iterating over X files, once memory had leaked, and restarted the iteration over the same group of files it had processed previously, it leaked very little... only once it iterated over files it had not seen before (X+1, X+2, ...) would the leakage increase again.
So, as you can see, it doesn't seem to depend on the fact that I read a huge file. Still, I had a version of the script that did the same processing block by block (much slower), and I seem to remember having similar issues. Once my computer finishes some computations, I'll try that one again and see if it causes the same problems. I'll let you know.
comment:4 by , 13 years ago
Actually it was my comment (rouault), not Howard's (= hobu) ;-) I just added him to the CC list.
About 1), I think you are perhaps seeing the effect of the GDAL block cache. But it is limited to 40 MB, shared among all opened datasets, and when a dataset is closed, the blocks related to it that were in the block cache are freed. That should be insignificant.
About 3), it really sounds like an OS issue with overly aggressive file caching. If you kill the process, then it is the responsibility of the OS to make the memory it used (or even leaked) available to new processes.
I've retried your script with much larger datasets and still did not see any leak.
So I'm afraid I can't reproduce your issue, and you will have to investigate MacOS behaviour related to file-content caching.
comment:5 by , 13 years ago
Hmmm... no idea how to proceed here. In any case, if anyone else is having the same issue, they can work around it in the meantime using the "purge" command.
If someone has a guide on how to tackle this and track down the problem, let me know. I'm more than happy to help; I just don't know where to start.
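One hypothetical starting point for the "where to start" question: log the process's resident memory after each iteration and see whether it grows without bound. A sketch using only the standard library (note that `ru_maxrss` is reported in bytes on MacOS but kilobytes on Linux, and is a high-water mark, so it never decreases):

```python
import resource
import sys

def rss_mb():
    """Peak resident set size of this process, in MB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # MacOS reports bytes, Linux reports kilobytes
    return peak / (1024.0 * 1024.0) if sys.platform == 'darwin' else peak / 1024.0

before = rss_mb()
buf = bytearray(50 * 1024 * 1024)  # stand-in for one large ReadAsArray buffer
after = rss_mb()
print('iteration grew the peak RSS by about %.0f MB' % (after - before))
```

Printing `rss_mb()` at the bottom of the loop in the reported script would show whether memory climbs with every new file, as described in point 3 of comment 3.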
comment:6 by , 9 years ago
Keywords: MacOS added
Are there any MacOS developers out there who could say something about this memory leak issue? I do not know whom to CC.
comment:7 by , 9 years ago
Cc: kyngchaos added
CC'ing William Kyngesburye (aka kyngchaos) who maintains a MacOSX GDAL stack if he wants to try reproducing this old issue. Otherwise we might just close it as worksforme.
Doing it line by line, I notice that the memory leak seems to come from the ReadAsArray command.