Opened 2 months ago

Closed 2 months ago

Last modified 2 months ago

#6937 closed defect (fixed)

/vsicurl/ caching causes issues in case of updates followed by read scenarios

Reported by: Even Rouault
Owned by: Even Rouault
Priority: normal
Milestone: 2.2.1
Component: default
Version: unspecified
Severity: normal
Keywords: /vsicurl/
Cc:

Description

From the mailing list

My actual problem is a bit more specific than being unable to open S3 files
after upload. The actual problem is that within the same Python session, I
can open a file off S3 with the vsis3 driver, but then if I upload a new
file that previously did not exist (using boto3), gdal does not see it as a
valid file.

What appears to be happening is that once an S3 file is read the contents
of that bucket are read into a cache, but then if a new file is uploaded
in the meantime, trying to then read that file looks in the cache and
doesn't see that file as existing and throws an error. If I recall
correctly GDAL is reading other contents of that bucket/key-prefix because
it's looking for accompanying metadata files, so is this cached in some way? It
seemed like a plausible explanation but I've been unable to find reference
to such a cache other than potentially VSI_CACHE, but setting that to FALSE
did nothing and my understanding is that it applies to specific datasets,
not bucket contents.

I've managed to replicate the problem in a very simple Python program
below. While both files are uploaded without error (you can use gdalinfo
remotely on both), the attempt to open the second file will throw:
ERROR 4: `/vsis3/pail-of-images/test2.tif' not recognized as a supported
file format.

Calling the script a second time works, because (presumably) even though it
uploads and overwrites both images again, they both exist from the
beginning.

Either this is a bug or it's intended behavior in which case there's
hopefully some way to force GDAL to reread a bucket when trying to
open a file. My current workaround is to change the behavior of my app to
upload all images first before accessing, but this seems unsatisfactory,
not to mention it wreaks havoc with my tests which don't assume such
behavior.

########################
#!/usr/bin/env python3

from osgeo import gdal
import boto3

filenames = [
    'file1.tif',
    'file2.tif'
]

bucket = 'pail-of-images'

s3 = boto3.resource('s3')
for f in filenames:
    print('Uploading %s to %s' % (f, bucket))
    s3.meta.client.upload_file(f, bucket, f)
    uri = '/vsis3/%s/%s' % (bucket, f)
    print('Opening %s' % uri)
    ds = gdal.Open(uri)
    print(ds.GetMetadata())
    ds = None
##########################

Change History (2)

comment:1 Changed 2 months ago by Even Rouault

Resolution: fixed
Status: assigned → closed

In 39223:

Add a VSICurlClearCache() function (bound to SWIG as gdal.VSICurlClearCache()) to be able to clear /vsicurl/ related caches (fixes #6937)

comment:2 Changed 2 months ago by Even Rouault

In 39224:

Add a VSICurlClearCache() function (bound to SWIG as gdal.VSICurlClearCache()) to be able to clear /vsicurl/ related caches (fixes #6937)
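
As a hedged illustration of how the new function might be used from Python, the sketch below clears the /vsicurl/ related caches before opening a file that was uploaded after an earlier read. It assumes a GDAL build that includes this change (milestone 2.2.1 per this ticket). `open_fresh` and its `opener` / `clear_cache` parameters are hypothetical names introduced here for illustration and are not part of GDAL.

```python
def open_fresh(uri, opener=None, clear_cache=None):
    """Clear the /vsicurl/ related caches, then open `uri`.

    `open_fresh` is a hypothetical helper, not a GDAL API. The optional
    `opener` and `clear_cache` callables exist only so the helper can be
    exercised without GDAL or network access; when omitted they resolve
    to gdal.Open and gdal.VSICurlClearCache.
    """
    if opener is None or clear_cache is None:
        # Imported lazily so this module still loads where GDAL is absent.
        from osgeo import gdal
        opener = opener or gdal.Open
        clear_cache = clear_cache or gdal.VSICurlClearCache

    # Drop cached state so that a file uploaded after an earlier read
    # is looked up again instead of being reported as missing.
    clear_cache()
    return opener(uri)
```

In the reproduction script from the description, replacing `ds = gdal.Open(uri)` with `ds = open_fresh(uri)` should, under these assumptions, let the second file open within the same session. Clearing the caches before every open discards all cached state and may cost extra requests, so calling `gdal.VSICurlClearCache()` once after each upload is probably enough in practice.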
