Opened 10 years ago

Closed 10 years ago

#5324 closed defect (invalid)

/vsicurl/ does not recognise format of some (.jpg) files when VSI_CACHE=TRUE

Reported by: torsti
Owned by: warmerdam
Priority: normal
Milestone:
Component: default
Version: 1.10.1
Severity: normal
Keywords: vsicurl
Cc:

Description

GDAL reports that a .jpg file is not recognised when it is accessed through /vsicurl/ with VSI_CACHE set to TRUE. If VSI_CACHE is unset or set to FALSE, the file opens just fine.

The error when you try to access the image with gdalinfo:

VSI_CACHE=ON gdalinfo /vsicurl/http://vanhatpainetutkartat.maanmittauslaitos.fi/download.php?file=21_Peruskartta_20k/2/2042/204210/204210_1987.jpg
ERROR 4: `/vsicurl/http://vanhatpainetutkartat.maanmittauslaitos.fi/download.php?file=21_Peruskartta_20k/2/2042/204210/204210_1987.jpg' not recognised as a supported file format.

I haven't tried with other formats or accessing OGR formats with /vsicurl/.

Platform: Debian amd64 jessie/sid

Change History (5)

comment:1 by Jukka Rahkonen, 10 years ago

Hi,

I believe the reason for this behaviour may be that the URL does not point directly to a JPEG file but to a web service, http://vanhatpainetutkartat.maanmittauslaitos.fi/download.php?, that takes the file name as a parameter. This request, which points to a real physical JPEG file, works for me with VSI_CACHE=TRUE:

/vsicurl/http://latuviitta.org/documents/nodata_alpha.jpg

It is usually good to discuss this kind of observation on the gdal-dev mailing list before creating a ticket.

-Jukka Rahkonen-

comment:2 by torsti, 10 years ago

Summary: changed from "/vsicurl/ does not recognise .jpg file when VSI_CACHE=TRUE" to "/vsicurl/ does not recognise format of some (.jpg) files when VSI_CACHE=TRUE"

My mistakes: being lazy (testing only one source of .jpg files) and assuming the same kind of cross-posting to -dev as on trac.osgeo.org/grass. Sorry for that. But now that I have already opened a ticket, I'll keep posting here.

With CPL_DEBUG and CPL_CURL_VERBOSE turned on, the immediate difference seems to be that the GetFileList response from the PHP service is not understood or not served properly. Still, with VSI_CACHE off the .jpg is downloaded, the related files (ovr, aux.xml etc.) are tried even though they are not available, and after that everything works. With the cache on, CPL_DEBUG prints the following after VSICURL: GetFileList:

GDAL_netCDF: 
=====
Open() filename=/vsicurl/http://vanhatpainetutkartat.maanmittauslaitos.fi/download.php?file=21_Peruskartta_20k/2/2041/204112/204112_1957.jpg

and then you get the ERROR 4 shown above. I don't really understand how it gets from VSICachedFile() to the netCDF driver. It seems unlikely that the availability of the file list would change which driver is tried depending on whether the cache is used, so it must be something else.

comment:3 by Jukka Rahkonen, 10 years ago

I can see that GDAL first sends HEAD http://vanhatpainetutkartat.maanmittauslaitos.fi/download.php?file=21_Peruskartta_20k/2/2042/204210/204210_1987.jpg and then tries to read the contents of the directory with GET http://vanhatpainetutkartat.maanmittauslaitos.fi/download.php?file=21_Peruskartta_20k/2/2042/204210/ HTTP/1.1

The latter does not get a directory list because the PHP application does not return one. With VSI_CACHE=TRUE GDAL then just quits, but with VSI_CACHE=FALSE it

  • tries to read Range: bytes=0-16383 of the jpg image but gets it all because php does not allow range access
  • tries with HEAD+GET requests to read possible auxiliary files
  • reads the jpg image again with Range: bytes=18446744073709535232-18446744073709551615

and gets it all for the second time.

I do not know what should happen with VSI_CACHE=TRUE, or why the whole jpg image gets read twice with VSI_CACHE=FALSE. I am only reporting what I observe.

-Jukka-
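A note on the last Range header listed above: its end offset, 18446744073709551615, equals 2^64 - 1, and its start offset equals 2^64 - 16384, so the request spans exactly one 16384-byte block, the same size as the first request. The short Python sketch below only checks that arithmetic. Reading the values as the offsets -16384 and -1 wrapped into unsigned 64-bit integers is an assumption about how they arose, not something the ticket confirms, and the helper name as_signed64 is made up for illustration.

```python
# Offsets copied from the Range header quoted in the comment above.
start = 18446744073709535232
end = 18446744073709551615

UINT64_MODULUS = 2 ** 64
CHUNK = 16384  # size of the first request, bytes=0-16383

# The range covers exactly one 16384-byte chunk ending at the largest uint64 value.
assert end == UINT64_MODULUS - 1
assert start == UINT64_MODULUS - CHUNK
assert end - start + 1 == CHUNK


def as_signed64(value: int) -> int:
    """Reinterpret an unsigned 64-bit value as a signed one (hypothetical helper)."""
    return value - UINT64_MODULUS if value >= 2 ** 63 else value


# Assumption: the offsets came from signed arithmetic that went negative.
print(as_signed64(start), as_signed64(end))  # prints: -16384 -1
```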

comment:4 by torsti, 10 years ago

Actually the PHP script does return some form of directory list (at least if you GET it from a browser; I haven't tried curl). It just serves it with a wrong content type or something similar. But that is of course the server admin's headache, not a GDAL/curl problem.

If vsicurl doesn't get a file list, it checks that the file actually exists. That, together with the lack of range access, might account for the file being read twice.

The HEAD+GET requests for the other files are probably coming from the JPEG driver.

I don't think the VSI cache is expected to affect whether vsicurl can initially get the file/resource or which GDAL drivers are used, so this might well be a defect. I just can't find any difference between what GDAL does with http://latuviitta.org/documents/nodata_alpha.jpg and with /vsicurl/http://vanhatpainetutkartat.maanmittauslaitos.fi/download.php?file=21_Peruskartta_20k/2/2042/204210/204210_1987.jpg, except the lack of proper Range responses and of a directory listing.

-Torsti
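The difference between the two servers discussed above (Range responses and a known file size) can be made concrete with a small Python sketch. It classifies the answer to a GET that carried "Range: bytes=0-16383". The function name describe_range_response and the header values in the usage lines are hypothetical, and this is not a description of what GDAL does internally; it only applies the usual HTTP convention that an honoured range is answered with 206 and a Content-Range header.

```python
from typing import Mapping, Optional


def describe_range_response(status: int, headers: Mapping[str, str]) -> dict:
    """Summarise how a server answered a GET that carried 'Range: bytes=0-16383'."""
    # Header names are case-insensitive, so normalise them first.
    lower = {name.lower(): value.strip() for name, value in headers.items()}

    # 206 Partial Content plus a Content-Range header means the range was honoured.
    honoured = status == 206 and "content-range" in lower

    # Total size: taken from 'Content-Range: bytes 0-16383/<total>' when honoured,
    # otherwise from the Content-Length of the full body, if the server sent one.
    total: Optional[int] = None
    if honoured:
        _, _, size = lower["content-range"].rpartition("/")
        total = int(size) if size.isdigit() else None
    elif lower.get("content-length", "").isdigit():
        total = int(lower["content-length"])

    return {"range_honoured": honoured, "total_size": total}


# Invented example responses: one server honours the range, the other ignores it
# and reports no size at all.
print(describe_range_response(206, {"Content-Range": "bytes 0-16383/523112"}))
print(describe_range_response(200, {"Content-Type": "image/jpeg"}))
```

A response of the second kind gives the client neither partial reads nor a file size.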

comment:5 by Even Rouault, 10 years ago

Resolution: invalid
Status: new → closed

  • The main reason for the observed behaviour is that the server doesn't support range downloading, so /vsicurl/ will never work as expected
  • The reason VSI_CACHE=YES doesn't work is that the VSI cache layer needs to know the file size, and the server cannot return the file size
  • The file is read twice with /vsicurl/ since there's one opening of the file to read the first 1024 bytes (so the whole file because of lack of range downloading capability), and then a second opening when the JPEG driver has recognized the file header.
  • /vsicurl_streaming/ is a better choice since the JPEG format can be read in streaming mode

--> I don't think any change in the code is needed. /vsicurl/ is already too complex to try to make it work with servers it is not designed to work with.
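As a minimal sketch of the /vsicurl_streaming/ suggestion, the Python snippet below rewrites a /vsicurl/ path into its /vsicurl_streaming/ form. The helper name to_streaming_path is hypothetical, and whether this particular server then works with GDAL has not been verified here.

```python
VSICURL = "/vsicurl/"
VSICURL_STREAMING = "/vsicurl_streaming/"


def to_streaming_path(path: str) -> str:
    """Swap a leading /vsicurl/ prefix for /vsicurl_streaming/; leave other paths alone."""
    if path.startswith(VSICURL):
        return VSICURL_STREAMING + path[len(VSICURL):]
    return path


# The path from the ticket description, split only to keep the line short.
url = ("/vsicurl/http://vanhatpainetutkartat.maanmittauslaitos.fi/"
       "download.php?file=21_Peruskartta_20k/2/2042/204210/204210_1987.jpg")
print(to_streaming_path(url))
```

The printed path can be passed to gdalinfo in place of the original /vsicurl/ path.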
