Opened 13 years ago

Closed 11 years ago

Last modified 11 years ago

#1369 closed enhancement (fixed)

read gzipped files in situ (e.g. raster.tif.gz)

Reported by: matt.wilkie@… Owned by: warmerdam
Priority: low Milestone: 1.6.0
Component: GDAL_Raster Version: 1.4.0
Severity: minor Keywords: zip gzip
Cc:

Description (last modified by warmerdam)

I've run into a fair amount of raster imagery, OnEarth for example, that is distributed gzipped rather than with internal compression. It would be very convenient if GDAL could read gzipped files in situ. Having to decompress the files to a scratch space first means keeping a lot of extra room around to manoeuvre. In other applications that do read .gz files, such as VTP, it also seems to be a lot faster to read/write a .gz than an uncompressed file, perhaps because of less disk use (?).

Attachments (3)

gdal_svn_trunk_gzip.patch (28.1 KB) - added by Even Rouault 12 years ago.
Add capability of reading .gz files transparently from GDAL drivers
gdal_svn_trunk_gzip_faster_random_seek.patch (29.7 KB) - added by Even Rouault 12 years ago.
gdal_svn_trunk_gzip_and_zip.patch (124.1 KB) - added by Even Rouault 12 years ago.


Change History (13)

comment:1 Changed 13 years ago by warmerdam

A special file handler something like the one used for in-memory files could
be implemented to achieve this.  However, it is not particularly high on my
personal priority list.  Due to challenges seeking in compressed data
streams it might be necessary to decompress the whole file into RAM which 
would not scale well to large datasets. 

comment:4 Changed 12 years ago by warmerdam

Description: modified (diff)
Priority: changed from highest to low

comment:5 Changed 12 years ago by Even Rouault

Here's a patch that adds a new class, VSIGZipHandle, which implements the large file API and enables the transparent use of gzip'd files by drivers using the large file API. It only supports reading.

In VSIFOpenL, we look for the first 2 bytes. If they match the magic header of GZip file, we wrap the poVirtualHandle with a VSIGZipHandle object.
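The detection step described above can be sketched in a few lines. This is an illustration in Python using only the standard library, not the C++ code from the patch; the helper names `looks_like_gzip` and `open_transparently` are hypothetical. The only fact relied on is that a gzip stream starts with the two bytes 0x1f 0x8b.

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_like_gzip(stream):
    """Return True if the stream starts with the gzip magic bytes.

    The stream position is restored afterwards, so the caller can still
    read the file from where it was.
    """
    start = stream.tell()
    header = stream.read(2)
    stream.seek(start)
    return header == GZIP_MAGIC


def open_transparently(stream):
    """Hypothetical helper: wrap gzip'd streams, pass others through."""
    if looks_like_gzip(stream):
        # The wrapper exposes the uncompressed bytes to the reader.
        return gzip.GzipFile(fileobj=stream, mode="rb")
    return stream
```

A reader that goes through `open_transparently` sees the uncompressed bytes in both cases, which is the same effect the patch aims for when drivers inspect pabyHeader.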

It required a change in GDALOpenInfo to first try to open the file with the large file API, instead of first trying with the old file API. This way the drivers will see the uncompressed stream as pabyHeader instead of the compressed one.

Tested successfully with GTiff and NITF drivers. A bit slower than access to uncompressed file on a MODIS and Ikonos TIFF image, but usable. A gdalinfo on a .tif.gz is slow since for some reason the GTiff driver seeks nearly to the end of the TIF file.

I made no effort to make seeking fast, so depending on the way the file is accessed, it may be really slow. I've tested it on OnEarth images, and... it's actually slow.

Changed 12 years ago by Even Rouault

Attachment: gdal_svn_trunk_gzip.patch added

Add capability of reading .gz files transparently from GDAL drivers

comment:6 Changed 12 years ago by warmerdam

Even,

I think we should avoid the changes in GDALOpen and CPLReadDir. Instead I think gzip files should be accessed with the filename prefix /vsizip/ with the remainder of the path being the real path. So the abc.png file in /usr/data/def.zip would have a virtual filename "/vsizip/usr/data/def.zip/abc.png".

This wouldn't address how to navigate into a zip file directly or how to GDALOpen() it directly, but I think that aspect needs to be carefully considered.

Access to TIFF files in .zip files will generally suck because the TIFF format uses a lot of seeking, and typically the directory (the first thing read) is at the end of the file.
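To make the seeking problem concrete: a classic TIFF file stores the offset of its first directory (IFD) in bytes 4 to 7 of the header, so a reader has to jump to wherever that offset points before it can learn anything about the image. The Python sketch below (the function name `first_ifd_offset` is hypothetical, and BigTIFF is not handled) just decodes that offset. When the value is close to the file size, a compressed stream must be inflated almost entirely to reach it.

```python
import struct


def first_ifd_offset(header):
    """Return the byte offset of the first TIFF directory (IFD).

    `header` holds at least the first 8 bytes of a classic TIFF file.
    """
    if len(header) < 8:
        raise ValueError("need at least 8 bytes")
    byte_order = header[:2]
    if byte_order == b"II":
        fmt = "<HI"  # little-endian: 16-bit magic, 32-bit offset
    elif byte_order == b"MM":
        fmt = ">HI"  # big-endian
    else:
        raise ValueError("not a TIFF file")
    magic, offset = struct.unpack(fmt, header[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file")
    return offset
```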

comment:7 Changed 12 years ago by Even Rouault

I'm attaching an improved version of the patch that adds quite fast random seek. It is done thanks to the regular creation of 'snapshots' using inflateCopy, which can dump the gzip state. Performance when using .tif.gz in openev is thus much improved (after the initial seek to the end of the file, which can take some time, but I don't see how this could be avoided), which makes them usable on the test cases I mentioned in my first post.
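The snapshot idea can be sketched in Python, whose `zlib` decompression objects offer a `copy()` method for saving the decompressor state mid-stream. This is an illustration of the technique only, not the patch's C++ code: the class name `SnapshotGzipReader`, the in-memory input, and the snapshot interval (one per compressed chunk) are all assumptions made for the example.

```python
import bisect
import zlib


class SnapshotGzipReader:
    """Random-access reads over in-memory gzip data using state snapshots."""

    def __init__(self, compressed, chunk_size=64 * 1024):
        self._data = compressed
        self._chunk = chunk_size
        # Parallel lists: uncompressed offset, compressed offset, saved state.
        self._out_offsets = [0]
        self._in_offsets = [0]
        # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
        self._states = [zlib.decompressobj(zlib.MAX_WBITS | 16)]

    def read_at(self, offset, size):
        """Return up to `size` uncompressed bytes starting at `offset`."""
        if offset < 0:
            raise ValueError("offset must be non-negative")
        # Latest snapshot taken at or before the requested offset.
        i = bisect.bisect_right(self._out_offsets, offset) - 1
        out_pos = self._out_offsets[i]
        in_pos = self._in_offsets[i]
        # Work on a copy so the stored snapshot stays reusable.
        state = self._states[i].copy()
        result = bytearray()
        while in_pos < len(self._data) and len(result) < size:
            chunk = self._data[in_pos:in_pos + self._chunk]
            out = state.decompress(chunk)
            in_pos += len(chunk)
            # Keep only the part of this output that overlaps the request.
            skip = max(0, offset - out_pos)
            result += out[skip:skip + size - len(result)]
            out_pos += len(out)
            if state.eof:
                break
            if out_pos > self._out_offsets[-1]:
                # New territory: remember the state for later seeks.
                self._out_offsets.append(out_pos)
                self._in_offsets.append(in_pos)
                self._states.append(state.copy())
        return bytes(result)
```

The first read far into the stream still pays for inflating everything before it, which matches the slow initial seek described above. Later reads restart from the nearest earlier snapshot instead of from the beginning, at the cost of keeping one saved state per snapshot in memory.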

Frank, as far as your remarks are concerned, this version doesn't take them into account yet as I've a few remarks/questions too :

  • I think we should use rather /vsigzip for GZIP files and keep /vsizip for ZIP (PkZIP) files
  • The syntax /vsizip/usr/data/def.zip/abc.png seems a good idea for ZIP files. But for gzip ones, IMHO, it doesn't make much sense, unless the .gz is in fact a .tar.gz
  • The syntax /vsigzip/ should be kept internal to GDAL or reserved for advanced users. I think we must find a way to make "openev foo.xxx.gz" work
  • For a .tar.gz, the user should be able to do "openev /usr/data/def.tar.gz/abc.png"
  • For a .zip, the user should be able to do "openev /usr/data/def.zip/abc.png". But if the .zip is made of a single file (SRTM HGT files can be downloaded as single file ZIP archive for example), "openev /usr/data/def.zip" should work too.

All in all, the main question is : where to put the magic to translate from the "user friendly filename" to the GDAL virtual filename.

Changed 12 years ago by Even Rouault

comment:8 Changed 12 years ago by Even Rouault

Another improved version of the patch that adds support for .zip files. Reading of zip files is done through unzip.c from contrib/minizip in zlib-1.2.3. With the zip file handler, I can read GTiff, BT and CADRG datasets included in .zip files.

I've also added /vsigzip and /vsizip FilesystemHandlers.

However I've still kept/added in VSIFOpenL, VSIFStatL and VSIReadDir the logic that automagically translates from the natural filename (/usr/data/def.zip/abc.png, foo.xxx.gz) to the virtual filename (/vsizip/usr/data/def.zip/abc.png, /vsigzip/foo.xxx.gz). I don't really see how to do it differently.

To sum up the current possibilities, one can do :

  • gdalinfo foo.gz
  • gdalinfo /vsigzip/foo.gz
  • gdalinfo foo.zip (if the zip file contains only 1 file)
  • gdalinfo /vsizip/foo.zip (if the zip file contains only 1 file)
  • gdalinfo foo.zip/foo.XXX
  • gdalinfo /vsizip/foo.zip/foo.XXX
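The translation from natural to virtual filenames listed above can be sketched as a small Python function. This is only an illustration based on the examples in this comment: the name `to_virtual_path` is hypothetical, and the sketch decides purely from file extensions, whereas real code would presumably also check what actually exists on disk.

```python
def to_virtual_path(path):
    """Hypothetical helper mapping a natural filename to a /vsi path."""
    if path.startswith(("/vsigzip/", "/vsizip/")):
        return path  # already a virtual filename
    # A .zip component anywhere in the path means "look inside the archive".
    for part in path.split("/"):
        if part.lower().endswith(".zip"):
            return "/vsizip/" + path.lstrip("/")
    # A gzip file holds a single stream, so only the last component matters.
    if path.lower().endswith(".gz"):
        return "/vsigzip/" + path.lstrip("/")
    return path
```

With this mapping, "gdalinfo foo.gz" and "gdalinfo /vsigzip/foo.gz" resolve to the same virtual filename, as do the natural and prefixed .zip forms.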

comment:9 Changed 12 years ago by Even Rouault

Patch updated with the changes required in srtmhgtdataset.cpp so that .hgt.zip files downloaded from ftp://e0srp01u.ecs.nasa.gov/srtm/version2 can be read as such.

comment:10 Changed 12 years ago by Even Rouault

Patch updated to support stored (uncompressed) files in .zip archives.
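For readers unsure which case applies to a given archive: each ZIP member records its own compression method, and "stored" members are kept byte for byte without compression. The Python sketch below (the helper name `member_methods` is hypothetical) lists the method of every member using the standard `zipfile` module. It is unrelated to the minizip code used by the patch.

```python
import io
import zipfile


def member_methods(zip_bytes):
    """Map each member name to 'stored', 'deflated', or the raw method code."""
    names = {zipfile.ZIP_STORED: "stored", zipfile.ZIP_DEFLATED: "deflated"}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return {
            info.filename: names.get(info.compress_type, info.compress_type)
            for info in archive.infolist()
        }
```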

Changed 12 years ago by Even Rouault

comment:11 Changed 11 years ago by Even Rouault

Keywords: zip gzip added
Milestone: 1.6.0
Resolution: fixed
Status: changed from new to closed

I've committed, from r15211 to r15221, the necessary code to read data directly from .gz and .zip. It is 100% based on gdal_svn_trunk_gzip_and_zip.patch plus some minor improvements.

However, I didn't commit the "magic" parts in cpl_vsil.cpp that autodetect that the passed file is a .gz or .zip and prepend the right prefix to the filename.

So the following will work :

gdalinfo /vsigzip/foo.gz 
gdalinfo /vsizip/foo.zip (if the zip file contains only 1 file) 
gdalinfo /vsizip/foo.zip/foo.XXX

Important note: drivers must support the VSI*L API to use these new capabilities.

comment:12 Changed 11 years ago by maphew

thank you!
