Changes between Initial Version and Version 1 of rfc63_sparse_datasets_improvements


Timestamp:
Jul 8, 2016, 7:17:11 AM (8 years ago)
Author:
Even Rouault
Comment:

Create RFC 63

  • rfc63_sparse_datasets_improvements

     1= RFC 63 : Sparse datasets improvements =
     2
     3Author: Even Rouault[[BR]]
     4
     5Contact: even.rouault at spatialys.com[[BR]]
     6
     7Status: Development[[BR]]
     8
     9Target version: 2.2
     10
     11== Summary ==
     12
     13This RFC covers an improvement to manage sparse datasets, that is to say datasets that contain substantial empty regions.
     14
     15== Approach ==
     16
     17There are use cases where one needs to read or generate a dataset that covers a large spatial extent, but in which significant parts are not covered by data. The GDAL API currently offers no way to quickly determine which areas are covered by data, so all pixels must be processed, which is inefficient. Yet some formats, such as GeoTIFF, VRT or GeoPackage, can potentially provide this information without processing pixels.
     18
     19It is thus proposed to add a new method GetDataCoverageStatus() in the GDALRasterBand class, that takes as input a window of interest and returns whether it is made of data, empty blocks or a mix of them.
     20
     21This method will be used by the GDALDatasetCopyWholeRaster() method (used by CreateCopy() / gdal_translate) to avoid processing sparse regions when the output driver instructs it to do so.
     22
     23== C++ API ==
     24
     25In GDALRasterBand class, a new virtual method is added :
     26{{{
     27 virtual int IGetDataCoverageStatus( int nXOff, int nYOff,
     28                                     int nXSize, int nYSize,
     29                                     int nMaskFlagStop,
     30                                     double* pdfDataPct);
     31
     32/**
     33 * \brief Get the coverage status of a sub-window of the raster.
     34 *
     35 * Returns whether a sub-window of the raster contains only data, only empty
     36 * blocks or a mix of both.
     37 *
     38 * Empty blocks are blocks that contain only pixels whose value is the nodata value when it
     39 * is set, or whose value is 0 when the nodata value is not set.
     40 *
     41 * The query is done in an efficient way without reading the actual pixel
     42 * values. If not possible, GDAL_DATA_COVERAGE_STATUS_UNIMPLEMENTED will be
     43 * returned.
     44 *
     45 * @see GDALGetDataCoverageStatus()
     46 *
     47 * @param nXOff The pixel offset to the top left corner of the region
     48 * of the band to be queried. This would be zero to start from the left side.
     49 *
     50 * @param nYOff The line offset to the top left corner of the region
     51 * of the band to be queried. This would be zero to start from the top.
     52 *
     53 * @param nXSize The width of the region of the band to be queried in pixels.
     54 *
     55 * @param nYSize The height of the region of the band to be queried in lines.
     56 *
     57 * @param nMaskFlagStop 0, or a binary-or'ed mask of possible values
     58 * GDAL_DATA_COVERAGE_STATUS_UNIMPLEMENTED,
     59 * GDAL_DATA_COVERAGE_STATUS_DATA and GDAL_DATA_COVERAGE_STATUS_EMPTY. As soon as
     60 * the computation of the coverage matches the mask, the computation will be
     61 * stopped. pdfDataPct will not be valid in that case.
     62 *
     63 * @param pdfDataPct Optional output parameter whose pointed value will be set
     64 * to the (approximate) percentage in [0,100] of pixels in the queried sub-window that have
     65 * valid values. The implementation might not always be able to compute it, in
     66 * which case it will be set to a negative value.
     67 *
     68 * @return a binary-or'ed combination of possible values
     69 * GDAL_DATA_COVERAGE_STATUS_UNIMPLEMENTED,
     70 * GDAL_DATA_COVERAGE_STATUS_DATA and GDAL_DATA_COVERAGE_STATUS_EMPTY
     71 *
     72 * @note Added in GDAL 2.2
     73 */
     74}}}
     75
     76This method has a trivial default implementation that returns GDAL_DATA_COVERAGE_STATUS_UNIMPLEMENTED.
     77
     78The public API is made of :
     79
     80{{{
     81C++ :
     82
     83int  GDALRasterBand::GetDataCoverageStatus( int nXOff,
     84                                            int nYOff,
     85                                            int nXSize,
     86                                            int nYSize,
     87                                            int nMaskFlagStop,
     88                                            double* pdfDataPct)
     89
     90C :
     91int GDALGetDataCoverageStatus( GDALRasterBandH hBand,
     92                               int nXOff, int nYOff,
     93                               int nXSize,
     94                               int nYSize,
     95                               int nMaskFlagStop,
     96                               double* pdfDataPct);
     97}}}
     98
     99GDALRasterBand::GetDataCoverageStatus() does basic checks on the validity of the window before calling IGetDataCoverageStatus().
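
The split between the public method and the virtual I-method can be illustrated with a small self-contained model. This is only a sketch under stated assumptions: `ToyBand` and `ToyLeftDataBand` are hypothetical classes, not GDAL classes, the `STATUS_*` values are stand-ins for the `GDAL_DATA_COVERAGE_STATUS_*` constants defined in the GDAL headers, and the value returned for an invalid window is a choice of this sketch rather than documented GDAL behaviour.

```cpp
#include <algorithm>
#include <cassert>

// Stand-in flag values for this sketch only (assumption). Real code should
// use the GDAL_DATA_COVERAGE_STATUS_* constants from the GDAL headers.
constexpr int STATUS_UNIMPLEMENTED = 0x01;
constexpr int STATUS_DATA = 0x02;
constexpr int STATUS_EMPTY = 0x04;

// Hypothetical toy band, modelling the public/virtual split described above.
class ToyBand
{
  public:
    ToyBand(int nXSizeIn, int nYSizeIn)
        : nRasterXSize(nXSizeIn), nRasterYSize(nYSizeIn)
    {
        assert(nXSizeIn > 0 && nYSizeIn > 0);
    }
    virtual ~ToyBand() = default;

    // Public entry point: validates the window, then delegates.
    int GetDataCoverageStatus(int nXOff, int nYOff, int nXSize, int nYSize,
                              int nMaskFlagStop, double *pdfDataPct)
    {
        const bool bValid = nXOff >= 0 && nYOff >= 0 &&
                            nXSize > 0 && nYSize > 0 &&
                            nXSize <= nRasterXSize - nXOff &&
                            nYSize <= nRasterYSize - nYOff;
        if (!bValid)
        {
            if (pdfDataPct)
                *pdfDataPct = -1.0;       // percentage unknown
            return STATUS_UNIMPLEMENTED;  // choice made by this sketch
        }
        return IGetDataCoverageStatus(nXOff, nYOff, nXSize, nYSize,
                                      nMaskFlagStop, pdfDataPct);
    }

  protected:
    // Default implementation: the format cannot answer cheaply.
    virtual int IGetDataCoverageStatus(int, int, int, int, int,
                                       double *pdfDataPct)
    {
        if (pdfDataPct)
            *pdfDataPct = -1.0;
        return STATUS_UNIMPLEMENTED;
    }

    int nRasterXSize;
    int nRasterYSize;
};

// Hypothetical "driver" band: columns [0, nDataCols) hold data, the rest
// of the raster is empty.
class ToyLeftDataBand final : public ToyBand
{
  public:
    ToyLeftDataBand(int nXSizeIn, int nYSizeIn, int nDataColsIn)
        : ToyBand(nXSizeIn, nYSizeIn), nDataCols(nDataColsIn)
    {
    }

  protected:
    // nMaskFlagStop is ignored here: early termination is an optimisation
    // and this toy answer is already cheap to compute.
    int IGetDataCoverageStatus(int nXOff, int, int nXSize, int, int,
                               double *pdfDataPct) override
    {
        // Number of queried columns that fall inside the data area.
        const int nEnd = nXOff + nXSize;
        const int nCovered = std::max(0, std::min(nEnd, nDataCols) - nXOff);
        if (pdfDataPct)
            *pdfDataPct = 100.0 * nCovered / nXSize;
        int nStatus = 0;
        if (nCovered > 0)
            nStatus |= STATUS_DATA;
        if (nCovered < nXSize)
            nStatus |= STATUS_EMPTY;
        return nStatus;
    }

  private:
    int nDataCols;
};
```

A caller would test the returned value bit by bit: a result equal to the "empty" flag alone means the window can be skipped, while any result containing the "unimplemented" flag means the pixels have to be read as before.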
     100
     101== Changes ==
     102
     103GDALDatasetCopyWholeRaster() accepts a SKIP_HOLES option that can be set to YES by the output driver to cause GetDataCoverageStatus() to be called on each chunk of the source dataset to determine whether it contains only holes.
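
The chunk-skipping logic can be sketched as follows. This is not the code of GDALDatasetCopyWholeRaster(): `CopySkippingHoles`, `CopyStats` and `CoverageQuery` are hypothetical names introduced for illustration, the `STATUS_*` values are stand-ins for the GDAL constants, and the actual read/write of pixels is replaced by a counter.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>

// Stand-in flag values for this sketch only (assumption).
constexpr int STATUS_UNIMPLEMENTED = 0x01;
constexpr int STATUS_DATA = 0x02;
constexpr int STATUS_EMPTY = 0x04;

// Hypothetical coverage query: returns a combination of STATUS_* flags for
// the window (nXOff, nYOff, nXSize, nYSize) of the source.
using CoverageQuery = std::function<int(int, int, int, int)>;

struct CopyStats
{
    int nChunksCopied = 0;
    int nChunksSkipped = 0;
};

// Walks the raster chunk by chunk and skips the chunks that the source
// positively reports as containing only holes.
CopyStats CopySkippingHoles(int nRasterXSize, int nRasterYSize,
                            int nChunkSize, bool bSkipHoles,
                            const CoverageQuery &query)
{
    assert(nChunkSize > 0);
    CopyStats stats;
    for (int nY = 0; nY < nRasterYSize; nY += nChunkSize)
    {
        const int nYSize = std::min(nChunkSize, nRasterYSize - nY);
        for (int nX = 0; nX < nRasterXSize; nX += nChunkSize)
        {
            const int nXSize = std::min(nChunkSize, nRasterXSize - nX);
            // Skip only on an unambiguous "empty" answer. A mix of data and
            // holes, or an unimplemented query, falls through to the copy.
            if (bSkipHoles &&
                query(nX, nY, nXSize, nYSize) == STATUS_EMPTY)
            {
                ++stats.nChunksSkipped;
                continue;
            }
            // A real copy would read the chunk from the source and write
            // it to the destination here.
            ++stats.nChunksCopied;
        }
    }
    return stats;
}
```

Because only an exact "empty" answer triggers a skip, a source whose driver does not implement the query is copied exactly as it would be without SKIP_HOLES.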
     104
     105== Drivers ==
     106
     107This RFC upgrades the GeoTIFF and VRT drivers to implement the IGetDataCoverageStatus() method.
     108
     109The GeoTIFF driver has also received a number of prior enhancements related to that topic, for example to accept the SPARSE_OK=YES creation option in CreateCopy() mode (or the SPARSE_OK open option in update mode).
     110
     111Extract of the documentation of the driver:
     112
     113{{{
     114GDAL makes a special interpretation of a TIFF tile or strip whose offset
     115and byte count are set to 0, that is to say a tile or strip that has no corresponding
     116allocated physical storage. On reading, such tiles or strips are considered to
     117be implicitly set to 0 or to the nodata value when it is defined. On writing, it
     118is possible to enable generating such files through the Create() interface by setting
     119the SPARSE_OK creation option to YES. Then, blocks that are never written
     120through the IWriteBlock()/IRasterIO() interfaces will have their offset and
     121byte count set to 0. This is particularly useful to save disk space and time when
     122the file must be initialized empty before being passed to a further processing
     123stage that will fill it.
     124To avoid ambiguities with another sparse mechanism discussed in the next paragraphs,
     125we will call such files with implicit tiles/strips "TIFF sparse files". They will
     126be likely '''not''' interoperable with TIFF readers that are not GDAL based and
     127would consider such files with implicit tiles/strips as defective.
     128
     129Starting with GDAL 2.2, this mechanism is extended to the CreateCopy() and
     130Open() interfaces (for update mode) as well. If the SPARSE_OK creation option
     131(or the SPARSE_OK open option for Open()) is set to YES, even an attempt to
     132write an all-0/nodata block will be detected so that the tile/strip is not
     133allocated (if it was already allocated, then its content will be replaced by
     134the 0/nodata content).
     135
     136Starting with GDAL 2.2, in the case where SPARSE_OK is '''not''' defined (or set
     137to its default value FALSE), for uncompressed files whose nodata value is not
     138set, or set to 0, in Create() and CreateCopy() mode, the driver will delay the
     139allocation of 0-blocks until file closing, so as to be able to write them at
     140the very end of the file, and in a way compatible with the filesystem sparse file
     141mechanisms (to be distinguished from the TIFF sparse file extension discussed
     142earlier). That is, all the empty blocks will be seen as properly allocated
     143from the TIFF point of view (corresponding strips/tiles will have valid offsets
     144and byte counts), but will have no corresponding physical storage, provided that
     145the filesystem supports such sparse files, which is the case for most popular
     146Linux filesystems (ext2/3/4, xfs, btrfs, ...) and for NTFS on Windows. If the file
     147system does not support sparse files, physical storage will be
     148allocated and filled with zeros.
     149}}}
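
The "empty block" rule quoted above (all pixels equal to the nodata value when it is set, or to 0 otherwise) and the SPARSE_OK write behaviour can be modelled in a few lines. This is a sketch, not the GeoTIFF driver's code: `IsBlockEmpty` and `ToySparseWriter` are hypothetical names, blocks are assumed to hold 8-bit pixels, and a `std::map` stands in for allocated tiles/strips.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

// True when every pixel equals the nodata value if one is set, or 0 if not.
bool IsBlockEmpty(const std::vector<std::uint8_t> &block, bool bHasNodata,
                  std::uint8_t nNodata)
{
    const std::uint8_t nFill = bHasNodata ? nNodata : 0;
    return std::all_of(block.begin(), block.end(),
                       [nFill](std::uint8_t v) { return v == nFill; });
}

// Hypothetical writer modelling SPARSE_OK=YES: a block exists in the map
// only once it has been physically allocated.
struct ToySparseWriter
{
    std::map<int, std::vector<std::uint8_t>> allocated;

    void WriteBlock(int nBlockId, const std::vector<std::uint8_t> &block,
                    bool bHasNodata, std::uint8_t nNodata)
    {
        const bool bEmpty = IsBlockEmpty(block, bHasNodata, nNodata);
        // An empty block that was never allocated stays implicit.
        if (bEmpty && allocated.count(nBlockId) == 0)
            return;
        // Otherwise allocate it, or overwrite the existing storage (an
        // already allocated block receives the 0/nodata content).
        allocated[nBlockId] = block;
    }
};
```

In this model, writing an all-zero block to a fresh writer leaves `allocated` empty, which corresponds to a tile or strip whose offset and byte count stay at 0.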
     150
     151== Bindings ==
     152
     153The Python bindings have a mapping of GDALGetDataCoverageStatus(). Other bindings could be updated (need to figure out how to return both a status flag and a percentage).
     154
     155== Utilities ==
     156
     157No direct changes in utilities.
     158
     159== Results ==
     160
     161With this new capability, a VRT of size 200 000 x 200 000 pixels that contains 2 regions of 20x20 pixels each can be gdal_translated as a sparse tiled GeoTIFF in 2 seconds. The resulting GeoTIFF can be itself translated into another sparse tiled GeoTIFF in the same time.
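
A back-of-the-envelope count shows why skipping holes pays off for such a dataset. The sketch below assumes 256x256 tiles, which the RFC does not state, and uses hypothetical helper names (`CeilDiv`, `TotalTiles`, `TilesTouched`).

```cpp
#include <cassert>
#include <cstdint>

// Integer division rounded up, for positive operands.
constexpr std::int64_t CeilDiv(std::int64_t a, std::int64_t b)
{
    return (a + b - 1) / b;
}

// Total number of tiles needed to cover a raster of the given size.
std::int64_t TotalTiles(std::int64_t nWidth, std::int64_t nHeight,
                        std::int64_t nTileSize)
{
    assert(nWidth > 0 && nHeight > 0 && nTileSize > 0);
    return CeilDiv(nWidth, nTileSize) * CeilDiv(nHeight, nTileSize);
}

// Number of tiles along one axis touched by the pixel range
// [nOff, nOff + nSize).
std::int64_t TilesTouched(std::int64_t nOff, std::int64_t nSize,
                          std::int64_t nTileSize)
{
    assert(nOff >= 0 && nSize > 0 && nTileSize > 0);
    return (nOff + nSize - 1) / nTileSize - nOff / nTileSize + 1;
}
```

Under that assumption, `TotalTiles(200000, 200000, 256)` gives 611,524 tiles, while a 20x20 region touches at most 2 tiles per axis, so the two regions touch at most 8 tiles in total. Nearly all of the file can therefore be left unallocated.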
     162
     163== Future work ==
     164
     165Future work using the new capability could be done in overview building or warping. Other drivers could also benefit from that new capability: GeoPackage, ERDAS Imagine, ...
     166
     167== Documentation ==
     168
     169The new method is documented.
     170
     171== Test Suite ==
     172
     173Tests of the VRT and GeoTIFF drivers are enhanced to test their IGetDataCoverageStatus() implementation.
     174
     175== Compatibility Issues ==
     176
     177C++ ABI change. No functional incompatibility foreseen.
     178
     179== Implementation ==
     180
     181The implementation will be done by Even Rouault.
     182
     183The proposed implementation is in https://github.com/rouault/gdal2/tree/sparse_datasets
     184
     185== Voting history ==
     186
     187TBD
     188