Changes between Version 39 and Version 40 of WKTRaster/GDALDriverSpecificationWorking


Timestamp:
Sep 8, 2011, 5:11:46 PM (13 years ago)
Author:
jorgearevalo
Comment:

Reading/Writing data section cleared.

  • WKTRaster/GDALDriverSpecificationWorking

    v39 v40  
    110110  * Region oriented r/w: The driver reads/writes arbitrary regions of data. It's a potentially less efficient method, because you have to take care of '''data type translation''' if the data type of the buffer differs from that of the GDALRasterBand. You must also take care of '''image decimation / replication''' if the buffer size (nBufXSize x nBufYSize) differs from the size of the region being accessed (nXSize x nYSize). To use this method, your driver must provide an implementation of [http://www.gdal.org/classGDALRasterBand.html#5497e8d29e743ee9177202cb3f61c3c7 IRasterIO].
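As an illustration of the two duties named above, here is a minimal, self-contained sketch. It is not GDAL code: `ReadRegion` is a hypothetical helper, the raster is assumed to be a plain row-major array, and nearest-neighbour sampling is assumed for decimation / replication. A real driver would more likely rely on GDAL utilities such as GDALCopyWords.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of GDAL: copies a source window of size
// nXSize x nYSize into a buffer of size nBufXSize x nBufYSize using
// nearest-neighbour sampling, converting each pixel from SrcT to DstT.
template <typename SrcT, typename DstT>
std::vector<DstT> ReadRegion(const std::vector<SrcT>& raster, int rasterWidth,
                             int nXOff, int nYOff, int nXSize, int nYSize,
                             int nBufXSize, int nBufYSize)
{
    assert(nXSize > 0 && nYSize > 0 && nBufXSize > 0 && nBufYSize > 0);
    std::vector<DstT> buffer(static_cast<std::size_t>(nBufXSize) * nBufYSize);
    for (int by = 0; by < nBufYSize; ++by) {
        // Map the buffer row back to a source row (decimation or replication).
        const int sy = nYOff + by * nYSize / nBufYSize;
        for (int bx = 0; bx < nBufXSize; ++bx) {
            const int sx = nXOff + bx * nXSize / nBufXSize;
            const SrcT value =
                raster[static_cast<std::size_t>(sy) * rasterWidth + sx];
            // Data type translation: float -> int truncates toward zero.
            buffer[static_cast<std::size_t>(by) * nBufXSize + bx] =
                static_cast<DstT>(value);
        }
    }
    return buffer;
}
```

Requesting a 4 x 4 window into a 2 x 2 buffer decimates the data, while requesting a 2 x 2 window into a 4 x 4 buffer replicates pixels. With a float raster and an int buffer the fractional part is lost, which is the truncation issue raised in the discussion.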
    111111
    112  '''Pierre question:''' In which case "the data type of the buffer is different than that of the GDALRasterBand"?
    113 
    114  '''Pierre question:''' In which case "the buffer size (nBufXSize x nBufYSize) is different than the size of the region being accessed (nXSize x nYSize)"?
    115 
    116  '''Jorge''': Ok, these are not use cases of our driver. The point is: if you implement IRasterIO, you must deal with the fact that the caller may pass an arbitrary data type and buffer size to your implementation (these are function arguments). We could simply detect this situation by comparing the data type and buffer size provided as arguments with the known data type and raster size, and raise an error if needed. But I think if the caller uses an integer buffer and we have a raster with floating point values, for example, we should still provide the raster data and warn the user about possibly inaccurate values caused by the truncation.
    117 
    118  '''Pierre question:''' Do we agree that "Region oriented" is a general case of "''Natural'' block oriented" and that IReadBlock can be implemented as a wrapper around IRasterIO?
    119 
    120  '''Jorge''': Agree. But taking into account the previous comment.
    121 
    122 Clearly, there's no single best method for reading/writing data in our case. In the ideal case of regularly blocked rasters, with no overlapping and the same grid for all tiles, block oriented r/w is the most appropriate strategy. But in the rest of the cases, a more general r/w method must be provided.
    123 
    124 Currently, the natural block oriented r/w method is the one implemented for the driver. This is a limitation for 2 reasons:
    125   * Obviously, it only fits one raster arrangement
    126   * Each !ReadBlock call forces a new server round trip, constructing a Box and getting the raster row that contains it. This can be really slow in the case of huge raster coverages (question raised in ticket #497 too).
    127 
    128  '''Pierre question:''' Why does each IReadBlock force a new server round and IRasterIO not?
    129 
    130 '''Open question''': How to get the needed metadata in case of ONE_RASTER_PER_TABLE arrangement. As argued in ticket #497, executing ''ST_Extent'' or ''ST_Metadata'' without limits over a big table can be a really heavy process.
    131 
    132  '''Pierre question:''' How long does an ST_Extent query take on an indexed table of 1 000 000 tiles?
    133 
    134   '''Jorge''': Maybe acceptable time. But 1 000 000 server calls to get the raster data portion surrounded by the box constructed with 4 coords (each IReadBlock round) is too slow, I think. Each IReadBlock call may imply a simple ''select st_band(rast, nband) from table where...'' in case of regularly blocked rasters (1 natural GDAL block = 1 raster table row), but the general case (to be implemented in IRasterIO) is different. The caller may want a region covered by 2 raster tiles. Ok, right now we simply support this particular case (regularly blocked raster), but for the more general case, we'd need a ''st_intersection'' version returning raster data, IMHO.
    135 
    136    '''Pierre''': But how can we avoid the fact that IReadBlock fetches one tile at a time? There seems to be no way around it. It seems to me that in our case IRasterIO is generally more efficient in fetching raster data because it can fetch a larger extent than a single tile. Right? How does GDAL decide when to use IReadBlock or IRasterIO? Can't we force it to use IRasterIO all (most) of the time?
    137 
    138    You could use a future ST_Union() (not ST_Intersection()) to merge many tiles into one, filling the need of IRasterIO, but you can also implement your simpler "burn-the-last-fetched-tile-at-the-proper-location" algorithm.
    139 
    140       '''Jorge''': The I/O system in GDAL works like this (thanks to Even Rouault, who provided me this explanation some time ago): The GDALRasterBand object has an array of pointers to blocks (papoBlocks). Initially, all the pointers are set to NULL.  When an I/O operation over a band is issued, the default implementation in GDALRasterBand::IRasterIO() will divide the requested source window into a list of blocks matching the block size of the band. It will call the GetLockedBlockRef() method of GDALRasterBand to see if there's already a matching block stored in papoBlocks. If yes, the caller will fetch the pixel buffer from the cached block. If not, it will call the IReadBlock() method on the band (a particular driver's implementation; our driver provides one) and will store the resulting buffer in a new GDALRasterBlock object.
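The lookup-then-read behaviour described above can be sketched with a toy model. This is only an illustration under simplifying assumptions: `ToyBand`, `Block` and `BlockReader` are hypothetical names, a `std::map` stands in for papoBlocks, and the callback plays the role of the driver's IReadBlock. The real GDAL classes handle locking, eviction and data types, which are omitted here.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <utility>
#include <vector>

// Simplified stand-in for the block cache described above. Every name here
// is hypothetical; this is not the GDALRasterBand implementation.
class ToyBand {
public:
    using Block = std::vector<float>;
    using BlockReader = std::function<Block(int blockX, int blockY)>;

    ToyBand(int blockW, int blockH, BlockReader reader)
        : blockW_(blockW), blockH_(blockH), reader_(std::move(reader))
    {
        assert(blockW > 0 && blockH > 0);
    }

    // Touches every block intersecting the window and returns how many
    // blocks had to be read through the driver callback (cache misses).
    int RasterIO(int xOff, int yOff, int xSize, int ySize)
    {
        int misses = 0;
        const int lastBlockY = (yOff + ySize - 1) / blockH_;
        const int lastBlockX = (xOff + xSize - 1) / blockW_;
        for (int by = yOff / blockH_; by <= lastBlockY; ++by) {
            for (int bx = xOff / blockW_; bx <= lastBlockX; ++bx) {
                const auto key = std::make_pair(bx, by);
                if (cache_.find(key) == cache_.end()) {
                    cache_[key] = reader_(bx, by);  // role of IReadBlock
                    ++misses;
                }
            }
        }
        return misses;
    }

private:
    int blockW_;
    int blockH_;
    BlockReader reader_;
    std::map<std::pair<int, int>, Block> cache_;
};
```

In this model a window that straddles four blocks triggers four reader calls the first time and none the second time. If each reader call is one SQL query, the number of server round trips grows with the number of uncached blocks, which is the cost discussed in this thread.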
    141 
    142       So, GDAL performs its I/O operations in a block-oriented mode. It may be natural to think that a GDAL block = PostGIS Raster tile, and that you only need to provide an IReadBlock implementation, but this is a limited point of view. It only works with a regular blocking arrangement, as it does now.
    143 
    144       We could get rid of IReadBlock and implement our own IRasterIO version, as you suggested. This IRasterIO implementation would perform the query to get the data. Again, we could call IReadBlock only when the requested region size matches the tile size, but it would only work if all the tiles have the same size. The GDAL block system is not really intended to deal with blocks of different sizes.
    145 
    146       Another important thing is IRasterIO doesn't make any assumption about data buffer size or data type. So, the function must check if the buffer size is enough to store the requested raster data (in case of data read), and if not, use the proper overview to read the data (in case overviews exist), or resample the raster data to fit into the buffer. You must check the data type too. If the raster data type and the buffer data type are different, you must perform the data type translation.
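To make the overview choice concrete, here is a small sketch of one possible selection rule. It is an assumption for illustration, not the logic of GDALBandGetBestOverviewLevel: `PickOverview` is a hypothetical function that takes the decimation factors of the available overviews and returns the index of the coarsest one that still has at least the resolution the buffer needs, or -1 when the full resolution band should be read.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical selection rule: pick the largest overview decimation factor
// that does not exceed the requested downsampling ratio. Returns -1 when no
// overview qualifies and the full resolution data should be used.
int PickOverview(const std::vector<int>& factors,
                 int nXSize, int nYSize, int nBufXSize, int nBufYSize)
{
    assert(nBufXSize > 0 && nBufYSize > 0);
    // Downsampling ratio requested by the caller, per axis; keep the smaller
    // one so that neither axis is read at too coarse a resolution.
    const double ratio = std::min(static_cast<double>(nXSize) / nBufXSize,
                                  static_cast<double>(nYSize) / nBufYSize);
    int best = -1;
    for (std::size_t i = 0; i < factors.size(); ++i) {
        if (factors[i] <= ratio &&
            (best < 0 || factors[i] > factors[static_cast<std::size_t>(best)])) {
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

For a 1000 x 1000 window read into a 200 x 200 buffer with overviews at factors 2, 4 and 8, this rule picks the factor 4 overview, and the remaining downsampling has to be done by resampling.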
    147 
    148       There are some useful functions to deal with these problems, like GDALCopyWords or GDALBandGetBestOverviewLevel. But we're, in some way, breaking the GDAL I/O philosophy.
    149 
    150 
    151 '''Open question''': What should be the general r/w algorithm?
    152 jorgearevalo: I think the strategy ''read as much data as you can'' should be the right one, to minimize server round trips. That is: construct a query that, using ''ST_Intersects'', fetches as many rows as possible. This query would be executed in the ''IRasterIO'' method. But I don't know how to choose the geographic limits for the query (how much data is ''as much data as you can''?)
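A possible shape for such a query is sketched below, deriving the geographic limits from the arguments of the request and the raster metadata. Everything here is an assumption made for illustration: `GeoTransform` and `BuildRegionQuery` are hypothetical names, the raster is assumed to be north-up (no rotation), and the table and column names are passed in unescaped, which a real driver should not do.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical north-up georeferencing: pixelHeight is negative because
// row numbers grow downwards while Y coordinates grow upwards.
struct GeoTransform {
    double originX;
    double originY;
    double pixelWidth;
    double pixelHeight;
};

// Builds one query that fetches every raster row intersecting the requested
// pixel window, so the whole region costs a single server round trip.
std::string BuildRegionQuery(const std::string& table, const std::string& column,
                             const GeoTransform& gt, int srid,
                             int nXOff, int nYOff, int nXSize, int nYSize)
{
    assert(nXSize > 0 && nYSize > 0);
    assert(gt.pixelWidth > 0 && gt.pixelHeight < 0);
    const double minX = gt.originX + nXOff * gt.pixelWidth;
    const double maxX = gt.originX + (nXOff + nXSize) * gt.pixelWidth;
    const double maxY = gt.originY + nYOff * gt.pixelHeight;
    const double minY = gt.originY + (nYOff + nYSize) * gt.pixelHeight;

    std::ostringstream sql;
    sql.precision(15);  // keep enough digits for real-world coordinates
    sql << "SELECT " << column << " FROM " << table
        << " WHERE ST_Intersects(" << column << ", ST_MakeEnvelope("
        << minX << ", " << minY << ", " << maxX << ", " << maxY << ", "
        << srid << "))";
    return sql.str();
}
```

The envelope comes entirely from the request window and the geotransform, which matches Pierre's remark that the limits can be deduced from the function parameters and the raster metadata.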
    153 
    154  '''Pierre question:''' What do you mean by "I don't know how to choose the geographic limits"? Can't you deduce them from the parameter passed to the function and the metadata of the rasters?
    155 
    156    '''Jorge:''' Yes, deprecated question. Forget it.
    157 
    158 
    159 ----
    160 
    161 
    162 === Implementation ===
     112We chose the region oriented approach. Please read Pierre's suggested pseudocode and Jorge's comments:
    163113
    164114'''Pierre:''' Can't it be as simple as the pseudocode below? In the best case the required blocks fit what is in the table and everything is optimized. If not, it is slower. We don't have to know in advance whether the table is regularly tiled or not.
    165115
    166 --------------------------------
     116{{{
    167117GDALRasterBand::IRasterIO(required tile metadata) {
    168118
     
    180130
    181131}
     132}}}
    182133
     134'''Jorge:''' Basically correct, but I need to add some notes here:
     135
     136Basic GDAL I/O follows this scheme: GDALDataSet::RasterIO --> GDALRasterBand::RasterIO --> Divide the requested window into blocks, matching the GDAL block size specified in the band constructor --> Look for block data in the cache --> (if data not in cache) ReadBlock --> IReadBlock (your driver's implementation)[[BR]]
     137
     138A lot of raster formats use tiles, and GDAL is prepared for that with the ''block'' concept. So, if a raster coverage is divided into tiles, GDAL defines the ''natural block size'' as ''the block size that is most efficient for accessing the format'', understanding ''block'' as ''raster tile''. For this reason, most drivers implement their own version of the IReadBlock method. They delegate the I/O system to the GDAL core, and only provide blocks of data when required. GDAL also provides a block cache.[[BR]]
     139
     140The above scheme has a problem: you need a fixed block size. All the blocks of a given coverage have the same width and height. And the regularly blocked scheme is only one of the possible data arrangements PostGIS Raster provides.[[BR]]
     141
     142To allow all the possible data arrangements, the PostGIS Raster driver must go deeper into the GDAL I/O system. It must implement its own version of the IRasterIO method, at Dataset level, for two main reasons:
     143  * IRasterIO is a GDALDataset method that allows I/O of an arbitrarily sized region of the data. This is useful in cases where blocked I/O isn't optimal or isn't even possible. This is the case for PostGIS Raster (without a fixed block size, there's no blocked I/O)
     144  * Getting all the data in a given region (no matter how many raster rows that is) implies only one SQL query. Minimizing server round trips matters for PostGIS Raster (more speed, better performance)[[BR]]
     145
     146So, basically the pseudocode for IRasterIO is right, but as GDAL is prepared for blocked I/O with same-size blocks and PostGIS Raster can't ensure all the coverage tiles have the same size, we must make a decision:
     147  * Get rid of the GDAL block concept and simply get all the requested data in a query from IRasterIO method. That implies getting rid of GDAL block cache too.
     148  * Adapt our ''messy'' arrangement to GDAL block system, by dividing the data fetched from database in equally sized blocks, and manually adding them to the cache. An important thing here is what happens when some raster data, stored in the cache, is modified. Using the GDAL cache system, you can simply replace the data in the cache and mark the block as ''dirty'' (using GDALRasterBlock::MarkDirty method), but... '''Q: Are the dirty blocks written to disk (database, in our case) by IWriteBlock method? If not, how?'''
     149
     150After a conversation between Pierre Racine and Jorge Arevalo, we agree on the approach of getting all data in IRasterIO, dividing it into equally sized blocks and storing them in the cache. Jorge Arevalo is working on it (September 2011).
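The agreed approach can be pictured with the following sketch, which only covers the "divide into equally sized blocks" step. It is an assumption-laden illustration: `Block` and `SplitIntoBlocks` are hypothetical names, the fetched region is assumed to be a row-major float array, and blocks on the right and bottom edges are padded with a nodata value. Inserting the blocks into the real GDAL cache and flushing dirty ones is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical block record; in GDAL this role is played by GDALRasterBlock.
struct Block {
    int blockX = 0;
    int blockY = 0;
    std::vector<float> pixels;  // blockW * blockH values, row-major
    bool dirty = false;         // would be set when cached data is modified
};

// Cuts a region fetched in a single query into equally sized blocks.
// Partial blocks at the right/bottom edges are padded with noData.
std::vector<Block> SplitIntoBlocks(const std::vector<float>& region,
                                   int width, int height,
                                   int blockW, int blockH, float noData)
{
    assert(blockW > 0 && blockH > 0 && width > 0 && height > 0);
    assert(region.size() == static_cast<std::size_t>(width) * height);
    const int blocksX = (width + blockW - 1) / blockW;
    const int blocksY = (height + blockH - 1) / blockH;

    std::vector<Block> blocks;
    for (int by = 0; by < blocksY; ++by) {
        for (int bx = 0; bx < blocksX; ++bx) {
            Block b;
            b.blockX = bx;
            b.blockY = by;
            b.pixels.assign(static_cast<std::size_t>(blockW) * blockH, noData);
            for (int y = 0; y < blockH && by * blockH + y < height; ++y) {
                for (int x = 0; x < blockW && bx * blockW + x < width; ++x) {
                    const std::size_t src =
                        static_cast<std::size_t>(by * blockH + y) * width +
                        (bx * blockW + x);
                    b.pixels[static_cast<std::size_t>(y) * blockW + x] = region[src];
                }
            }
            blocks.push_back(b);
        }
    }
    return blocks;
}
```

With this layout, one query can feed many cache entries, so later requests over the same area would be served from the cache instead of the database.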
     151
     152
     153About the IReadBlock method:
     154
     155{{{
    183156GDALRasterBand::!ReadBlock(block x & y){
    184157
     
    187160  call IRasterIO(required tile metadata)
    188161}
     162}}}
    189163
    190 
    191 
    192 '''Jorge:''' I think we should get rid of IReadBlock, because of the argument above. About the IRasterIO pseudocode, yes, I think it is mostly correct.
    193 
     164We should get rid of it, for the reasons explained above. We aren't working with equally sized rasters, in general. And even in that case, getting all the requested data in one query with IRasterIO method is a better approach (faster).