Changes between Version 42 and Version 43 of WKTRaster/GDALDriverSpecificationWorking

Timestamp:: Apr 15, 2012, 3:27:29 AM (12 years ago)
Author:: jorgearevalo
Comment:: Specification simplified and updated after a discussion by mail between the members of the raster team

WKTRaster/GDALDriverSpecificationWorking

-              v42
+              v43
 }}}
 == '''Current status of the driver (February 2011)''' ==
+== '''Current status of the driver (April 2012)''' ==
 The driver is:
  * Able to read in-db evenly blocked rasters (all blocks with same size)
- * Able to read in-db one-row-rasters:
-   * If the table really has more than one row: using -where clause in connection string
-   * If the table has more than one row: the table must have been marked as "regularly blocked table", with -k in loader
  * Able to generate two kinds of raster objects based on two modes:
 …
 The driver is not:
  * Able to read out-db rasters (developed, but not tested, and with known bugs)
+ * Able to read out-db rasters
  * Able to create new rasters
  * Able to manage all the [http://trac.osgeo.org/postgis/raw-attachment/wiki/WKTRaster/Documentation01/WKTRasterArrangements.gif PostGIS Raster arrangements]
 …
 === Topic: The basis ===
 The main class of a GDAL driver is [http://www.gdal.org/classGDALDataset.html GDALDataset]: A set of associated raster bands. So, 1 GDALDataset must be able to contain:
+  * An untiled image stored in a raster table's row.
+  * A tiled image stored in a raster table (regular or irregular, rectangular or not, with or without missing tiles, with or without overlapping between tiles)
+  * A raster object coverage from the rasterization of a vector coverage stored in a raster table (regular or irregular, rectangular or not, with or without missing tiles, with or without overlapping between tiles)
+  * ~~ An untiled image stored in a raster table's row ~~
+  * ~~ A tiled image stored in a raster table (regular or irregular, rectangular or not, with or without missing tiles, with or without overlapping between tiles) ~~
+  * ~~ A raster object coverage from the rasterization of a vector coverage stored in a raster table (regular or irregular, rectangular or not, with or without missing tiles, with or without overlapping between tiles) ~~
+'''UPDATE''': As Pierre suggested, there are only 2 arrangements:
+  * Regularly tiled raster
+  * Irregularly tiled raster
 In the first case, 1 GDALDataset = 1 PostGIS Raster object. In the other two cases, 1 GDALDataset = Several PostGIS Raster objects. For this reason, the GDAL PostGIS Raster driver has '''2 working modes''': ONE_RASTER_PER_TABLE, ONE_RASTER_PER_ROW.
+Take into account that a raster can contain only 1 tile. In that case, 1 GDALDataset = 1 PostGIS Raster object (= 1 PostGIS Raster table row). Otherwise, 1 GDALDataset = Several PostGIS Raster objects (= several PostGIS Raster rows). For this reason, the GDAL PostGIS Raster driver has '''2 working modes''': ONE_RASTER_PER_TABLE, ONE_RASTER_PER_ROW.
 However, currently the driver only deals with continuous tiled raster layers, when all the raster tiles are the same size, snap to the same grid and do not overlap (the ideal case).
 '''Open question''': Are 2 working modes enough to manage all the raster arrangements?
+'''Open question''': Are 2 working modes enough to manage all the raster arrangements? '''[SOLVED]: YES'''
  Pierre: I think yes. We have to distinguish "what we want to produce" from "what we have to deal with". The two modes answer "what we want to produce" and the different table arrangements are "what we have to deal with".
 …
  Then, the driver has to deal with all the possible arrangements of those selected rows in both modes (overlap, gaps, missing tiles, etc.). You tried to enumerate the possible arrangements above but I think there are only two cases: the tiles are regularly tiled or they are not, whatever the number of tiles (1 or more). To me the irregular case is a generalization of the first one.
+  Jorge: I think we have 3 cases: untiled raster, regularly tiled raster and irregularly tiled raster.
+  ~~ Jorge: I think we have 3 cases: untiled raster, regularly tiled raster and irregularly tiled raster. ~~[[BR]]
+  Jorge: ok, updated
  If, and only if, you can optimize the regularly tiled case, then you write it as an exception. The problem is to make sure the table is REALLY regularly tiled without relying on the user's knowledge. Just the introduction of the -a option to raster2pgsql.py, which allows tiles to be appended to an existing table, makes the "regularly blocked" flag untrustworthy. If we really want to maintain this flag we will have to create something like a ST_ValidateRegularBlocking aggregate function.
 …
    Pierre: Then if we cannot rely on the raster_columns flag and if a ST_ValidateRegularBlocking() would be too slow, we have to treat "regularly tiled" and "irregularly tiled" as one single case, hoping that the "regular" one will be faster because it involves less processing when merging the tiles together.
+    Jorge: Agree.
 ----
 …
 To construct a GDALDataset object, the driver must:
   * Open the dataset (create db connection)
+  * Read some data about the dataset (metadata): srid, georeference information, projection information, raster data size, band information (number of bands, pixel size, color interpretation, if present), any other driver-specific dataset related information (i.e.: in our case, schema and table name)
+  * Construct the structure for raster bands, with instances of [http://www.gdal.org/classGDALRasterBand.html GDALRasterBand] class. You need to provide some basic information: data type (pixel size), block size (GDAL contains a concept of the natural block size of rasters so that applications can organize data access efficiently for some file formats) and color interpretation (if any).
+  * Determine, in a 1st, very fast query to the db, by looking in the raster_overview view, which lower resolution tables are available for the requested raster table
+  * Determine, in a 2nd, fast enough query to the db, the extent and the maximum number of bands of the requested raster by aggregating the extents of all the rasters. This takes about 1 second on 360000 tiles even if there is no index.
+  * Determine, in a 3rd, very fast query to the db, the pixel size & rotation, the band types and the nodata value for each band of ONLY ONE raster (LIMIT 1). The driver should assume those values will be the same for every other raster in the table. If, when fetching the other tiles, it realizes that one differs, we must say that we do not support this arrangement. (I'm still a bit perplexed about the nodata value though.)
+The metadata must be read from the raster table, using SQL functions like ST_Extent (used for raster data extent), ST_Metadata (used for general raster metadata) or functions like ST_SRID, ST_Width, ST_Height, etc. When your GDALDataset matches only one raster row (a raster tile) this is not a problem. But when your GDALDataset matches a whole raster table (ONE_RASTER_PER_TABLE mode), you have 2 options:
+  * Call the functions over the whole table and filter the result (i.e.: select distinct st_srid(rast) from raster_table, select distinct st_metadata(rast) from raster_table). It can be a really slow operation, but you can check whether all tiles are as expected (for example: if they are the same size, if they share the same srid, if they overlap or not, etc)
+ '''Pierre comment''': I think the driver should not try to detect bad raster arrangements with SQL queries. It should just get what it needs from the DB and burn raster tiles as they come. If it detects that the arrangement is regular, then burn them as regular. If they are not, then burn them accordingly.
+  '''Jorge comment''': What does ''accordingly'' mean here? My bet: if the user wants ONE_RASTER_PER_ROW, no problem. Burn one raster file for each tile. If the user wants ONE_RASTER_PER_TABLE and the tiles are not regular, the driver may warn the user and abort or force ONE_RASTER_PER_ROW mode (warning the user first). Any other options?
+   Pierre's response: According to their geolocation. 1) create an empty raster buffer the size of the whole query area 2) query all the raster rows (tiles) you need 3) write pixel values from those tiles to the correct location (deduced from the georeference) in the buffer, one after the other. Whether they are regularly tiled or not does not matter. Just write the last raster last, overwriting any existing values underneath.
+   Jorge: Ok. Understood.
+  * Call the functions limiting the output to one result. Fast operation, but may be incorrect
+ '''Pierre comment''': What might be incorrect? There should not be different srid, pixeltype, or pixelsize in the same table. We have to warn that we do not support this bad arrangement yet.
+  '''Jorge comment''': Related to the previous comment, we could simply warn the user about using that band arrangement and maybe force the ONE_RASTER_PER_ROW arrangement instead.
+Currently, the driver takes the first (and slow) option. That caused performance problems (see ticket #497)
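The burning procedure Pierre describes above (empty buffer, fetch tiles, write each one at the location deduced from its georeference, last tile wins) can be sketched in a few lines. This is only an illustration, not the driver's code: the function name `burn_tiles`, the tuple layout of a tile and the nodata fill value are assumptions, and it supposes north-up tiles without rotation that all share the same pixel size.

```python
def burn_tiles(tiles, origin_x, origin_y, pixel_w, pixel_h, width, height, nodata=0):
    """Burn tiles into a width x height buffer; later tiles overwrite earlier ones.

    Each tile is (upperleft_x, upperleft_y, rows), where rows is a list of
    lists of pixel values. origin_x/origin_y is the upper left corner of the
    buffer in world coordinates; pixel_w and pixel_h are positive sizes.
    """
    buf = [[nodata] * width for _ in range(height)]
    for ulx, uly, rows in tiles:
        # World coordinates of the tile corner -> pixel offsets in the buffer.
        col0 = int(round((ulx - origin_x) / pixel_w))
        row0 = int(round((origin_y - uly) / pixel_h))
        for r, line in enumerate(rows):
            y = row0 + r
            if not 0 <= y < height:
                continue  # row falls outside the requested area
            for c, value in enumerate(line):
                x = col0 + c
                if 0 <= x < width:
                    buf[y][x] = value
    return buf


# Two overlapping 2x2 tiles burned into a 4x4 buffer whose corner is (0, 4).
tiles = [
    (0.0, 4.0, [[1, 1], [1, 1]]),
    (1.0, 3.0, [[2, 2], [2, 2]]),
]
mosaic = burn_tiles(tiles, 0.0, 4.0, 1.0, 1.0, 4, 4)
```

The second tile overlaps the first by one pixel and, being written last, overwrites it there. Nothing in the loop needs to know whether the table is regularly tiled.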
+'''Open question''': How to fetch the information needed to construct the GDALDataset? Pay attention to the fact that '''you are not asking for raster data yet'''. You only need metadata, for constructing the basic GDALDataset object.
+ Pierre: So just knowing the raster extent, the pixelsize and the pixeltype is not sufficient? You could do this with a quick query SELECT ST_Extent(rast), min(ST_BandPixelSize(rast, band)), min(ST_BandPixelType(rast, band)) assuming all tiles have the same pixelsize and pixeltype.
+ Jorge: Yes, it's sufficient. This question is deprecated. Sorry.
+'''Open Question''': If the first query finds a lower resolution table, must the rest of the work be performed with this lower resolution table? At least these 3 queries, until we want to read the actual raster data to burn it into the buffer. The queries should be faster on an overview table, but the pixel size will not be the same when using an overview table instead of the normal resolution table. And you don't read from overviews unless you want to implement decimation because your buffer size is different from your raster size. Am I right?
 ----
 === Topic: !Reading/Writing raster data ===
 Once constructed the basic structure (GDALDataset object and related GDALRasterBand objects), you need to choose the strategy for raster data reading/writing:
+Once the basic structure is constructed (GDALDataset object and related GDALRasterBand objects), we can read/write the data, following this general method: fetch, in a long query, all the rasters along with their world georeferences (upperleftx and upperlefty, width and height) and burn them into the GDAL buffer by converting their world coordinates to the raster coordinates of the buffer.
+  * ''Natural'' block oriented r/w: The driver reads/writes data in equal sized blocks. This is potentially the more efficient way of reading/writing data. The natural block size for the dataset is chosen during GDALRasterBand creation. So, '''it's the driver's responsibility to provide the desired value for block size'''. To use this method, your driver must provide an implementation of [http://www.gdal.org/classGDALRasterBand.html#09e1d83971ddff0b43deffd54ef25eef IReadBlock].
+ '''Pierre question:''' How is the natural block size chosen in our case?
+ '''Jorge''': By ST_Metadata (width and height). In case of non-regular blocked rasters the function raises an error.
+  * Region oriented r/w: The driver reads/writes arbitrary regions of data. It's a potentially less efficient method, because you have to take care of '''data type translation''' if the data type of the buffer is different from that of the GDALRasterBand. You must also take care of '''image decimation / replication''' if the buffer size (nBufXSize x nBufYSize) is different from the size of the region being accessed (nXSize x nYSize). To use this method, your driver must provide an implementation of [http://www.gdal.org/classGDALRasterBand.html#5497e8d29e743ee9177202cb3f61c3c7 IRasterIO].
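To make the decimation / replication requirement of the region oriented method concrete, here is a minimal nearest-neighbour sketch. It is an assumption-laden illustration: the function name `resample_nearest` is made up, GDAL's own resampling may pick source pixels differently, and data type translation is not shown.

```python
def resample_nearest(region, buf_xsize, buf_ysize):
    """Resample a region (list of pixel rows) to buf_ysize rows of buf_xsize
    pixels using nearest-neighbour selection.

    A smaller buffer decimates the region, a larger one replicates pixels.
    """
    n_ysize = len(region)
    n_xsize = len(region[0])
    out = []
    for by in range(buf_ysize):
        # Source row that corresponds to this buffer row.
        sy = min(n_ysize - 1, (by * n_ysize) // buf_ysize)
        src_row = region[sy]
        out.append([
            src_row[min(n_xsize - 1, (bx * n_xsize) // buf_xsize)]
            for bx in range(buf_xsize)
        ])
    return out


region = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
decimated = resample_nearest(region, 2, 2)             # 4x4 region -> 2x2 buffer
replicated = resample_nearest([[1, 2], [3, 4]], 4, 4)  # 2x2 region -> 4x4 buffer
```

Here `region` plays the role of the nXSize x nYSize window and the two size arguments play the role of nBufXSize and nBufYSize.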
+We choose the region oriented approach. Please read the pseudocode Pierre suggested and Jorge's comments:
+'''Pierre:''' Can't it be as simple as the pseudo code below? In the best case the required blocks fit what is in the table and everything is optimized. If not, it is slower. We don't have to know in advance whether the table is regularly tiled or not.
+More specifically:
 {{{
 …
 }}}
+'''Jorge:''' Basically correct, but I need to add some notes here:
+This algorithm must be developed in the implementation of the IRasterIO method of the rasterband class. In the best case the required blocks fit what is in the table and everything is optimized. If not, it is slower. We don't have to know in advance whether the table is regularly tiled or not.
+Basic GDAL I/O follows this scheme: GDALDataset::RasterIO --> GDALDataset::IRasterIO --> (for each band) GDALRasterBand::RasterIO --> Divide requested window in blocks, matching GDAL block size, specified in band constructor --> Look for block data in cache --> (if data not in cache) ReadBlock --> IReadBlock (your driver's implementation)[[BR]]
+Many raster formats use tiles, and GDAL is prepared for that with the ''block'' concept. So, if a raster coverage is divided into tiles, GDAL defines the ''natural block size'' as ''the block size that is most efficient for accessing the format'', understanding ''block'' as ''raster tile''. For this reason, most drivers implement their own version of the IReadBlock method. They delegate the I/O system to the GDAL core, and only provide blocks of data when required. GDAL also provides a block cache.[[BR]]
+The above scheme has a problem: you need a fixed block size. All the blocks of a given coverage have the same width and height. And the regularly blocked scheme is only one of the possible data arrangements PostGIS Raster provides.[[BR]]
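The "divide requested window in blocks" step of the scheme above, and why it depends on a fixed block size, can be illustrated with a little integer arithmetic. This is a sketch of the idea only, not GDAL's implementation; the function name `blocks_for_window` and its parameters are hypothetical.

```python
def blocks_for_window(x_off, y_off, x_size, y_size, block_x, block_y):
    """Return the (block_col, block_row) indexes of every fixed-size block
    touched by a requested window, in row-major order.

    x_off/y_off is the upper left pixel of the window, x_size/y_size its
    size in pixels, block_x/block_y the fixed block size.
    """
    first_col = x_off // block_x
    last_col = (x_off + x_size - 1) // block_x
    first_row = y_off // block_y
    last_row = (y_off + y_size - 1) // block_y
    return [
        (bx, by)
        for by in range(first_row, last_row + 1)
        for bx in range(first_col, last_col + 1)
    ]


# A 200x100 window starting at pixel (100, 50), with 128x128 blocks.
needed = blocks_for_window(100, 50, 200, 100, 128, 128)
```

The integer divisions only make sense when every block has the same width and height, which is exactly what an irregularly tiled table cannot guarantee.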
+To allow all the possible data arrangements, the PostGIS Raster driver must go deeper into the GDAL I/O system. It must implement its own version of the IRasterIO method, at Dataset level, for two main reasons:
+  * IRasterIO is a GDALDataset method that allows I/O of an arbitrarily sized region of the data. This is useful in cases where blocked I/O isn't the optimal way or isn't even possible. This is PostGIS Raster's case (without a fixed block size, there's no blocked I/O)
+  * Getting all the data in a given region (no matter how many raster rows that is) requires only one SQL query. Minimizing server round trips matters for PostGIS Raster (more speed, better performance)[[BR]]
+So, basically the pseudocode for IRasterIO is right, but as GDAL is prepared for blocked I/O with same-size blocks and PostGIS Raster can't ensure that all the coverage tiles have the same size, we must make a decision:
+  * Get rid of the GDAL block concept and simply get all the requested data in a query from IRasterIO method. That implies getting rid of GDAL block cache too.
+  * Adapt our ''messy'' arrangement to GDAL block system, by dividing the data fetched from database in equally sized blocks, and manually adding them to the cache. An important thing here is what happens when some raster data, stored in the cache, is modified. Using the GDAL cache system, you can simply replace the data in the cache and mark the block as ''dirty'' (using GDALRasterBlock::MarkDirty method), but... '''Q: Are the dirty blocks written to disk (database, in our case) by IWriteBlock method? If not, how?''' '''Response (jorgearevalo): Yes, IWriteBlock persists the blocks to database'''
+After a conversation between Pierre Racine and Jorge Arevalo, we agreed on the approach of getting all data in IRasterIO, dividing it into equally sized blocks and storing them in the cache. Jorge Arevalo is working on it (September 2011).
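The agreed approach (fetch the whole region once, then cut it into equally sized blocks for the cache) could look roughly like the sketch below. A plain dictionary stands in for the GDAL block cache, and the name `split_into_blocks`, the nodata padding of edge blocks and the assumption that the region starts on a block boundary are all illustrative choices, not taken from the driver.

```python
def split_into_blocks(region, block_x, block_y, nodata=0):
    """Cut a fetched region (list of pixel rows) into equally sized blocks.

    Returns a dict keyed by (block_col, block_row). Blocks on the right and
    bottom edges are padded with nodata so that every block has the same size.
    """
    height = len(region)
    width = len(region[0])
    blocks = {}
    for by in range(0, height, block_y):
        for bx in range(0, width, block_x):
            block = []
            for y in range(by, by + block_y):
                # Rows past the bottom edge are entirely padding.
                row = region[y][bx:bx + block_x] if y < height else []
                block.append(row + [nodata] * (block_x - len(row)))
            blocks[(bx // block_x, by // block_y)] = block
    return blocks


# A 3x3 region cut into 2x2 blocks gives four blocks, three of them padded.
cache = split_into_blocks([[1, 2, 3], [4, 5, 6], [7, 8, 9]], 2, 2)
```

In the real driver each block would be handed to the GDAL cache rather than kept in a dictionary, and modified blocks would be marked dirty so that IWriteBlock persists them, as noted above.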
+About the IReadBlock method:
+About the IReadBlock method (to be implemented in the rasterband class):
 {{{
 …
+}
 }}}
-We should get rid of it, for the reasons explained above. We aren't working with equally sized rasters, in general. And even in that case, getting all the requested data in one query with IRasterIO method is a better approach (faster).