PostGIS WKT Raster Tutorial 1 - Intersecting vector polygons with large raster coverage using PostGIS WKT Raster

Pierre Racine, June 2010

Introduction

This tutorial will show you how to do a very classical raster/vector overlay analysis with PostGIS WKT Raster. You will get more info on PostGIS WKT Raster in this page but we can say that WKT Raster...

  • is the raster format introduced by PostGIS
  • supports multiband, nodata value, georeference, overviews, overlapping tiles and non rectangular coverages
  • is not really limited in size (PostgreSQL has a limit of 32 TB)
  • is very well integrated with the existing PostGIS geometry type allowing for example seamless and efficient intersections operations with vector tables
  • comes with a very versatile Python raster loader (supporting batch loading through wildcards and as many input formats as GDAL does)
  • is not only a raster format but a SQL raster manipulation and analysis API

NOTE: Since its integration into PostGIS 2.0, PostGIS WKT Raster is now simply called "PostGIS raster". Note also that "gdal2wktraster.py" has been renamed "raster2pgsql.py" in PostGIS 2.0.

Basically the problem introduced in this tutorial is to compute the mean elevation for buffers surrounding a series of point representing caribou observations. You can do this kind of analysis pretty easily in any GIS package but what is special here and not easy to do in ANY GIS package is the size of the datasets involved (900 MB of raster data), the simplicity of the queries and the speed at which we will get the results.

We will describe the steps mainly on Windows, but Linux gurus will have no problem writing the equivalent commands in their favorite OS. We assume that PostgreSQL, PostGIS with WKT Raster are well installed. Refer to the readme file of each software to install them properly. The software versions used are:

We will use two different datasets:

  • The first dataset is a point shapefile of 814 caribou observations done by airplane in the northern part of the province of Quebec by the provincial ministry of natural ressources between 1999 and 2005. We do not provide this file. We will show later in the tutorial how you can easily produce a similar random point cover. You can also use your set of point.
  • The second dataset is a series of 13 SRTM elevation rasters covering most of the province of Quebec. These free rasters, available on the web, are 6000x6000 pixels each, have a pixel size of about 60 x 90 meters and weight a total of 900 MB once unzipped. There are many versions of SRTM. We choose the one at  ftp://ftp.ciat.cgiar.org/DAPA/edarague/P03_30x30_cut/TIFFs/ . You do not have to download the same files we used. Just download ten of them in the area of your choice. You can find a pretty good description of this dataset in Wikipedia ( http://en.wikipedia.org/wiki/SRTM).

We will import the shapefile and the raster coverage into PostGIS, create buffers around each point and then intersect those buffers with the elevation coverage in order to get a weighted mean elevation for each buffer. We will have a graphical look at the different results along the way using OpenJUMP.

Loading the caribou points shapefile into PostGIS

We describe these steps for those who want to use their own point dataset and who do not know PostGIS. If you don't have your own dataset, you can jump directly to the next section.

This operation is very straightforward for any casual PostGIS user. There are many ways to import a shapefile into PostGIS. You can use the classical loader provided with PostGIS (shp2pgsql.exe) or the GUI one integrated into pgAdmin III. You can also use GDAL/OGR, OpenJUMP or QGIS. The classical PostGIS loader and GDAL/ORG are command line programs and PostGIS GUI, OpenJUMP and QGIS are graphical user interfaces (GUI). PostGIS WKT Raster does not have a GUI loader yet. It comes with a Python command line loader (gdal2wktraster.py, thanks to Mateusz Loskot) very similar to PostGIS shp2pgsql.exe. We will import both datasets using those loaders in order to appreciate the similitudes.

Here is the procedure to import the caribou point shapefile:

Launch a command prompt (or a shell under Linux) and make sure you can execute "shp2pgsql". If this works, you should get the help for the shp2pgsql command. If not, make sure the folder where it is located ("PostgreSQL_Installation_Path/bin") is in you PATH environment variable or simply type the complete path to shp2pgsql like this:

 >"C:/Program Files/PostgreSQL/8.4/bin/shp2pgsql"

shp2pgsql works in two steps: The first step is to create a .sql file that you then have to execute using "psql", an application included with PostgreSQL. To convert the caribou point shapefile in "cariboupoint.sql", we do the following:

 >shp2pgsql -s 32198 -I C:\Temp\TutData\cariboupoints.shp > 
        C:\Temp\TutData\cariboupoints.sql

The -s option tells the loader that the shapefile points are stored in the Quebec Lambert spatial reference system which has SRID number 32198 in the PostGIS spatial_ref_sys table.

The -I tell the loader to add a "CREATE INDEX" SQL command which will create a spatial index on the table.

The resulting sql commands are copied to the .sql file using the ">" pipe command.

Assuming you have already created a database when you installed PostGIS and that you named it "tutorial01", you can then execute the sql file with "psql" like this:

 >psql -f C:\Temp\TutData\cariboupoints.sql tutorial01

Visualizing the caribou point in OpenJUMP

The favorite tool of the PostGIS community to display geometries loaded into PostGIS is OpenJUMP. This java open source software is very easy to install and work very well with PostGIS. Once OpenJUMP have been installed, you can view the caribou points by following these steps:

  1. Start OpenJUMP and select "Layer->Run Datastore Query" from the menu.
  1. Create a connection to PostGIS by clicking the icon on the right of the connection field and adding a new connection with the "name" of your choice, the "localhost" server at port "5432", on your database instance (here "tutorial01") with your PostgreSQL "username" and "password". Click OK to establish the connection to PostGIS.
  1. In the query field, type the following command:
 SELECT id, ST_AsBinary(the_geom)
 FROM cariboupoints;

PostGIS geometries have to be converted with the ST_Asbinary() function for OpenJUMP to be able to parse them properly. You should see your caribou observation points appear in the OpenJUMP map window.

The side image show two views, one full extent and one zoomed, on the 814 points scatered on the width of the province of Quebec with the SRTM coverage that we are about to load in the database.

Loading the SRTM elevation rasters into PostGIS WKT Raster

The process to load the elevation raster data is very similar. The name of the loader is gdal2wktraster.py. It is a Python program dependent on the Python package and the GDAL Python package. It is located in the same folder as shp2pgsql. You can call it directly or with its full path if you did not change your PATH environment variable.

To display gdal2wktraster.py help, simply do:

 >gdal2wktraster.py -h

We downloaded 13 zipped files, the ones covering our caribou dataset. Since we don't provide you with the caribou points (you will generate them, randomly yourself), you can download the SRTM files you want. Download about ten of them for a real example. Once downloaded, if you have 7-Zip installed, you can simply select all the zip files, right-click on them and select "7-Zip->Extract files..." 7-Zip will extract every TIF file in the same folder. Each file is 6000 pixels wide by 6000 pixels high. You can quickly preview the images if you have a good images viewer like IrfanView installed.

Efficient raster/vector analysis with PostGIS WKT Raster requires raster files to be split into tiles when loaded in the database. Fortunately gdal2wktraster.py has an option to tile the images the size we like. We will tile them into 50 pixels x 50 pixels resulting in 6000/50 * 6000/50 * 13 = 187200 tiles. gdal2wktraster.py also accepts wildcard so we will be able to load our 13 big rasters in one unique command.

The command to load the raster files in PostGIS WKT Raster looks like:

 >gdal2wktraster.py -r C:\Temp\TutData\SRTM\tif\*.tif -t srtm_tiled 
      -s 4326 -k 50x50 -I > C:\Temp\TutData\SRTM\srtm.sql

The -r option indicate the series of raster we want to load in the database. You can use *, ?, and character ranges expressed with [] like in a Unix shell to match as many rasters as you wish.

The -t option specify the table in which we want to load the raster coverage.

The -s option, identical to shp2pgsql.exe one, is required to specify the spatial reference system ID. In this case the raster are in "WGS 84" having the SRID number 4326. Unlike some GIS software, PostGIS does not support on the fly reprojection so that we cannot do operations on table stored with different spatial reference systems. As we could see, the caribou point layer and the SRTM images are in different spatial reference systems. The points are in "NAD 83/Quebec Lambert" and SRTM are in "WGS 84". We will have to deal with this problem later.

The -k option specify the size of the tiles we want to load in PostGIS. Every input raster will be split into 50x50 pixels tiles. This dimension is a good compromise allowing efficient raster/vector analysis. It is better if the size of the tiles is a divider of the size of each raster. Otherwise the last colomns and rows of tiles of each raster will be filled with nodata values. This might have an impact on performance but not on the result since WKT Raster analysis functions ignore nodata values.

The -I option tells the loader to create a spatial index on the raster tile table. The index is very important as it allow PostGIS WKT Raster to restrict his computing efforts only to the tiles involved in a spatial operation. In this tutorial for example, the intersection operations will be performed only on the tiles that actually intersects with the caribou points and it is much faster to search for those tiles if they are spatially indexed than try them one after the other sequentially in the raster table.

The result of the gdal2wktraster.py command is a 1.8 GB .sql file produced in about one minute (on my brand new Lenovo X201 laptop - Intel Core i5, 1.17 GHz, 3GB of RAM :-).

The same way we loaded the caribou point sql command file, we will load this sql file using "psql":

 >psql -f C:\Temp\TutData\SRTM\srtm.sql tutorial01

This process took less than 4 minutes. You can quickly verify the success of the loading operation by looking at the number of row present in the "srtm_tiled" table. There should be 187200 rows.

If you forgot to add the -I option to gdal2wktraster.py, you can always add the index afterward by executing a SQL command similar to this one in pgAdmin III:

 CREATE INDEX srtm_tiled_rast_gist_idx 
 ON srtm_tiled USING GIST (ST_ConvexHull(rast));

A quick overview of the properties of the raster table (upper left corner coordinates, width and height, pixel sizes, skews - or rotations, SRID, number of band, pixel type, has nodata value, nodata value, is out db) can be displayed with the following command:

 SELECT (md).*, (bmd).* 
 FROM (SELECT ST_Metadata(rast) AS md, 
              ST_BandMetadata(rast) AS bmd 
       FROM srtm_tiled LIMIT 1
      ) foo;

We displayed the metadata for only one row (tile) but except for the upperleftx and upperlefty, every column should be identical for all the rows.

You can then visualize the extent of each of those 187200 raster tiles by typing the following command in the OpenJUMP "Run Datastore Query" dialog:

 SELECT rid, ST_AsBinary(rast::geometry) 
 FROM srtm_tiled;

"rast::geometry" cast the raster envelope (actually the convex hull) to a geometry thus making OpenJump? to display 187200 polygons.

As the rasters in the "srtm_tiled" table are in a different spatial reference system than the point you might not see them overlap with the caribou points. Right-clic on the added layer and select "Zoom To Layer" to adjust the display to their extent and see them.

You can also view the complete area covered by the raster coverage as one unique geometry by doing this query:

 SELECT ST_AsBinary(ST_Buffer(ST_Union(rast::geometry), 0.000001)) 
 FROM srtm_tiled;

ST_Union assemble all the raster envelope into one unique polygon and ST_Buffer make sure it become a simple polygon.

As you can see, unlike most raster coverage loaded into a GIS, the area covered by this single table raster coverage is not strictly rectangular and the two big missing square areas are not filled with nodata values. Here you have only one table for the whole coverage which is build from many raster files loaded in a single step. You don't need 13 tables to store 13 rasters. We did this with one band SRTM files in TIF format, but it could have been 3 bands BMP files or JPEG and the number of files and the total size of the coverage does not really matter (PostgreSQL has a limit of 32 terabytes)...

If you want to go to the pixel level and verify the integrity of the values associated to a sample of the raster, you can (thanks to Jorge Arevalo and GDAL) view a vectorization of some tiles in OpenJUMP (as in the right side image) using the ST_DumpAsPolygons function. Just type the following SQL query in the same query dialog:

 SELECT ST_AsBinary((ST_DumpAsPolygons(rast)).geom), 
        (ST_DumpAsPolygons(rast)).val
 FROM srtm_tiled
 WHERE rid=1869;

ST_DumpAsPolygons returns polygons representing raster area sharing a single value and the associated value as a complex "geomval" object. The first column returned by this query is the geometry and the second the associated value.

Vectorizing all the table would have been way too long and the huge number of geometry would have been impossible to load in OpenJUMP (and most GIS software), this is why we limit the conversion to the tile number 3278.

First notice: some pixels are grouped together into complex polygons. ST_DumpAsPolygons() not only blindly dump each pixels, it actually vectorize the raster grouping areas with shared values together. In a raster having a low range of values, this would result in a significantly lower number of polygons than the number of pixels. A higher number of different values (like continuous floating point value rasters) would result in many, many little polygons, most of them square corresponding to only one pixel. A future ST_Reclass() function should help reducing the number of values in a raster and give more analysis flexibility. For now you can always reclassify your raster BEFORE loading them into PostGIS WKT Raster.

This query includes the value associated each group of pixel. You can display those values by right-clicking on the OpenJUMP layer, selecting "Change Styles...", selecting the "Labels" tab, "Enabling labelling" and selecting the "val" Label attribute.

You will also notice that a small area in the upper left corner of the vectorized tile is missing. This is an area with nodata value pixels. Any analysis function in PostGIS WKT Raster takes the nodata value into account. This means that those areas will be omitted from the operation. For example in the upcoming intersection operation, if a buffer completely overlaps a nodata value area of one tile, st_intersects(rast, geom) will return false for this tile/buffer couple and st_intersection(rast, geom) will return an "EMPTY GEOMETRY". If the buffer partially overlap a nodata value area and intersect other parts of the tile, st_intersects(rast, geom) will return true for this tile/buffer couple and st_intersection(rast, geom) will only returns parts of the tile with values, omiting parts with nodata values.

You can verify that your raster has a nodatavalues set and the effective value of the nodata value by doing this query in pgAdmin III:

 SELECT ST_BandHasNodataValue(rast), ST_BandNodataValue(rast)
 FROM srtm_tiled
 LIMIT 1;

If you want a query to ignore the nodata value and treat it like any other value of the raster, you can always replace the rast parameters with ST_SetBandHasNodataValue(rast, FALSE) like in this OpenJUMP query:

 SELECT ST_AsBinary(
         (ST_DumpAsPolygons(ST_SetBandHasNodataValue(rast, FALSE))).geom), 
         (ST_DumpAsPolygons(ST_SetBandHasNodataValue(rast, FALSE))).val
 FROM srtm_tiled
 WHERE rid=1869;

NOTE: In PostGIS 2.0 the "ST_BandNodataValue(rast)" and "ST_SetBandNodataValue(rast, nodatavalue)" support the NULL value so that the "ST_BandHasNodataValue(rast)" and the "ST_SetBandHasNodataValue(rast, bool)" functions are now respectively replaced by "NOT ST_BandNodataValue(rast) IS NULL" and "ST_SetBandNodataValue(rast, NULL)". The two previous queries in PostGIS 2.0 would be:

 SELECT NOT ST_BandNodataValue(rast) IS NULL, ST_BandNodataValue(rast)
 FROM srtm_tiled
 LIMIT 1;

and

SELECT ST_AsBinary(
         (ST_DumpAsPolygons(ST_SetBandNodataValue(rast, NULL))).geom), 
         (ST_DumpAsPolygons(ST_SetBandNodataValue(rast, NULL))).val
 FROM srtm_tiled
 WHERE rid=1869;

Producing a random point table similar to the caribou shapefile

Now that we know the extent of our elevation data, we can produce a table similar to the caribou_point one. For this we will use a function that generate a certain number of point geometry inside a polygon. Copy this function in the pgAdmin III query window and execute it:

 CREATE OR REPLACE FUNCTION ST_RandomPoints(geom geometry, nb int) 
 RETURNS SETOF geometry AS
 $$
 DECLARE
     pt geometry;
     xmin float8;
     xmax float8;
     ymin float8;
     ymax float8;
     xrange float8;
     yrange float8;
     srid int;
     count integer := 0;
     bcontains boolean := FALSE;
     gtype text;
 BEGIN
     SELECT ST_GeometryType(geom)
     INTO gtype;
     IF ( gtype != 'ST_Polygon' ) AND ( gtype != 'ST_MultiPolygon' ) THEN
         RAISE EXCEPTION 'Attempting to get random point in a non polygon geometry';
     END IF;
     SELECT ST_XMin(geom), ST_XMax(geom), ST_YMin(geom), ST_YMax(geom), ST_SRID(geom)
     INTO xmin, xmax, ymin, ymax, srid;
     SELECT xmax - xmin, ymax - ymin
     INTO xrange, yrange;
     WHILE count < nb LOOP
         SELECT ST_SetSRID(ST_MakePoint(xmin + xrange * random(), ymin + yrange * random()), srid)
         INTO pt;
         SELECT ST_Contains(geom, pt)
         INTO bcontains;
         IF bcontains THEN
             count := count + 1;
             RETURN NEXT pt;
         END IF;
     END LOOP;
     RETURN;
 END;
 $$
 LANGUAGE 'plpgsql' IMMUTABLE STRICT;

You can then generate a fake cariboupoints table, in the same spatial reference system (NAD 83/Quebec Lambert) and with 814 points like the original one doing this query:

 CREATE TABLE cariboupoints AS
 SELECT generate_series(1, 814) AS id, 
        ST_Transform(ST_RandomPoints(the_geom, 814), 32198) AS the_geom
 FROM (SELECT ST_SetSRID(ST_Extent(rast::geometry), 4326) AS the_geom
       FROM srtm_tiled
      ) foo;

Similarly you can visualize your own caribou point layer with this OpenJUMP query:

 SELECT id, ST_AsBinary(ST_Transform(the_geom, 4326))
 FROM cariboupoints;

Note that we have to reproject the point to the same reference system our raster table is stored in. This fake caribou layer is distributed very differently than our original one and does not really look like a caribou distribution but this does not matter for the rest of the tutorial.

Now that we have our two coverages loaded in the database and that we are confident about their values and spatial reference, let's start doing some analysis queries...

Making buffers around the caribou points

This operation is also very straightforward for any casual PostGIS user. The caribou point table is in a spatial reference system suitable for measurements in meters (32198) so we can build our buffers in this system and then convert them to the raster one (4326) later in order to intersects them. We will make 1 km buffers around our points and save them as a new table. We do this operation in a pgAdmin III query window (not in OpenJUMP):

 CREATE TABLE cariboupoint_buffers AS 
 SELECT id, ST_Buffer(the_geom, 1000) AS the_geom
 FROM cariboupoints;

Again you can display your buffers in OpenJUMP with this query:

 SELECT id, ST_AsBinary(ST_Transform(the_geom, 4326)) 
 FROM cariboupoint_buffers;

The buffers have and oval shape because they are displayed in a non-projected spatial reference system (WGS 84) to fit with the raster layer. In the database they were created in a projected coordinate system (NAD 83/Quebec Lambert) and are truly circle all of equal area. You can observe that southern buffers, again because they are displayed in a non-projected spatial reference system, are smaller (less wide) than the northern ones. If we would display everything in NAD 83/Quebec Lambert, they would all be of the same circle shape. But we can't do this because our raster are in WGS 84 and PostGIS do not support raster reprojection yet. We then reproject our buffers to WGS 84 as a new table in order to intersect them with our raster coverage:

 CREATE TABLE cariboupoint_buffers_wgs AS 
 SELECT id, ST_Transform(the_geom, 4326) AS the_geom
 FROM cariboupoint_buffers;

Again the same kind of query to display them in OpenJUMP and confirm that they overlap with the raster coverage without having to reproject them:

 SELECT id, ST_AsBinary(the_geom) 
 FROM cariboupoint_buffers_wgs;

We could have done those two queries in one:

 CREATE TABLE cariboupoint_buffers_wgs AS 
 SELECT id, ST_Transform(ST_Buffer(the_geom, 1000), 4326) AS the_geom
 FROM cariboupoints;

We then create a spatial index on this last table to make sure the next intersection operation will be done as quickly as possible:

 CREATE INDEX cariboupoint_buffers_wgs_geom_idx ON cariboupoint_buffers_wgs USING GIST (the_geom);

Intersecting the caribou buffers with the elevation rasters

Our two coverages are now ready to be intersected. We will use the st_intersection() function and st_intersects() operator which operates directly on raster and geometries by vectorizing only the necessary part of the raster before doing the intersection and is one of the main feature of PostGIS WKT Raster:

 CREATE TABLE caribou_srtm_inter AS
 SELECT id, 
        (ST_Intersection(rast, the_geom)).geom AS the_geom, 
        (ST_Intersection(rast, the_geom)).val
 FROM cariboupoint_buffers_wgs, 
      srtm_tiled
 WHERE ST_Intersects(rast, the_geom);

This query takes about 13 minutes to complete. It can greatly be optimized by invoquing the st_intersection only once with the help of a subquery producing record values that we split later in the main query like this:

 CREATE TABLE caribou_srtm_inter AS
 SELECT id, 
        (gv).geom AS the_geom, 
        (gv).val
 FROM (SELECT id, 
              ST_Intersection(rast, the_geom) AS gv
       FROM srtm_tiled,
            cariboupoint_buffers_wgs
       WHERE ST_Intersects(rast, the_geom)
      ) foo;

As its geometry/geometry sister, ST_Intersects(geometry, raster, band) returns TRUE if the withvalue area of a raster or a raster tile intersects a geometry and ST_Intersection(geometry, raster, band) returns the geometry/value set of geometries representing the intersection between the geometry and each polygonized group of pixel sharing a same value from the raster and its associated value. It is important to note here that ST_Intersects and ST_Intersection take the nodata value into account. That means if a polygon fall entirely in a nodata value area of a raster, ST_Intersects return FALSE and ST_Intersection returns an EMPTY GEOMETRY. This is a very important raster analysis feature.

The second version of the query takes only 7.4 minutes to complete. Almost half the time spent by the first one... This little increase in complexity is certainly worth the gain in performance. The size of the tiles has also a major effect here. If we restart the whole process with 200 pixels x 200 pixels tiles the intersection operation takes 47 minutes. 100 pixels x 100 pixels: 17 minutes... 30 pixels x 30 pixels: 5 minutes. On the other side, more tiles there is to write in the sql file, longer the loading process. You choose... Smaller tiles is generally a better choice.

Once again you can visualise the result of the query as in the right side image by doing this query in OpenJUMP (you might have to increase RAM allowed to OpenJUMP by changing "-Xmx256M" to "-Xmx1024M" in "OpenJUMPInstallPath/bin/openjump.bat"):

 SELECT id, val, ST_AsBinary(the_geom) 
 FROM caribou_srtm_inter;

One approach to make this operation without PostGIS WKT Raster would have been to first convert our 13 rasters to 13 shapefiles, import them in PostGIS and do a traditional intersection query between geometries only. The problem is that converting only one of those SRTM file is very difficult... The vector version of only one of those files exceed the limit of 2 GB imposed on shapefiles and the GML equivalent file is more than 10 GB... (130 GB for the whole coverage). The WKT Raster approach goes like this: "since it is mostly impossible to convert those rasters to vector, load them all into PostGIS as they are and WKT Raster will vectorized only the what is needed to do the requested operation", this using the very efficient GiST index, the versatility of GDAL and the power of the postGIS vector engine.

Summarizing the elevation values for each buffer and exporting the table to CSV

The last step of our analysis is to summarize the elevation for each buffer by computing 814 weighted mean elevations. "Weighted mean elevation" means that if the area occupied by pixels with an elevation equal to 234 m is greater than the area occupy pixels with an elevation equal to 56 m, then 234 has a higher weight in the equation (proportional to its relative area). We also have to convert all the geometries resulting from the intersection operation to a projected spatial reference system appropriate for area measurements, in this case the same one in which was the caribou layer originally (Quebec Lambert, SRID 32198). This query should do the work:

 CREATE TABLE result01 AS
 SELECT id, 
        sum(ST_Area(ST_Transform(the_geom, 32198)) * val) / 
        sum(ST_Area(ST_Transform(the_geom, 32198))) AS meanelev
 FROM caribou_srtm_inter
 GROUP BY id
 ORDER BY id;

Finally if you want to export the resulting table to a CSV file you can do the following command in the same pgAdmin III query window:

 COPY result01 TO 'C:/temp/tutdata/result01.csv' 
 WITH DELIMITER ',' CSV HEADER;

Conclusion

We have used the most powerful vector database engine around (PostGIS) to intersect our data buffers with a big raster datasets. For those who already know PostGIS, most of the queries are very familiar. For the newcomers, they might appear a bit complicated, but the speed at which we get the resulting table is worth the effort. The longuest step was to load the data. Once this is done, we can make as many quick queries as we wish. PostGIS with WKT Raster is a serious tool for accomplishing serious analysis work.

You can find more info on PostGIS WKT Raster in this page. All the functions are documented in the  WKT Raster Reference,  WKT Raster FAQ Reference

Attachments