= PostGIS WKT Raster Tutorial 1 - Intersecting vector buffers with large raster coverage using PostGIS WKT Raster = ''Pierre Racine, June 2010'' == Introduction == This tutorial will show you how to do a very classical raster/vector analysis with PostGIS WKT Raster. You will get more info on PostGIS WKT Raster [wiki:WKTRaster in this page] but we can say that WKT Raster... * is the raster format introduced by PostGIS * supports multiband, nodata value, georeference, overviews, overlapping tiles and non rectangular coverages * is not really limited in size (PostgreSQL has a limit of 32 TB) * is very well integrated with the existing PostGIS geometry type allowing for example seamless and efficient intersections operations with vector tables * comes with a very versatile Python raster loader (supporting batch loading through wildcards and as many input formats as GDAL does) * is not only a raster format but a SQL raster manipulation and analysis API Basically the problem introduced in this tutorial is to compute the mean elevation for buffers surrounding a series of point representing caribou observations. You can do this kind of analysis pretty easily in any GIS package but what is special here and not easy to do in ANY GIS package is the size of the datasets involved (900 MB of raster data), the simplicity of the queries and the speed at which we will get the results. We will describe the steps mainly on Windows, but Linux gurus will have no problem writing the equivalent commands in their favorite OS. We assume that PostgreSQL, PostGIS with WKT Raster are well installed. Refer to the readme file of each software to install them properly. The version used are: * [http://www.postgresql.org/ PostgreSQL 8.4] * [http://postgis.refractions.net/download/ PostGIS 1.5.1] * [http://trac.osgeo.org/postgis/wiki/WKTRaster WKT Raster]: We compiled the last code available. * GDAL 1.6: You will have to compile it before WKT Raster. * [http://www.python.org/download/ Python 2.6] or 2.5 * [http://download.osgeo.org/gdal/win32/1.6/gdalwin32exe160.zip GDAL 1.6 binaries] for GDAL for Python * [http://pypi.python.org/pypi/GDAL/1.6.1 GDAL 1.6 for Python 2.6] (or GDAL 1.6 for Python 2.5) * [http://sourceforge.net/projects/numpy/files/ Numpy 1.4 for Python 2.6] * [http://www.openjump.org/ OpenJUMP 1.3] We will use two different datasets: * The first dataset is a point shapefile of 814 caribou observations done by airplane in the northern part of the province of Quebec by the provincial ministry of natural ressources between 1999 and 2005. We will see later in the tutorial how you can easily produce a similar point cover. * The second dataset is a series of 13 SRTM elevation rasters covering most of the province of Quebec. These free rasters, available on the web, are 6000x6000 pixels each, have a pixel size of about 60 x 90 meters and weight a total of 900 MB once unzipped. There are many versions of SRTM. We choose the one at ftp://srtm.csi.cgiar.org/. You can find a pretty good description of this dataset in Wikipedia (http://en.wikipedia.org/wiki/SRTM). We will import the shapefile and the raster coverage into PostGIS, create buffers around each point and then intersect those buffers with the elevation coverage in order to get a weighted mean elevation for each buffer. We will have a graphical look at the different results along the way using OpenJUMP. == Loading the caribou points shapefile into PostGIS == This operation is very straightforward for any casual PostGIS user. There are many ways to import a shapefile into PostGIS. You can use the classical loader provided with PostGIS (shp2pgsql.exe) or the GUI one integrated into pgAdmin III. You can also use GDAL/OGR, OpenJUMP or QGIS. The classical PostGIS loader and GDAL/ORG are command line programs and PostGIS GUI, OpenJUMP and QGIS are graphical user interfaces (GUI). PostGIS WKT Raster does not have a GUI loader yet. It comes with a Python command line loader (gdal2wktraster.py, thanks to Mateusz Loskot) very similar to PostGIS shp2pgsql.exe. We will import both datasets using those loaders in order to appreciate the similitudes. Here is the procedure to import the caribou point shapefile: Launch a command prompt (or a shell under Linux) and make sure you can execute "shp2pgsql". If this works, you should get the help for the shp2pgsql command. If not, make sure the folder where it is located ("PostgreSQL_Installation_Path/bin") is in you PATH environment variable or simply type the complete path to shp2pgsql like this: {{{ >"C:/Program Files/PostgreSQL/8.4/bin/shp2pgsql" }}} shp2pgsql works in two steps: The first step is to create a .sql file that you then have to execute using "psql", an application included with PostgreSQL. To convert the caribou point shapefile in "cariboupoint.sql", we do the following: {{{ >"C:/Program Files/PostgreSQL/8.4/bin/shp2pgsql" -s 32198 -I C:\Temp\TutData\cariboupoints.shp > C:\Temp\TutData\cariboupoints.sql }}} The -s option tells the loader that the shapefile points are stored in the Quebec Lambert spatial reference system which has SRID number 32198 in the PostGIS spatial_ref_sys table. The -I tell the loader to add a line to create a spatial index on the table. The resulting sql commands are copied to the .sql file using the ">" pipe command. Assuming you have already created a database when you installed PostGIS and that you named it "tutorial01", you can then execute the sql file with "psql" like this: {{{ >"C:/Program Files/PostgreSQL/8.4/bin/psql" -f C:\Temp\TutData\cariboupoints.sql tutorial01 }}} == Visualizing the caribou point in OpenJUMP == The favorite tool of the PostGIS community to display geometries loaded into PostGIS is OpenJUMP. This java open source software is very easy to install and work very well with PostGIS. Once OpenJUMP have been installed, you can view the caribou points by following these steps: 1. Start OpenJUMP and select "Layer->Run Datastore Query" from the menu. 2. Create a connection to PostGIS by clicking the icon on the right of the connection field and adding a new connection with the "name" of your choice, the "localhost" server at port "5432", on your database instance (here "tutorial01") with your PostgreSQL "username" and "password". Click OK to establish the connection to PostGIS. 3. In the query field, type the following command: {{{ SELECT id, ST_AsBinary(the_geom) FROM cariboupoints; }}} PostGIS geometries have to be converted with the ST_Asbinary() function for OpenJUMP to be able to parse them properly. You should see your caribou observation points appear in the OpenJUMP map window. == Loading the SRTM elevation rasters into PostGIS WKT Raster == The process to load the elevation raster data is very similar. The name of the loader is gdal2wktraster.py. It is a Python program dependent on the Python package and the GDAL Python package. It is located in the same folder as shp2pgsql. You can call it directly or with its full path if you did not change your PATH environment variable. To display gdal2wktraster.py help, simply do: {{{ >"C:/Program Files/PostgreSQL/8.4/bin/gdal2wktraster.py" -h }}} We downloaded 13 zipped files. If you have 7-Zip installed, you can simply select all the zip files, right-click on them and select "7-Zip->Extract files..." 7-Zip will extract every TIF file in the same folder. Each file is 6000 pixels wide by 6000 pixels high. You can quickly preview the images if you have a good images viewer like !IrfanView installed. Efficient raster/vector analysis with PostGIS WKT Raster requires raster files to be split into tiles when loaded in the database. Fortunately gdal2wktraster.py has an option to tile the images the size we like. We will tile them into 100 pixels x 100 pixels resulting in 6000/100 * 6000/100 * 13 = 46800 tiles. gdal2wktraster.py also accepts wildcard so we will be able to load our 13 big rasters in one unique command. The command to load the raster files in PostGIS WKT Raster looks like: {{{ >"C:/Program Files/PostgreSQL/8.4/bin/gdal2wktraster.py" -r C:\Temp\TutData\SRTM\tif\*.tif -t srtm_tiled -s 4326 -k 100x100 -I > C:\Temp\TutData\SRTM\srtm.sql }}} The -r option indicate the series of raster we want to load in the database. You can use *, ?, and character ranges expressed with [] like in a Unix shell to match as many rasters as you wish. The -t option specify the table in which we want to load the raster coverage. The -s option, identical to shp2pgsql.exe one, is required to specify the spatial reference system ID. In this case the raster are in "WGS 84" having the SRID number 4326. Unlike some GIS software, PostGIS does not support on the fly reprojection so that we cannot do operations on table stored with different spatial reference systems. As we could see, the caribou point layer and the SRTM images are in different spatial reference systems. The points are in "NAD 83/Quebec Lambert" and SRTM are in "WGS 84". We will have to deal with this problem later. The -k option specify the size of the tiles we want to load in PostGIS. Every input raster will be split into 100x100 tiles. This dimension is a good compromise allowing efficient raster/vector analysis. It is better if the size of the tiles is a divider of the size of each raster. Otherwise the last colomns and rows of tiles of each raster will be filled with nodata values. This might have an impact on performance but not on the result since WKT Raster analysis functions ignore nodata values. The -I option tells the loader to create a spatial index on the raster tile table. The index is very important as it allow PostGIS WKT Raster to restrict his computing efforts only to the tiles involved in a spatial operation. In this tutorial for example, the intersection operations will be performed only on the tiles that actually intersects with the caribou points and it is much faster to search for those tiles if they are spatially indexed than try them one after the other sequentially in the raster table. The result of the gdal2wktraster.py command is a 1.8 GB .sql file produced in about one minute (on my brand new Lenovo X201 labtop - Intel Core i5, 1.17 GHz, 3GB of RAM :-). The same way we loaded the caribou point sql command file, we will load this sql file using "psql": {{{ >"C:/Program Files/PostgreSQL/8.4/bin/psql" -f C:\Temp\TutData\SRTM\srtm.sql tutorial01 }}} This process took less than 4 minutes. You can quickly verify the success of the loading operation by looking at the number of row present in the "srtm_tiled" table. There should be 46800 rows. If you forgot to add the -I option to gdal2wktraster.py, you can always add the index afterward by executing a SQL command similar to this one in pgAdmin III: {{{ CREATE INDEX srtm_tiled_rast_gist_idx ON srtm_tiled USING GIST (ST_ConvexHull(rast)); }}} A quick overview of the properties of the raster table (upper left corner coordinates, width and height, pixel sizes, skews - or rotations, SRID, number of band, pixel type, has nodata value, nodata value, is out db) can be displayed with the following command: {{{ SELECT (md).*, (bmd).* FROM (SELECT ST_Metadata(rast) AS md, ST_BandMetadata(rast) AS bmd FROM srtm_tiled LIMIT 1 ) foo; }}} We displayed the metadata for only one row (tile) but except for the upperleftx and upperlefty, every column should be identical for all the rows. You can then visualize the extent of each of those 46800 raster tiles by typing the following command in the OpenJUMP "Run Datastore Query" dialog: {{{ SELECT rid, ST_AsBinary(rast::geometry) FROM srtm_tiled; }}} As the rasters in the "srtm_tiled" table are in a different spatial reference system than the point you might not see then overlap with the caribou points. Right-clic on the added layer and select "Zoom To Layer" to adjust the display to their extent and see them. You can also view the complete area covered by the raster coverage as one unique geometry by doing this: {{{ SELECT ST_AsBinary(ST_Buffer(ST_Union(rast::geometry), 0.000001)) FROM srtm_tiled; }}} As you can see, unlike most raster coverage loaded into a GIS, the area covered by this single table raster coverage is not strictly rectangular and the two big missing square areas are not filled with nodata values. Here you have only one table for the whole coverage which is build from many raster files loaded in a single step. You don't need 13 tables to store 13 rasters. We did this with one band SRTM files in TIF format, but it could have been 3 bands BMP files or JPEG and the number of files and the total size of the coverage does not really matter (PostgreSQL has a limit of 32 terabytes)... If you want to go to the pixel level and verify the integrity of the values associated to a sample of the raster, you can (thanks to Jorge Arevalo and GDAL) view a vectorization of some tiles in OpenJUMP. Just type the following SQL query in the same query dialog: {{{ SELECT ST_AsBinary((ST_DumpAsPolygons(rast)).geom), (ST_DumpAsPolygons(rast)).val FROM srtm_tiled WHERE rid=34; }}} Vectorizing all the table would have been way too long and the huge amount of geometry would have been impossible to load in OpenJUMP (and most GIS software). First notice: some pixels are grouped together into complex polygons. ST_DumpAsPolygons() not only blindly dump each pixels, it actually vectorize the raster grouping areas with shared values together. In a raster with an low range of value this would result in significantly less polygones than there is pixels. Higher number of different value (like continuous floating point value rasters) would result in many, many little polygons, most of them square corresponding to only one pixel. A future ST_Reclass() unction should help reducing the number of value in a raster and give more analysis flexibility. For now you can always reclassify your raster BEFORE loading them into PostGIS WKT Raster. This query includes the value associated each group of pixel. You can display those values by right-clicking on the OpenJUMP layer, selecting "Change Styles...", selecting the "Labels" tab, "Enabling labelling" and selecting the "val" Label attribute. You will also notice that a small area in the upper left corner of the vectorized tile is missing. This is an area with nodata value pixels. Any analysis function in PostGIS WKT Raster takes the nodata value into account. This means that those areas will be omitted from the operation. For example in the upcoming intersection operation, if a buffer completely overlaps a nodata value area of one tile, st_intersects(rast, geom) will return false for this tile/buffer couple and st_intersection(rast, geom) will return an "EMPTY GEOMETRY". If the buffer partially overlap a nodata value area and intersect other parts of the tile, st_intersects(rast, geom) will return true for this tile/buffer couple and st_intersection(rast, geom) will only returns parts of the tile with values, omiting parts with nodata values. You can verify that your raster has a nodatavalues set and the effective value of the nodata value by doing this query in pgAdmin III: {{{ SELECT ST_BandHasNodataValue(rast), ST_BandNodataValue(rast) FROM srtm_tiled LIMIT 1; }}} If you want a query to ignore the nodata value and treat it like any other value of the raster, you can always replace the rast parameters with ST_SetBandHasNodataValue(rast, FALSE) like in this OpenJUMP query: {{{ SELECT ST_AsBinary((ST_DumpAsPolygons(ST_SetBandHasNodataValue(rast, FALSE))).geom), (ST_DumpAsPolygons(ST_SetBandHasNodataValue(rast, FALSE))).val FROM srtm_tiled WHERE rid=34; }}} == Producing a random point table similar to the caribou shapefile == Now that we know the extent of our elevation data, we can produce a table similar to the caribou_point one. For this we will use a function that generate a certain number of point geometry inside a polygon. Copy this function in the pgAdmin III query window and execute it: {{{ CREATE OR REPLACE FUNCTION ST_RandomPoints(geom geometry, nb int) RETURNS SETOF geometry AS $$ DECLARE pt geometry; xmin float8; xmax float8; ymin float8; ymax float8; xrange float8; yrange float8; srid int; count integer := 0; bcontains boolean := FALSE; gtype text; BEGIN SELECT ST_GeometryType(geom) INTO gtype; IF ( gtype != 'ST_Polygon' ) AND ( gtype != 'ST_MultiPolygon' ) THEN RAISE EXCEPTION 'Attempting to get random point in a non polygon geometry'; END IF; SELECT ST_XMin(geom), ST_XMax(geom), ST_YMin(geom), ST_YMax(geom), ST_SRID(geom) INTO xmin, xmax, ymin, ymax, srid; SELECT xmax - xmin, ymax - ymin INTO xrange, yrange; WHILE count < nb LOOP SELECT ST_SetSRID(ST_MakePoint(xmin + xrange * random(), ymin + yrange * random()), srid) INTO pt; SELECT ST_Contains(geom, pt) INTO bcontains; IF bcontains THEN count := count + 1; RETURN NEXT pt; END IF; END LOOP; RETURN; END; $$ LANGUAGE 'plpgsql' IMMUTABLE STRICT; }}} You can then generate a fake cariboupoints table, in the same spatial reference system (NAD 83/Quebec Lambert) and with 814 points like the original one doing this query: {{{ CREATE TABLE cariboupoints AS SELECT generate_series(1, 814) AS id, ST_Transform(ST_RandomPoints(the_geom, 814), 32198) AS the_geom FROM (SELECT ST_SetSRID(ST_Extent(rast::geometry), 4326) AS the_geom FROM srtm_tiled ) foo; }}} Similarly you can visualize your own caribou point layer with this OpenJUMP query: {{{ SELECT id, ST_AsBinary(the_geom) FROM cariboupoints; }}} This fake caribou layer is distributed very differently than our original one and does not really look like a caribou distribution but this does not matter for the rest of the tutorial. Now that we have our two coverages loaded in the database and that we are confident about their values and spatial reference, let's start doing some analysis queries... == Making buffers around the caribou points == This operation is also very straightforward for any casual PostGIS user. The caribou point table is in a spatial reference system suitable for measurements in meters so we can build our buffers in this system and then convert them to the raster one later in order to intersects them. We will make 1 km buffers around our points and save them as a new table. We do this operation in a pgAdmin III query window (not in OpenJUMP): {{{ CREATE TABLE cariboupoint_buffers AS SELECT id, ST_Buffer(the_geom, 1000) AS the_geom FROM cariboupoints; }}} Again you can display your buffers in OpenJUMP with this query: {{{ SELECT id, ST_AsBinary(the_geom) FROM cariboupoint_buffers; }}} We then reproject our buffers to WGS 84 so we can intersect them with our raster coverage: {{{ CREATE TABLE cariboupoint_buffers_wgs AS SELECT id, ST_Transform(the_geom, 4326) AS the_geom FROM cariboupoint_buffers; }}} Again the same kind of query to display them in OpenJUMP and confirm that they overlap with the raster coverage and see that now they are a bit squashed: {{{ SELECT id, ST_AsBinary(the_geom) FROM cariboupoint_buffers_wgs; }}} We could have done those two queries in one: {{{ CREATE TABLE cariboupoint_buffers_wgs AS SELECT id, ST_Transform(ST_Buffer(the_geom, 1000), 4326) AS the_geom FROM cariboupoints; }}} We then create a spatial index on this last table to make sure the next intersection operation will be done as quickly as possible: {{{ CREATE INDEX cariboupoint_buffers_wgs_geom_idx ON cariboupoint_buffers_wgs USING GIST (the_geom); }}} == Intersecting the caribou buffers with the elevation rasters == Our two coverages are now ready to be intersected. We will use the st_intersection() function and st_intersects() operator which operates directly on raster and geometries by vectorizing only the necessary part of the raster before doing the intersection and is one of the main feature of PostGIS WKT Raster: {{{ CREATE TABLE caribou_srtm_inter AS SELECT id, (st_intersection(rast, the_geom)).geom AS the_geom, (st_intersection(rast, the_geom)).val FROM cariboupoint_buffers_wgs, srtm_tiled WHERE st_intersects(rast, the_geom); }}} This query takes about 30 minutes to complete. It can greatly be optimized by invoquing the st_intersection only once with the help of a subquery producing record values that we split later in the main query like this: {{{ CREATE TABLE caribou_srtm_inter AS SELECT id, (gv).geom AS the_geom, (gv).val FROM (SELECT id, ST_Intersection(rast, the_geom) AS gv FROM srtm_tiled, cariboupoint_buffers_wgs WHERE st_intersects(rast, the_geom) ) foo; }}} This version takes only 17 minutes. Almost half the time spent by the preceding one... This little increase in complexity certainly worths the gain in performance. The size of the tiles has also a major effect here. If we restart the whole process with 60 pixels x 60 pixels tiles (60 is a divider of 6000) the intersection operation takes 9 minutes. Once again you can visualise the result of the query by doing this query in OpenJUMP (you might have to increase RAM allowed to OpenJUMP by changing "-Xmx256M" to "-Xmx1024M" in "OpenJUMPInstallPath/bin/openjump.bat"): {{{ SELECT id, val, ST_AsBinary(the_geom) FROM caribou_srtm_inter; }}} One approach to make this operation without PostGIS WKT Raster would have been to first convert our 13 rasters to 13 shapefiles, import them in PostGIS and do a traditional intersection query between geometries only. The problem is that converting only one of those SRTM file is very difficult... The vector version of only one of those files exceed the limit of 2 GB imposed on shapefiles and the GML equivalent file is more than 10 GB... The WKT Raster approach goes like this: "since it is mostly impossible to convert those raster to vector, load them all into PostGIS as they are and only the smallest part of them will be vectorized to do the requested operation", this using the very efficient GiST index, the versatility of GDAL and the power of the postGIS vector engine. == Summarizing the elevation values for each buffer and exporting the table to CSV == The last step of our analysis is to summarize the elevation for each buffer by computing 814 weighted mean elevations. "Weighted mean elevation" means that if the area occupied by pixels with an elevation equal to 234 m is greater than the area occupy pixels with an elevation equal to 56 m, then 234 has a higher weight in the equation (proportional to it relative area).This query should do the work: {{{ CREATE TABLE result01 AS SELECT id, sum(st_area(ST_Transform(the_geom, 32198)) * val) / sum(st_area(ST_Transform(the_geom, 32198))) AS meanelev FROM caribou_srtm_inter GROUP BY id ORDER BY id; }}} Finally if you want to export the resulting table to a CSV file you can do the following command in the same pgAdmin III query window: {{{ COPY result01 TO 'C:/temp/tutdata/result01.csv' WITH DELIMITER ',' CSV HEADER; }}} == Conclusion == We have used the most powerfull vector database engine around (PostGIS) to intersect our data buffers with a big raster datasets. For those who already know PostGIS, most of the queries are very familiar. For the newcomers, they might appear a bit complicated, but the speed at which we get the resulting table is worth the effort. The longuest step was to load the data. Once this is done, we can make as many quick query as we wish. PostGIS with WKT Raster is a serious tool for accomplishing serious analysis work. You can find more info on PostGIS WKT Raster [wiki:WKTRaster in this page]. All the functions are documented in the [http://postgis.refractions.net/documentation/manual-svn/RT_reference.html WKT Raster Reference].