
PostGIS Raster SoC Idea 2012 - Distance Analysis Tools for PostGIS Raster


Student:  Qing Liu


Mentor: Pierre Racine


Idea:
The idea of this proposed project is to add two spatial analysis functions to PostGIS Raster, implementing the two main ways of performing distance analysis: Euclidean distance calculation and cost-weighted distance calculation.

The Euclidean distance function will create a distance surface representing the Euclidean distance from each cell in the source layer to the starting point or to the closest source (as designated by the user). The basic concept of the algorithm is to use the center of each source cell and calculate the distance from it to the remaining cells in the raster layer, or within a user-defined extent.

The cost-weighted distance function will create a raster layer representing the accumulative cost distance from each cell to the starting point or to the closest source (as designated by the user). A cost raster layer will be generated from one or several factor layers representing the impedance of passing through each cell according to the user's criteria; the user can also weight the input factor layers to express different levels of importance. Possible factors include elevation, slope, orientation, land use / land cover type, vehicle speed, and accessibility; a binary map layer could also be used as a mask defining the geographic extent or as a filter for the output cost layer. Each cell of the cost raster represents the cost per unit distance of moving through that cell, and the output assigns to each cell the accumulated cost of reaching it from its closest source.
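
As a rough illustration of how such a weighted cost surface might be assembled with existing PostGIS 2.0 raster tools, the sketch below combines two hypothetical factor rasters (slope and landcover, with made-up weights 0.7 and 0.3) using the two-raster ST_MapAlgebraExpr() variant. All table names, column names, weights and the pixel type are illustrative assumptions, not part of the proposal.

    -- Minimal sketch (assumptions: tables slope(rast) and landcover(rast) exist,
    -- overlap, and the two-raster ST_MapAlgebraExpr() variant of PostGIS 2.0 is
    -- available). The weights 0.7/0.3 are arbitrary.
    CREATE TABLE cost_surface AS
    SELECT ST_MapAlgebraExpr(
               s.rast, l.rast,
               '0.7 * [rast1] + 0.3 * [rast2]',  -- weighted sum of the two factor layers
               '32BF',                           -- output pixel type
               'INTERSECTION'                    -- keep only the common extent
           ) AS rast
    FROM slope s
    JOIN landcover l ON ST_Intersects(s.rast, l.rast);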

GSoC 2012 Proposal:
http://google-melange.appspot.com/gsoc/proposal/review/google/gsoc2012/kya_na/1



Weekly Reports




   

Week 1 Report

What did you get done this week?

  • Compiled PostGIS 2.0 successfully.
  • Loaded raster data into PostgreSQL and practiced some queries by working through the raster/vector tutorial.
  • Wrote an analysis of how Euclidean distance and cost-weighted distance are computed in ArcGIS and GRASS.
  • Set up the wiki page.

What do you plan on doing next week?

  • Learn more about PostGIS Raster concepts and functions
  • Write an analysis of how Euclidean distance and Cost-weighted distance are computed in Oracle
  • Start to write a proposal on how to adopt the concepts and algorithms in PostGIS.

Are you blocked on anything?

  • As of now I am not blocked on anything, but working in a spatial database is new and challenging for me, so I will need to read and research more about it.
  • However, it took me some time to understand how a raster coverage is stored in PostGIS and how the raster type works.


Analysis Report

How Euclidean distance and Cost-weighted distance are computed in ArcGIS and GRASS

  • Euclidean Distance
    • ArcGIS
      • Concepts:
        The Euclidean Distance generates a raster layer that contains the measured distance from every cell to the nearest source. http://help.arcgis.com/en/arcgisdesktop/10.0/help/009z/GUID-7D94981B-1DBB-40F9-9F3F-B025C9F52BCD-web.png (Illustration from ArcGIS Online Help)
      • Function:
        EucDistance (in_source_data, {maximum_distance}, {cell_size}, {out_direction_raster})
        • in_source_data: Raster Layer / Feature Layer (vector)
          • Identifies the cells (raster) or locations (feature) to which the Euclidean distance for every output cell location is calculated.
          • If the source is a raster layer, it must contain only the values of the source cells, while other cells must be NoData.
          • If the source is a feature layer, it will internally be transformed into a raster. The resolution of the converted raster can be controlled with the output cell_size parameter.
        • maximum_distance (Optional): Double Precision
          • Defines the threshold that the accumulative distance values cannot exceed.
          • Any cell location whose accumulative Euclidean distance exceeds this value is assigned NoData in the output.
          • The default is to the edge of the output raster.
        • cell_size (Optional): Cell size of the output raster
          • Either set in the environment settings of the workspace,
          • Or determined by the type of the input source:
            • raster: same cell size as the input raster
            • feature: the shorter of the width or height of the extent of the input features, divided by 250.
        • out_direction_raster (Optional): Output raster dataset
          • The calculated direction (in degrees) each cell center is from the closest source cell center.
        • Return value: Raster Layer
          • The distance raster identifies the Euclidean distance for each cell to the closest source cell, set of source cells, or source location(s).
          • The distances are measured as Euclidean distance in the projection units of the raster, and are computed from cell center to cell center.
      • Algorithm:
        • Euclidean distance is calculated from the center of the source cell to the center of each of the surrounding cells. The algorithm works as follows: for each cell, the distance to each source cell is determined by calculating the hypotenuse of a right triangle whose other two legs are the x and y distances between the cell centers (a small worked example follows this section).
          http://help.arcgis.com/en/arcgisdesktop/10.0/help/009z/GUID-B6170E16-50FB-45FB-A724-E24251CDF305-web.gif (Illustration from ArcGIS Online Help)
        • If the shortest distance to a source is less than the specified maximum distance, the value is assigned to the cell location on the output raster.
        • If the cell is at an equal distance from two or more sources, it is assigned to the source that is encountered first in the scanning process. The user cannot control this scanning process.
        • The actual algorithm computes the information using a two-scan sequential process (the details of this process still need to be investigated). This makes the speed of the tool independent of the number of source cells, their distribution, and the maximum distance specified; the only factor that influences execution speed is the size of the raster. The computation time is linearly proportional to the number of cells in the analysis window.
      • Reference:
        http://help.arcgis.com/en/arcgisdesktop/10.0/help/index.html#//009z0000001p000000.htm
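
        A small worked example of the cell-center hypotenuse calculation described above, with a made-up offset of 3 columns and 4 rows on a hypothetical 10 m grid (plain SQL, just to show the arithmetic):

          -- Hypothetical worked example: cell center 3 columns and 4 rows away from
          -- the source cell center on a 10 m grid.
          SELECT sqrt( (3 * 10.0)^2 + (4 * 10.0)^2 ) AS euclidean_distance_m;  -- 50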


  • Euclidean Distance
    • GRASS
      • r.grow.distance, or GDAL Tools > Proximity (Raster Distance) in QGIS
      • Concepts:
        • r.grow.distance metric=euclidean (default):
          generates raster maps representing the distance to the nearest non-null cell in the input map and/or the value of the nearest non-null cell.
        • gdal_proximity generates a raster proximity map indicating the distance from the center of each pixel to the center of the nearest target pixel in the source raster.
      • Function:
        r.grow.distance [-m] input=name [distance=name] [value=name] [metric=string] [--overwrite] [--verbose] [--quiet]
        • input: Raster dataset
          • Source cell locations are represented with non-null cell value
        • distance: Output Raster dataset
          • Identifies distances for each cell to the nearest source cell(s)
        • metric:
          • The user has the option of specifying one of four different metrics, which control the geometry in which grown cells are created:
          • Euclidean (by default), Squared, Manhattan, and Maximum
        gdal_proximity.py srcfile dstfile [-srcband n] [-dstband n] [-of format] [-co name=value]* [-ot Byte/Int16/Int32/Float32/etc] [-values n,n,n] [-distunits PIXEL/GEO] [-maxdist n] [-nodata n] [-fixed-buf-val n]
        • srcfile: Raster file, identifies target pixels
        • dstfile: Output Raster file of proximity (distance)
        • -values:
          A list of target pixel values in the source image to be considered target pixels. If not specified, all non-zero pixels will be considered target pixels.
        • -maxdist:
          The maximum distance to be generated. All pixels beyond this distance will be assigned the nodata value.
      • Algorithm:
        • d(dx, dy) = sqrt(dx^2 + dy^2)
        • The grown cells form isolines of distance that are circular around a given point, with the distance given by the radius (see the sketch after this section).
      • Reference:
        http://grass.osgeo.org/manuals/html70_user/r.grow.distance.html
        http://www.gdal.org/gdal_proximity.html
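
        As a sketch of how this per-pixel proximity idea could be expressed with PostGIS primitives for a single source point (everything here is assumed for illustration: the table rast_table(rid, rast), the literal source point and its SRID; ST_PixelAsPolygons() enumerates the pixel geometries and ST_Centroid() gives their centers):

          -- Sketch: Euclidean distance from every pixel center of one raster tile
          -- to a single source point. Table, rid, point coordinates and SRID are
          -- placeholders; the point must use the same SRID as the raster.
          SELECT (p).x, (p).y,
                 ST_Distance(ST_Centroid((p).geom),
                             ST_SetSRID(ST_MakePoint(500000, 4640000), 26918)) AS dist
          FROM (SELECT ST_PixelAsPolygons(rast, 1) AS p
                FROM rast_table
                WHERE rid = 1) AS pixels;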



  • Cost-weighted Distance
    • ArcGIS
      • Concepts:
        • Calculates the least accumulative cost distance for each cell to the nearest source over a cost surface.
          http://help.arcgis.com/en/arcgisdesktop/10.0/help/009z/GUID-E922AC15-69E2-4965-A49E-28B1017C85E6-web.png
      • Function:
        CostDistance (in_source_data, in_cost_raster, {maximum_distance}, {out_backlink_raster})
        • in_source_data: Raster Layer / Feature Layer (vector)
          • Identifies the cells (raster) or locations (feature) to which the least accumulative cost distance for every output cell location is calculated.
          • If the source is a raster layer, it must contain only the values of the source cells, while other cells must be NoData.
          • If the source is a feature layer, it will internally be transformed into a raster. By default, the first valid available field will be used, otherwise the ObjectID field will be used.
        • in_cost_raster: Raster Layer
          • The value of each cell represents the cost per unit distance for moving through the cell. Each cell location value is multiplied by the cell resolution while also compensating for diagonal movement to obtain the total cost of passing through the cell.
          • If the input source data and the cost raster have different extents, the default output extent is their intersection. Alternatively, you can union the input rasters to get a cost-distance surface for the entire extent.
          • Cell locations with NoData in the Input cost raster act as barriers in the cost surface. Any cell location that is assigned NoData on the input cost raster will receive NoData on the output raster.
          • The values of the cost raster cannot be negative or zero. If the cost raster does contain values of zero:
            • If these values represent areas of lowest cost, change values of zero to a small positive value (such as 0.01) before running Cost Distance.
            • If areas with a value of zero represent barriers, these values should be turned to NoData before running Cost Distance.
        • maximum_distance (Optional): Double Precision
          • Defines the threshold that the accumulative cost values cannot exceed.
          • Any cell location whose accumulative cost distance exceeds this value is assigned NoData in the output.
          • The default is to the edge of the output raster.
          • The units are the same cost units as those of the cost raster.
        • out_backlink_raster (Optional): Raster Layer
          • Defines the direction to, or identifies, the next neighboring cell along the least accumulative cost path from a cell to its least-cost source.
          • Values 1 through 8 identify the direction to the next cell, starting from the first succeeding cell and continuing clockwise; the value 0 is reserved for source cells.
        • Return Value: Raster Layer
          • Identifies the least accumulative cost distance for each cell over a cost surface to the closest source cell, set of source cells, or source location(s).
          • The least-cost distance from a cell to a set of source locations is the minimum of the least-cost distances from that cell to each source location.
      • Algorithm:
        • The algorithm utilizes the node/link cell representation, in which each center of a cell is considered a node and each node is connected to its adjacent nodes by multiple links.
        • Every link has an impedance derived from the costs associated with the cells at each end of the link and from the direction of movement through the cells. The cost assigned to each cell represents the cost per unit distance for moving through the cell, that is the cell size multiplied by the cost value.
          • Perpendicular node cost:
            The cost of moving from cell1 to cell2 = (cost1 + cost2) / 2
          • Diagonal node cost:
            The cost of moving from cell1 to cell2 = sqrt(2) * (cost1 + cost2) / 2
            (see the worked example after this section)
        • Creating an accumulative cost-distance raster is an iterative process that attempts to identify the lowest-cost cell and add it to an output list.
          • The process begins with the source cells.
          • Accumulative cost distances are calculated for all the neighboring cells, which are activated, put into a cost cell list, and arranged from the lowest to the highest.
          • The lowest cost cell is chosen from the active accumulative cost cell list, and the value is assigned to the output cost-distance raster for that cell location.
          • The neighborhood is expanded from the lowest cost cell, the new costs are calculated, and the new cost cells are added to the active list.
          • This allocation process continues until all cells have been chosen from the active list. The result is the accumulative cost or weighted-distance raster.
          • If there are multiple zones or disconnected sets of source cells on the input source raster, the growing process continues and allocates the cheapest cost cell from the active list regardless of which source it is from.
          • No travel is allowed through cells containing NoData values. The least accumulative cost for cells on the back side of a group of NoData cells is determined by the cost to travel around these locations. If a cell location is assigned NoData on the input cost raster, NoData will be assigned to the cell location on the cost distance output raster.
      • Reference:
        http://help.arcgis.com/en/arcgisdesktop/10.0/help/index.html#/Cost_Distance/009z00000018000000/
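
        A small worked example of the node costs above, with made-up cell cost values cost1 = 2 and cost2 = 4 and a cell size of 1:

          -- Worked example of the perpendicular and diagonal move costs for
          -- hypothetical cell values cost1 = 2, cost2 = 4 (cell size 1).
          SELECT (2 + 4) / 2.0           AS perpendicular_cost,  -- 3.0
                 sqrt(2) * (2 + 4) / 2.0 AS diagonal_cost;       -- ~4.243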


  • Cost-weighted Distance
    • GRASS
      • r.cost
      • Concepts:
        • Creates a raster map showing the cumulative cost of moving to each cell on a cost surface from other user-specified cell(s) whose locations are specified by their geographic coordinate(s) or from a raster points map.
        • Each cell value in the output raster represents the lowest total cost of traversing the cost surface between that cell and the user-specified points.
      • Function:
        r.cost [-knrv] input=name output=name [start_points=name] [stop_points=name] [start_rast=name] [coordinate=x,y[,x,y,…]] [stop_coordinate=x,y[,x,y,…]] [max_cost=cost] [null_cost=null cost] [percent_memory=percent memory] [--overwrite] [--verbose] [--quiet]
        • input: raster cost map
          • Value of each grid cell represents cost information
        • output: output raster map of cumulative cost
        • start_points:
          • Point(s) with geographic coordinates (coordinate)
          • Vector points file
          • Raster map with non-null cells representing starting points (start_rast)
        • stop_points:
          • Point(s) with geographic coordinates (stop_coordinate)
          • Vector points file
        • max_cost: maximum cumulative cost
          • Execution stops accumulating costs when max_cost is reached, or once the cumulative cost to all stop points has been determined.
        • null_cost:
          • By default null cells in the input raster map are excluded from the algorithm, and thus retained in the output map.
          • The null cells in the input cost map can be assigned a positive cost with the null_cost option.
          • When input map null cells are given a cost with the null_cost option, the corresponding cells in the output map are no longer null cells. With null_cost=0.0, movement crosses null cells transparently, and those cells simply propagate the adjacent costs.
          • By using the -n flag, the null cells of the input map will be retained as null cells in the output map.
        • -k flag:
          • Use the Knight's move to improve the accuracy of the output by growing the cost-distance surface outwards in 16 directions instead of the default 8.
          • Slower to compute, but more accurate.
      • Algorithm:
        • Calculation begins with starting points.
        • Cells are put into a list, growing from the starting points to the neighboring cells in 8 or 16 directions; the cell with the lowest cumulative cost is selected from the list for computing costs to its neighboring cells.
        • The neighbors are put into the list and the originating cells are removed from it.
        • This process continues until the list is empty.
        • A binary tree with a linked list at each node is used for efficiently holding cells with identical cumulative costs.
        • As the algorithm works through the dynamic list of cells it can move almost randomly around the entire area. It divides the entire area into a number of pieces and swaps these pieces in and out of memory as needed.
      • Reference:
        http://grass.fbk.eu/gdp/html_grass64/r.cost.html

   

Week 2 Report

What did you get done this week?

What do you plan on doing next week?

  • Learn more about PostGIS Raster development
  • Write a comparison of raster data storage and manipulation in PostgreSQL and Oracle
  • Write a proposal on how to adopt the concepts and algorithms of distance calculation in PostGIS.

Are you blocked on anything?

  • It seems that Oracle Spatial does not provide distance analysis functions for GeoRaster data. Please let me know if I missed it.
    However, by reading the GeoRaster documentation, I got a better understanding of raster data storage in a spatial database. So I feel it would be helpful for me to write an analysis comparing the concepts of raster data storage and manipulation in PostGIS Raster and Oracle Spatial GeoRaster.

Comment from Mentor:

I really think we want to avoid having to produce an intermediate raster of sources. PostGIS is strong on raster/vector interaction and I really don't see why someone would prefer to provide sources as a raster. The sources should come from a table of points. Now this raises a certain number of issues:

-What if the table contains more than one point?
-What if some of those points are outside the raster extent?
-Another issue is: What if the raster I want to produce does not fit into the PostgreSQL maximum field size?

These are the kind of difficulties one encounters when working in a database.

Think about large tiled coverages, Delaunay triangulation, aggregate functions, and all the combinations of "small/very large number of points" and "small/very large raster coverage". Ideally we want to provide a solution that works for every limit case.



   

Week 3 Report

What did you get done this week?

  • Wrote a comparison of raster data storage and manipulation in PostgreSQL and Oracle
  • Wrote a proposal on how to adopt the concepts and algorithms of Euclidean distance calculation in PostGIS.

What do you plan on doing next week?

  • Write a proposal on how to adopt the concepts and algorithms of Cost-weighted distance calculation in PostGIS

Are you blocked on anything?

  • Based on the feedback I got from Pierre last week, I agree that we want to avoid having to produce an intermediate raster of the source point data. Since PostGIS Raster provides the capability of seamless vector-raster interaction, it is preferable to expect the input source point data to be a vector layer, stored in PostgreSQL as a table of points with geometry. The concepts of the distance calculation in ArcGIS and GRASS are all based on raster source data (ArcGIS will first rasterize a vector layer if necessary), so I was stuck at this point on how to avoid generating this intermediate raster layer if we want the source data to be vector points.


My Analysis Reports


Comment from Mentor:

- "Algorithm: a. Make an empty raster new_rast, using the same extent as the input vector point data source_point, set pixel value to NoData value." I don't think this is a good idea, as you would have no control over the alignment of the raster produced. I want to be able to control the dimensions and how the raster is georeferenced. I think your best friend here is ST_AsRaster(), which allows passing a geometry and some raster properties and burns the geometry into the raster. Basically ST_AsDistanceRaster() should not be so different from ST_AsRaster(); the only difference is that instead of burning the geometry into the raster, we compute a distance to the geometry for each pixel. It is very possible in this case that the geometry will be outside the extent of the requested raster. Should we want this to work for lines and polygons?
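
For reference, a minimal ST_AsRaster() call of the kind mentioned above might look like the sketch below. ST_AsRaster() has several overloads; this uses the (geometry, scalex, scaley, pixeltype, value, nodataval) form, and every literal value is an illustrative assumption.

    -- Sketch: burn a point geometry into a new raster with 10-unit pixels.
    -- Coordinates, SRID, pixel type and burned/nodata values are placeholders.
    SELECT ST_AsRaster(
               ST_SetSRID(ST_MakePoint(500000, 4640000), 26918),
               10.0, -10.0,   -- pixel size in x and y (negative y for a north-up raster)
               '8BUI',        -- pixel type
               1,             -- value burned where the geometry falls
               0              -- nodata value elsewhere
           ) AS rast;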

-You should avoid doing two passes to compute the final pixel values: one to compute the distance in pixels/cells and another to convert it into geographical distance.

-What do ArcGIS and GRASS do when there is more than one point? We need a more detailed approach for this case.

-I also want some kind of approach to the problem of billions of points. Does KNN indexing help us quickly identify the nearest point to a pixel without rescanning all the points for every pixel?

-In general, I don't think just copying what ArcGIS or GRASS do is what we want. We want a true raster/vector integrated approach, scalable to billions of points, enabling the production of very large tiled rasters. I know this is a much more difficult problem, but we want to end up with a variety of easy-to-use tools. I think this would be a much greater addition to PostGIS than just converting one raster to another raster. Think about aggregate functions or passing a geometry table name to the function. ST_AsRaster was designed to work on only one geometry, but how could it work to burn many geometries into one raster? Look at ST_UnionToRaster in the PostGIS raster working specifications. (http://trac.osgeo.org/postgis/wiki/WKTRaster/SpecificationWorking03)

-Look also at this mail: http://postgis.refractions.net/pipermail/postgis-users/2012-April/033875.html; it might be a general solution working on small and big rasters, but it might also be slower than a specific solution. We could simply implement a bunch of custom functions in C...

-Start thinking about the signature of some prototype SQL functions. Think about passing the name of a point table (a hypothetical sketch follows after this list).

-I think the constraints should be made explicit first in your document, so we know in which context we want all this to work.
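
Following the two suggestions above about prototype SQL signatures and passing the name of a point table, a hypothetical sketch for discussion (nothing below exists in PostGIS; the name ST_AsDistanceRaster comes from the comment above, and all parameters are invented):

    -- Hypothetical prototype signature only; the body is a stub.
    CREATE OR REPLACE FUNCTION ST_AsDistanceRaster(
        source_table  regclass,                       -- table holding the source points
        geom_column   text,                           -- name of its geometry column
        ref           raster,                         -- reference raster: extent, cell size, alignment
        pixeltype     text DEFAULT '32BF',
        max_distance  double precision DEFAULT NULL   -- optional cut-off, as in ArcGIS/GRASS
    ) RETURNS raster AS $$
        SELECT NULL::raster;  -- placeholder body
    $$ LANGUAGE sql;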


I think the solution to this multipoint problem is the key to the algorithm we want to implement. If there were only one point, there would be no problem. The problem is: how do I know which distance to compute (to which point) for each pixel while I iterate over them? What if the number of points is greater than the number of pixels? Should we iterate over the points in this case instead of iterating over the pixels?

The point here is what to scan first and how many times. Do they rescan the whole raster for every point? That's OK if you have a low number of points, but what if you have millions?

I think with the KNN index already built you could scan the pixels only once, assigning to each one the distance to its closest point, and that would be fast even if there are millions. Is this correct?

>> From: Pierre
>> I want also a kind of approach to the problem of billions of points. Does
>>the knn indexing help us in quickly identifying what is the nearest point to the
>>pixel without rescanning all the point for every pixels.
> From: Qing
> Yes, a KNN index will definitely help us identify the nearest point to a pixel.
> With the capability of PostGIS 2.0 to return sorted results from a GiST index, we
> can have the results ordered by the geometric distance between the pixel centroid
> and the points, and identify the shortest one without having to rescan all the
> points. However, the "order by distance between bounding box centroids" currently
> works for geometries. When it comes to point data, the bounding boxes are
> equivalent to the points themselves, so the same would apply to the pixel centers
> in our calculation. We now have a GiST index on the raster column. However, I am
> not sure how we would build a spatial index on the pixels or on the raster
> geometry.

Not sure I understand well. Could converting the pixel centroids into points and computing the KNN index on the mix of source points and converted centroids be an avenue?
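
One way to try that avenue, sketched below: assume the pixel centroids have been materialized into a table pixel_centroids(id, geom) and the source points are in sources(id, geom) with a GiST index on geom (all names are assumptions). The PostGIS 2.0 KNN operator <-> then orders the candidate points for each centroid, and for point geometries it gives the exact distance, as discussed above.

    -- Sketch: distance from each pixel centroid to its nearest source point,
    -- using the KNN GiST operator <-> in a correlated subquery.
    SELECT c.id,
           (SELECT ST_Distance(c.geom, s.geom)
            FROM sources s
            ORDER BY s.geom <-> c.geom   -- index-assisted nearest-neighbour ordering
            LIMIT 1) AS dist_to_nearest_source
    FROM pixel_centroids c;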

I think, considering all the constraints, that we might end up with different approaches depending on which constraints are involved: a simple one when few constraints are involved, a more complex one when more are.

So in your analyses you have to put into relationship: constraints, features (like the possibility to define a limited distance or to work with lines and polygons), possibilities (limits of the data structure imposed by the DB, like the fact that all the geometries are in different rows), scalability, flexibility (we want a generic solution in order to implement other kinds of distances), and performance.



   

Week 4 Report

What did you get done this week?

  • Revised the proposal for creating a distance surface, based on the comments and discussions with Pierre.

What do you plan on doing next week?

  • Revise the proposal from last week according to the feedback from the mentor
  • Write more detailed approaches according to different constraints
  • Transfer all the Google docs documents to the wiki page

Are you blocked on anything?

  • Pierre's comments and feedback are very helpful in guiding me through the problems I've encountered.


My Analysis Reports
