wiki:PostGIS_Raster_SoC_Idea_2012/Distance_Analysis_Tools

Version 69 (modified by qliu, 12 years ago) ( diff )

PostGIS Raster SoC Idea 2012 - Distance Analysis Tools for PostGIS Raster


Student:  Qing Liu


Mentor: Pierre Racine


Idea:
The idea for this proposed project is to add two spatial analysis functions to PostGIS Raster that implement two main ways of performing distance analysis: Euclidean distance calculation and cost-weighted distance calculation.

Euclidean distance function will create a distance surface representing the Euclidean distance from each cell in the source layer to the starting point or the closest source (as designated by user). The basic concepts of the algorithm would be using the center of the source cell to calculate the distance from it to the rest cells in the raster layer or within the user-defined extent.

Cost-weighted distance will create a raster layer representing the accumulative cost distance for each cell to the starting point or the closest source (as designated by user). A cost raster layer will be generated using one or several factor layers representing the impedance of passing through each cell according to user’s criteria. User can also put weights on the input factor layers to represent different levels of importance for those factors. Factors such as: elevation, slope, orientation, land use / land cover type; vehicle speed, accessibility; and a binary map layer could also be used as a mask for defining geographic extent or as a filter for the output cost layer. The accumulative cost values will be then assigned to each cell representing the cost per unit distance for moving through this cell.

GSoC 2012 Proposal:
http://google-melange.appspot.com/gsoc/proposal/review/google/gsoc2012/kya_na/1


Document Page:
http://trac.osgeo.org/postgis/wiki/PostGIS_Raster_SoC_Idea_2012/Distance_Analysis_Tools/document http://google-melange.appspot.com/gsoc/proposal/review/google/gsoc2012/kya_na/1


Weekly Reports




   
back to top

Week 1 Report

What did you get done this week?

  • Compiled PostGIS 2.0 sucessfully.
  • Loaded raster data into PostgreSQL and practiced some query by finishing the raster/vector tutorial.
  • Wrote an analysis of how Euclidean distance and Cost-weighted distance are computed in ArcGIS and GRASS.
  • Setup wiki page.

What do you plan on doing next week?

  • Learn more about PostGIS Raster concepts and functions
  • Write an analysis of how Euclidean distance and Cost-weighted distance are computed in Oracle
  • Start to write a proposal on how to adopt the concepts and algorithms in PostGIS.

Are you blocked on anything?

  • As of now I was not blocked on anything yet, but working in a spatial database is something new and challenging to me. I will need to read and reserch more about it.
  • However, it took me some time to understand how raster coverage is stored in PostGIS, and how Raster type works.


Analysis Report

How Euclidean distance and Cost-weighted distance are computed in ArcGIS and GRASS

  • Euclidean Distance
    • ArcGIS
      • Concepts:
        The Euclidean Distance generates a raster layer that contains the measured distance from every cell to the nearest source. http://help.arcgis.com/en/arcgisdesktop/10.0/help/009z/GUID-7D94981B-1DBB-40F9-9F3F-B025C9F52BCD-web.png (Illustration from ArcGIS Online Help)
      • Function:
        EucDistance (in_source_data, {maximum_distance}, {cell_size}, {out_direction_raster})
        • in_source_data: Raster Layer / Feature Layer (vector)
          • Identifies the cells(raster) or locations(feature) to which the Euclidean distance for every output cell location is calculated.
          • If the source is a raster layer, it must contain only the values of the source cells, while other cells must be NoData.
          • If the source is a feature layer, it will internally be transformed into a raster. The resolution of the converted raster can be controlled with the output cell_size parameter.
        • maximum_distance(Optional): Double Precision
          • Defines the threshold that the accumulative distance values cannot exceed.
          • Any cell location that has accumulative Euclidean distance value exceed this value will be assigned NoData for output value.
          • The default is to the edge of the output raster.
        • cell_size(Optional): Cell size of the output raster
          • Either set in the environment settings of the workspace
          • Or depends on if the input source is:
            • raster: same cell size as the input raster
            • feature: determined by the shorter of the width or height of the extent of input feature divided by 250.
        • out_direction_raster(Optional): Output raster dataset
          • The calculated direction (in degrees) each cell center is from the closest source cell center.
        • Return value: Raster Layer
          • The distance raster identifies the Euclidean distance for each cell to the closest source cell, set of source cells, or source location(s).
          • The distances are measured as Euclidean distance in the projection units of the raster, and are computed from cell center to cell center.
      • Algorithm:
        • Euclidean distance is calculated from the center of the source cell to the center of each of the surrounding cells. The Euclidean algorithm works as follows: for each cell, the distance to each source cell is determined by calculating the hypotenuse with x and y distances between cell centers as the other two legs of the triangle.
          http://help.arcgis.com/en/arcgisdesktop/10.0/help/009z/GUID-B6170E16-50FB-45FB-A724-E24251CDF305-web.gif (Illustration from ArcGIS Online Help)
        • If the shortest distance to a source is less than the specified maximum distance, the value is assigned to the cell location on the output raster.
        • If the cell is at an equal distance to two or more sources, the cell is assigned to the source that is first encountered in the scanning process. User cannot control this scanning process.
        • The actual algorithm computes the information using a two-scan sequential process(need to figure it out?). This process makes the speed of the tool independent from the number of source cells, the distribution of the source cells, and the maximum distance specified. The only factor that influences the speed with which the tool executes is the size of the raster. The computation time is linearly proportional to the number of cells in the Analysis window.
      • Reference:
        http://help.arcgis.com/en/arcgisdesktop/10.0/help/index.html#//009z0000001p000000.htm


  • Euclidean Distance
    • GRASS
      • r.grow.distance or GDAL tools\ Proximity (Raster Distance) in QGIS
      • Concepts:
        • r.grow.distance metric=euclidean (default):
          generates raster maps representing the distance to the nearest non-null cell in the input map and/or the value of the nearest non-null cell.
        • gdal_proximity generates a raster proximity map indicating the distance from the center of each pixel to the center of the nearest target pixel in the source raster.
      • Function:
        r.grow.distance [-m] input=name [distance=name] [value=name] [metric=string] [—overwrite] [—verbose] [—quiet]
        • input: Raster dataset
          • Source cell locations are represented with non-null cell value
        • distance: Output Raster dataset
          • Identifies distances for each cell to the nearest source cell(s)
        • metric:
          • User specified four different metrics which control the geometry in which grown cells are created
          • Euclidean (by default), Squared, Manhattan, and Maximum
          gdal_proximity.py srcfile dstfile [-srcband n] [-dstband n] [-of format][-co name=value]* [-ot Byte/Int16/Int32/Float32/etc][-values n,n,n] [-distunits PIXEL/GEO] [-maxdist n] [-nodata n][-fixed-buf-val n]
        • srcfile: Raster file, identifies target pixels
        • dstfile: Output Raster file of proximity (distance)
        • -values:
          A list of target pixel values in the source image to be considered target pixels. If not specified, all non-zero pixels will be considered target pixels.
        • -mxdist:
          The maximum distance to be generated. All pixels beyond this distance will be assigned the nodata value.
      • Algorithm:
        • d(dx,dy) = sqrt(dx2 + dy2)
        • Cells grown would form isolines of distance that are circular from a given point, with the distance given by the radius.
      • Reference:
        http://grass.osgeo.org/manuals/html70_user/r.grow.distance.html
        http://www.gdal.org/gdal_proximity.html



  • Cost-weighted Distance
    • ArcGIS
      • Concepts:
        • Calculates the least accumulative cost distance for each cell to the nearest source over a cost surface.
          http://help.arcgis.com/en/arcgisdesktop/10.0/help/009z/GUID-E922AC15-69E2-4965-A49E-28B1017C85E6-web.png
      • Function:
        CostDistance (in_source_data, in_cost_raster, {maximum_distance}, {out_backlink_raster})
        • in_source_data: Raster Layer / Feature Layer (vector)
          • Identifies the cells(raster) or locations(feature) to which the Euclidean distance for every output cell location is calculated.
          • If the source is a raster layer, it must contain only the values of the source cells, while other cells must be NoData.
          • If the source is a feature layer, it will internally be transformed into a raster. By default, the first valid available field will be used, otherwise the ObjectID field will be used.
        • in_cost_raster: Raster Layer
          • The value of each cell represents the cost per unit distance for moving through the cell. Each cell location value is multiplied by the cell resolution while also compensating for diagonal movement to obtain the total cost of passing through the cell.
          • If the input source data and the cost raster are different extents, the default output extent is the intersection of the two. Or you can Union the input rasters to get a cost distance surface for the entire extent.
          • Cell locations with NoData in the Input cost raster act as barriers in the cost surface. Any cell location that is assigned NoData on the input cost raster will receive NoData on the output raster.
          • The values of the cost raster cannot be negative or zero. If the cost raster does contain values of zero, and:
            • If these values represent areas of lowest cost, change values of zero to a small positive value (such as 0.01) before running Cost Distance.
            • If areas with a value of zero represent barriers, these values should be turned to NoData before running Cost Distance.
        • maximum_distance: Double Precision
          • Defines the threshold that the accumulative cost values cannot exceed.
          • Any cell location that has accumulative cost distance value exceed this value will be assigned NoData for output value.
          • The default is to the edge of the output raster.
          • The unit is specified the same cost units as those on the cost raster.
        • out_backlink_raster: Raster Layer
          • Defines the direction or identify the next neighboring cell along the least accumulative cost path from a cell to reach its least cost source.
          • The first succeeding cell from the source cell will be assigned the value 1 and 2,3,… ,8 continuing clockwise. The value 0 is reserved for source cells.
        • Return Value: Raster Layer
          • Identifies the least accumulative cost distance for each cell over a cost surface to the closest source cell, set of source cells, or source location(s).
          • The least-cost distance of a cell to a set of source locations is the lower bound of the least-cost distances from the cell to all source locations.
      • Algorithm:
        • The algorithm utilizes the node/link cell representation, in which each center of a cell is considered a node and each node is connected to its adjacent nodes by multiple links.
        • Every link has an impedance derived from the costs associated with the cells at each end of the link and from the direction of movement through the cells. The cost assigned to each cell represents the cost per unit distance for moving through the cell, that is the cell size multiplied by the cost value.
          • Perpendicular node cost:
            The cost moving from cell1 to cell2 = (cost1 + cost2) / 2
          • Diagonal node cost:
            The cost moving from cell1 to cell2 = sqrt(2)*(cost1+cost2)/2
        • Creating an accumulative cost-distance raster is an iterative process that attempts to identify the lowest-cost cell and add it to an output list.
          • The process begins with the source cells.
          • Accumulative cost distances are calculated for all the neighboring cells, activated, put into a cost cell list and arranged rom the lowest to the highest.
          • The lowest cost cell is chosen from the active accumulative cost cell list, and the value is assigned to the output cost-distance raster for that cell location.
          • The neighborhood is expanded from the lowest cost cell, the new costs are calculated, and the new cost cells are added to the active list.
          • This allocation process continues until all cells have been chosen from the active list. The result is the accumulative cost or weighted-distance raster.
          • If there are multiple zones or disconnected sets of source cells on the input source raster, the growing process continues and allocates the cheapest cost cell from the active list regardless of which source it is from.
          • No travel is allowed through cells containing NoData values. The least accumulative cost for cells on the back side of a group of NoData cells is determined by the cost to travel around these locations. If a cell location is assigned NoData on the input cost raster, NoData will be assigned to the cell location on the cost distance output raster.
      • Reference:
        http://help.arcgis.com/en/arcgisdesktop/10.0/help/index.html#/Cost_Distance/009z00000018000000/


  • Cost-weighted Distance
    • GRASS
      • r.cost
      • Concepts:
        • Creates a raster map showing the cumulative cost of moving to each cell on a cost surface from other user-specified cell(s) whose locations are specified by their geographic coordinate(s) or from a raster points map.
        • Each cell value in the output raster presents the lowest total cost of traversing the cost surface between each cell and the user-specified points.
      • Function:
        r.cost [-knrv] input=name output=name [start_points=name] [stop_points=name] [start_rast=name] [coordinate=x,y[,x,y,…]] [stop_coordinate=x,y[,x,y,…]] [max_cost=cost] [null_cost=null cost] [percent_memory=percent memory] [—overwrite] [—verbose] [—quiet]
        • input: raster cost map
          • Value of each grid cell represents cost information
        • output: output raster map of cummulative cost
        • start_points:
          • Point(s) with geographic coordinates (coordinate)
          • Vector points file
          • Raster map with non-null cells representing starting points (start_rast)
        • stop_points:
          • Point(s) with geographic coordinates (stop_coordinate)
          • Vector points file
        • max_cost: maximum cumulative cost
          • The execution will stop accumulating costs when either max_cost is reached, or one of the stop points is reached. Once the cumulative cost to all stopping points has been determined, processing stops.
        • null_cost:
          • By default null cells in the input raster map are excluded from the algorithm, and thus retained in the output map.
          • The null cells in the input cost map can be assigned a positive cost with the null_cost option.
          • When input map null cells are given a cost with the null_cost option, the corresponding cells in the output map are no longer null cells. When specified null_cost=0.0, the movement transparently cross null cells which then propagate the adjacent costs.
          • By using the -n flag, the null cells of the input map will be retained as null cells in the output map.
        • -k flag:
          • Use Knight’s move to improve the accuracy of the output by growing cost distance surface outwards in 16 directions instead of 8 directions by default.
          • Computes slower, but more accurate
      • Algorithm:
        • Calculation begins with starting points.
        • Cells (from starting points to the neighboring cells growing in 8 or 16 directions) are put into a list, on which the cell with the lowest cumulative cost is selected for computing costs to the neighboring cells.
        • Neighbors are put into the list and originating cells are removed from the list.
        • This process continues until the list is empty.
        • A binary tree with an linked list at each node in the tree is used for efficiently holding cells with identical cumulative costs.
        • As the algorithm works through the dynamic list of cells it can move almost randomly around the entire area. It divides the entire area into a number of pieces and swaps these pieces in and out of memory as needed.
      • Reference:
        http://grass.fbk.eu/gdp/html_grass64/r.cost.html

   
back to top

Week 2 Report

What did you get done this week?

What do you plan on doing next week?

  • Learn more about PostGIS Raster development
  • Write a comparison of raster data storage and manipulation in PostgreSQL and Oracle
  • Write a proposal on how to adopt the concepts and algorithms of distance calculation in PostGIS.

Are you blocked on anything?

  • It seems that Oracle Sptial dosenot provide distance analysis functions for GeoRaster data. Please let me know if I missed it.
    However, by reading documents of GeoRaster, I got a better understanding of the raster data storage in spatial database. So I feel it would be helpful for me to write an analysis to compare the concepts of raster data storage and manipulation with PostGIS Raster and Oracle Spatial GeoRaster.

Comment from Mentor:

I really think we want to avoid having to produce an intermediate raster of sources. PostGIS is strong on raster/vector interaction and I really don't see why someone would prefer to provide sources as raster. The sources should come from a table of point. Now this raise a certain number of issues:

-What if the table contains more than one point?
-What if some of those points are outside the raster extent?
-Another issue is: What if the raster I want to produce does not fit into the PostgreSQL maximum field size?

These are the kind or difficulties one encounters when working in a database.

Think about large tiled coverage, Delaunay triangulation, aggregate functions and all the combinatory of "small/very large number of point" and "small/very large raster coverage". Ideally we want to provide a solution working for every limit cases.



   
back to top

Week 3 Report

What did you get done this week?

  • Wrote a comparison of raster data storage and manipulation in PostgreSQL and Oracle
  • Wrote a proposal on how to adopt the concepts and algorithms of Euclidean distance calculation in PostGIS.

What do you plan on doing next week?

  • Write a proposal on how to adopt the concepts and algorithms of Cost-weighted distance calculation in PostGIS

Are you blocked on anything?

  • Based on the feed back I got from Pierre last week, I agreed that we want to avoid having to produce an intermediate raster of the source point data. Since PostGIS Raster provide the capability of seamless vector-raster interactions, it is preferable that we expect the input source point data to be a vector layer, which is stored in PostgreSQL as a table of points with geomegry. The concepts of the distance calculation in ArcGIS and GRASS were all based on raster source data (ArcGIS will first rasterize the vector layer if necessary). So I was stucked at this point on how to avoid generating this intermediate raster layer if we want the source data to be vector points.


Analysis Reports

Compare Raster Data Storage and Manipulation in PostgreSQL and Oracle GeoRaster

  1. Raster Data Storage:
    • PostGIS Raster
      • Stored as a single type (Raster) in the table
      • One table = One Coverage;
        One table row = One Raster (Tile)
      • Raster stored in database:
        https://lh6.googleusercontent.com/-pX0HwnZvesQ/T-AKwmL6-iI/AAAAAAAAA4M/nS97sBHHQvA/s0-d/Compare%2BRaster%2BData%2BStorage%2Band%2BManipulation%2Bin%2BPostgreSQL%2Band%2BOracle%2BGeoRaster.jpg
    • Oracle GeoRaster
      • Stored as two related types in different tables
        • SDO_GEORASTER table: Images
        • SDO_RASTER table: Tiles
        • Raster stored in two database tables:
          https://lh6.googleusercontent.com/-CR2KxwqC4q8/T-AKwg0tzCI/AAAAAAAAA4Q/ObDWA9Y52FI/s532/Compare+Raster+Data+Storage+and+Manipulation+in+PostgreSQL+and+Oracle+GeoRaster%281%29.jpg
  2. Georeferencing:
    • ostGIS Raster:
      • Embedded in raster header
      • Each raster (tile) is geo-referenced
    • Oracle GeoRaster:
      • Embedded as a metadata component of the GeoRaster object
      • Only SDO_RASTER is geo-referenced
  3. Spatial Indexing:
    • PostGIS Raster:
      • GiST index on raster column
    • Oracle GeoRaster:
  4. Pyramids:
    • PostGIS Raster:
      • Pyramids generated on loading time
      • Pyramids stored in different table than the raster data
    • Oracle GeoRaster:
      • Reduced resolution versions of raster generated by resampling
      • Pyramids stored in the same raster table for GeoRaster object
  5. Loading Data:
    • PostGIS Raster:
      • All GDAL accepted formats
      • Easy
    • Oracle GeoRaster:
      • Few formats accepted
      • Hard
  6. Raster Data Implementaion (take vector-raster intersection for example):
    • PostGIS Raster:
      • Implement seamless vector-raster analyses
      • Do analysis independently of the data presentation
      • Vector-Raster intersection:
      • Really intersect Vector Data with Raster Data
      • Raster data is first polygonized to be intersected with Vector data
    • Oracle GeoRaster:
      • Was primarily designed for raster data storage not for anlysis
      • Vector-Raster intersection:
      • Intersect Vector data with MBR of raster data not with the raster data itself
      • Process is faster though

Generate Euclidean Distance Surface in PostGIS (Proposal) (first working on a single raster row)

  • Concepts:
    • We first think of working on a single raster row (a single tile) containing only one band and then maybe apply it on a tiled raster coverage with multi-bands. We first assume the source point(s) is/are within the raster extent.
    • Generate a raster tile in which the value for each pixel represents the Euclidean distance from the pixel to the nearest source point.
    • Euclidean distance is calculated from the center of the source pixel to the center of each of the surrounding pixels.
  • What should be the input source data?
    • ArcGIS accepts both raster and vector layers as input source data, but will first transform the source layer into raster layer if it is in the vector format; Thus, it produces an intermediate raster of sources.
    • Grass and GDAL proximity tools accept only raster layer as the input source data.
    • We want to avoid having to produce an intermediate raster of sources, since PostGIS implements seamless vector-raster interactions.
    • Thus, we expect input source data to be a vector point layer, which is stored as a table of point with geometries.
  • What should be the output result?
    • Results are stored as pixel values in the raster
  • Algorithms:
    1. Make an empty raster new_rast, using the same extent as the input vector point data source_point, set pixel value to NoData value.
      Comment from Mentor:

      I don't think this is a good idea as you would have no control on the alignment of the raster produced. I want to be able to control the dimensions and how the raster is georeferenced. I think your best friend here is ST_AsRaster() which allows passing a geometry and some raster properties and burns the geometry in the raster. Basically ST_AsDistanceRaster() should not be so different from ST_AsRaster(), the only difference is that instead of burning the geometry in the raster, we compute a distance to the geometry for each pixel. It is very possible in this case that the geometry will be outside the extent of the requester raster. Should we want this to work for lines and polygons?

    2. Utilize ST_SetValue(raster rast, geometry pt, double newvalue) function to designate source pixels in the resulting raster where vector source point(s) intersect(s) with the new raster.We want to assign “0” as new pixel value to pixels as source pixel(point) since the Euclidean distance from source to itself is zero.
    3. Calculate Euclidean distance from the center of source pixel(s) to each pixel in the raster:
      d(dx,dy) = sqrt(dx2 + dy2)
      Unit is pixel/cell
    4. Set pixel value to the resulted distance
      Need to consider scan algorithm in case of more than one point in the source data
    5. Utilize ST_MapAlgebraExpr() function to multiply resulting distance raster with pixel size to get the real geographic distance in the unit assigned in the georeferencing info.
      Comment from Mentor:

      You should avoid to do two passes to compute the final pixel values: one to compute the distance in pixel/cell and another to convert into the geographical distance.

  • Issues to be considered:
    • Scanning method in case of source point data containing more than one source point:
      • Shortest distance will be assigned to resulted raster
      • In case there is an equal distance to more than one source, the pixel is assigned to the source that is first encountered in the scanning process
    • Specify maximum distance as the threshold that the accumulative distance values cannot exceed
      • Any pixel that has accumulative Euclidean distance value exceed this value will be assigned NoData for output value.
      • The default threshold is to the edge of the output raster.
    • How to apply to multi-band tiled raster coverage
      • Some of the source points are outside the raster extent
    • Raster storage in PostgreSQL
      • The resulting raster does not fit into the PostgreSQL maximum field size

Comment from Mentor:

-What do ArcGIS and GRASS do when there is more than one point? We need a more detailed approach for this case.

-I want also a kind of approach to the problem of billions of points. Does the knn indexing help us in quickly identifying what is the nearest point to the pixel without rescanning all the point for every pixels.

-In general I don't think just copying what ArcGIS or GRASS used to do is what we want. We want a true raster/vector integrated approach, scalable to billions of points, enabling the production of very large tiled rasters. I know this is a much more difficult problem but we want to end up with a variety of easy to use tools. I think this would be a much greater addition to PostGIS than just converting one raster to another raster. Think about aggregate functions or passing a geometry table name to the function. ST_AsRaster was done to work on only one geometry but how could it work to burn many geometries into one raster? Look at ST_UnionToRaster in the PostGIS raster working specifications. (http://trac.osgeo.org/postgis/wiki/WKTRaster/SpecificationWorking03)

-Look also at this mail: http://postgis.refractions.net/pipermail/postgis-users/2012-April/033875.html it might be a general solution working on small and big raster but might be slower than a specific solution. We could simply implement a bunch of custom function in C...

-Start thinking about the signature of some prototype SQL functions. Think about passing the name of a point table

-I think the constraints should be explicited first in your document so we know in which context we want all this to work.


I think the solution to this multipoint problem is the key to the algorithm we want to implement. There would be only one point, there would be no problem. The problem is: How do I know which distance to compute (to which point) for each pixel while I iterate on them. What if the number of point is greater than the number of pixel? Should we iterate on the point in this case instead of iterating on the pixels?

The point here is what to scan first and how many times? Do they rescan the whole raster for every points? That's ok if you have a low number of points but what if you have millions?

I think with the knn index already built you could scan pixels only once assigning the distance to the closest point. and that would be fast even if there are millions. Is this correct?

Pierre Wrote:
>> I want also a kind of approach to the problem of billions of points. Does
>>the knn indexing help us in quickly identifying what is the nearest point to the
>>pixel without rescanning all the point for every pixels.
Qing Wrote:
> Yes, KNN index will definitely help us to identify the nearest point to the pixel.
> With the capability of PostGIS 2.0 to return sorted results from a GiST index, we
> can have the results ordered by the geometry distance between pixel centroid
> and identify the shortest one without having to rescan all the points. However,
> the "order by distance between bounding box centroid" works now for
> geometries. When it comes to point data, the bounding boxes are equivalent to
> the points themselves. So the same would apply to the center of pixel in our
> calculation. We now have a GiST index over the raster column. However, I am
> not sure about how we build spatial index embedded in the pixels or the raster
> geometry.

Not sure I understand well. Could converting pixel centroids into points and computing the knn index on the mix of points and centroids converted to points be an avenue?

I think, considering all the constraints, that we might have to end up with different approaches depending on different constraints. A simple one if some constraints are involved, a more complex one if more constraints are involved.

So in your analyses you have to put into relationship: constraints, features (like the possibility to define a limited distance or the possibility to work with line and polygons), possibilities (limits of the data structure imposed by the DB like the fact that all the geometries are on a different row), scalability, flexibility (we want a generic solution to implement other kind of distances) and performance.



   
back to top

Week 4 Report

What did you get done this week?

  • Revised proposal for creating distance surface based on the comments and discussions with Pierre.

What do you plan on doing next week?

  • Revise the proposal from last week according to the feedback from the mentor
  • Write more detailed approaches according to different constraints
  • Transfer all the Google docs documents to the wiki page

Are you blocked on anything?

  • Pierre's comments and feedbacks are very helpful in guiding me through the problems I've encountered.


Analysis Report

Revised Proposal of Creating Distance Surface in PostGIS Raster

  • Case 1 - One-Point Source
    To create distance surface from source data containing only one point geometry as the source point.
    • Inputs:
      • geometry - One single PostGIS geometry to create distance surface from
      • geographic extent - Dimension of the output raster
      • georeferencing - Information of how the raster is georeferenced
      • maximum distance - The threshold that the accumulative distance values cannot exceed. When exceeding this value, NoData will be assigned for the output pixel value. The default threshold is to the edge of the output raster extent.
    • Output:
      Generate one single raster (rectan) in which pixel values represent the Euclidean distance from the center of the pixels to the center of the source point.
    • Proposed Algorithm:
      • Calculate Euclidean distance using the coordinates of the center of pixel and the center of the source point (d(dx,dy) = sqrt(dx2 + dy2)). If distance exceed max_distance specified, NoData value is used instead.
      • Create a single raster with distance as pixel value (ST_RasterFromText()?) using dimensions and georeferencing information specified
  • Case 2 - Multi-point Source
    To create distance surface from source data containing multiple point geometries as the source points.
    • Inputs:
      • geometry - Multiple PostGIS geometries to create distance surface from
      • geographic extent - Dimension of the output raster
      • georeferencing - Information of how the raster is georeferenced
      • maximum distance - The threshold that the accumulative distance values cannot exceed. When exceeding this value, NoData will be assigned for the output pixel value. The default threshold is to the edge of the output raster extent.
    • Output:
      Generate one single raster (rectan) in which pixel values represent the Euclidean distance from the center of the pixels to the center of its nearest source point.
    • Proposed Algorithm:
      • Scan through pixels, compute the KNN index for each pixel to its nearest source point, assign Euclidean distance using the coordinates of the center of the pixel and the center of its nearest source point (d(dx,dy) = sqrt(dx2 + dy2)). If distance exceed max_distance specified, NoData value is used instead.
      • Create a single raster with distance as pixel value (ST_RasterFromText()?) using dimensions and georeferencing information specified
    • Scalability:
      • In case of source data being large size dataset (e.g. containing billions of points), the KNN indexing approach should be able to help us in quickly identifying the nearest point to the pixel without rescanning all the points for every pixel.
    • Possibility:
      • It is possible to work with line geometries and polygon geometries utilizing the algorithms for the multi-point solution
        • distance will be computed from the center of each pixel to the center of its nearest pixel on the line or polygon
          • In case of single line or polygon geometry, the default output raster extent will be the same as the source line or polygon.
          • Note that pixels within the polygon will be assigned “0” distance or NoData value
    • Performance:
      • The speed depends on the number of source points, as well as the size of the produced raster.
      • By utilizing KNN indexing, we expect the speed to be independent from the number of the source points.
      • There are limits of the data structure imposed by the database needed to be considered like the fact that all the geometries are stored in different rows.
    • Flexibility:
      • The solution for calculating Euclidean distance can be implemented for other kinds of distances (e.g. cost-weighted distance). Euclidean distance will be used as the base layer. The cost-weighted distance could be calculated from the Euclidean distance surface with the cost surface. ST_MapAlgebraExpr two raster bands version could be utilized for computing.

Comment from Mentor:

I find the constraints are not really well stated as a first part of the document. Could you list those constraints, right here, so we agree on them?
Also:
To produce a distance raster we need to know which point is the nearest from the pixel centroid.
To produce a more generic interpolation surface (as we would like to be able to do), which points do we need to know? Only the three nearest points? The three nearest points forming a triangle encompassing the pixel centroid? More points?
Maybe those two problems can resume to one if we would build a Delauney triangulation with the point before trying to produce any surface. So the process would be two steps: 1) Build a Delauney surface 2) Compute some metric for each pixel centroid.


Comment from PostGIS-Devel:

Paul Ramsey Wrote:
On the algorithm side, the first case is "OK", though of limited value I would imagine (most use cases will involve multiple inputs, and not just points).
On the multiple inputs side, you really have to grapple with the fact that people will be starting from many different kinds of geometry, so your input is best not thought of as "a multipoint" but "a grid of starting locations", which could be derived from any kind of geometry.
I also note that the distance grid is actually just a specialization of a generalized cost surface, where grid values are calculated as using input of a starting point grid and a friction grid. In the case of the distance grid you are just assuming a uniform friction grid, but for maximum utility you might want to consider doing a cost surface instead.
The algorithm is much less parallelizable, because you have to work out from the start points, however it's a nice little recursive function. From each start point you calculate the cost of the neighbors as a product of the friction of the neighbor and the distance to the neighbor (one unit in left-right-up-down directions, root-2 units in the diagonals). If a cell already has a value, skip it. You can use a threshold as in your algorithm as a stopping condition.
With a friction grid of 1, this nets out to the distance grid.
Start only from points seems fundamentally limiting for no good reason. The added "precision" you get by working directly against the points seems pretty pointless for most grids. You'll get a lot of users forced to convert their input geometries into point sets before starting, and then their question will be "why is it slow" and you'll say "because you have too many points, then them" and they'll say "how" and you'll say "snap them to a grid basis" and at that point they might as well have rasterized anyways.
I still think a generic cost calculator would be more useful than a single-purpose distance calculator.



   
back to top

Week 5 Report

What did you get done this week? & Are you blocked on anything?

  • Posted project to postgis-devel hoping to get feedback and suggestions from the community
  • Discussed with mentor and people from postgis-devel list (see comments part from Week 4 Report). Found myself got blocked at this point:

As to the source data, I think I would agree that we don't really want to rasterize the source geometries because we want to utilize the vector/raster seamless interactions featured and provided by PostGIS Raster. The solution I could think of was to calculate the distance based on the coordinates of the source geometries and the coordinates of the center of the resulted pixel derived from georeferencing info and then create a raster with those distance values as the pixel values.

Now with the feedback from the mentor and the list, I think I got confused about whether we want to avoid creating intermediate raster layer from rasterizing source geometries or not.

I hope I can work this out as soon as possible with the help from the mentor and to proceed on next steps.


   
back to top

Week 6 Report

What did you get done this week?

  • Discussed with mentor,
  • Came up with a preliminary approach briefly discribed as below:

Based on the discussion, I think the way that ST_MapAlgebraFct() works could be the way for computing Euclidean distance and interpolation in terms of creating a new raster based on a SQL function. And the function should be already set like returning the hypotenuse value for Euclidean distance and more complicated functions for interpolation according to different interpolation methods. The problem is to determine to which pixel (the nearest neighbour) the function (formula) should be applied for the current pixel. As we agreed we want to use KNN indexing for this problem.

I come up with an approach which I think should work for a single raster situation where the source points are within the geographic extent of the result raster: First create an empty raster based on the source data (a table of points) as the "container" for the resulted Euclidean distance raster. The raster will have the same georeference as the source data points; The SQL function could be like:

CREATE FUNCTION euclidean_fct(pixel float, pos integer[], source geometry, variadic args text[]) RETURNS float AS $$ BEGIN

SELECT ST_Distance(ST_Point(pos), source) as eucliean_dis ORDER BY $1 ↔ $2 LIMIT 1; Return euclidean_dis;

END; $$ LANGUAGE 'plpgsql';

Then we can just UPDATE new_rast SET rast = ST_MapAlgebraFct(rast,NULL,'euclidean_fct(float,integer[],geometry,text[])'::regprocedure);


Comments from the Mentor:

Why would the point have to be in the raster extent if what you pass to the custom function is the name of a geometry table (or view)?
You don't pass a raster representing the point. How would you do that if you have many point and this is not in accordance with the idea of not converting points to raster.
You pass a raster description based on parameter (similar to ST_AsRaster) or an existing raster and the name of a table (with the schema) and a geometry column (much like in ST_Count but passing a geometry column instead of a raster column) containing the points you want to compute the distance from for every pixels. Those points do not have to be in the raster extent and there can be as many as you want since you will be using the KNN index to find the nearest one.
Generally you won't want to UPDATE an existing raster but rather create a new one based on an existing raster or even better an existing raster coverage. More something like:

CREATE TABLE distcoverage AS
SELECT rid, ST_MapalgebraFct(rast, 'euclidean_fct(list of parameters including the table name and column name of the geometry table)'::regprocedure) rast
FROM myexistingcoveragetable

We should afet encapsulate the ST_Mapalgebra function with a name and parameter list corresponding to a series of ST_EucledianDistance functions (some taking raster creation parameter, some taking a reference raster).


Not sure whether I can pass in a geometry in the function above. The ways that raster and point geometry works are logically pretty similar. Like we can make point geometry from a given coordinate (x,y) pair, pixels in a raster are also organized in a relative coordinate system with the defined height and width of the raster in which the (x,y) pair of coordinate indicates the relative position of this pixel in the raster. My thoughts are when we create the result euclidean raster from the source table of points with the same extent, the points geometry can be then projected into this relative coordinate system of the raster, so that the position of a pixel and the relatively projected coordinate of a point geometry could be exchangeable with or equivalent to each other.

But there will be situation where the source data is like billions of points and the extent would be too large to create a single raster (and the computation will be very inefficient). Then we will have to consider approaches for a tiled raster as the resulted Euclidean surface for the input source.


Comments from the Mentor:

So the trick is to pass a geometry table name instead. That solve the issue of the point having to be in the raster extent (they don't have to). that also make it possible to create a tiled raster coverage perfectly aligned to an existing one...
That approach can be reused to compute interpolation. But is not the way to go for cost distance which is more complicated.


What do you plan on doing next week?


Commented and Suggested by the Mentor:

Write a document summarizing the approach above including, in order:
1- Objective
2-The list of constraints
3-At least three approaches (converting point to raster, creating a TIN first, the approach described here) and the pros and cons for each approaches in relation with the constraints
4-Your choice of approach and the reason why
5-A series of function signature offering flexibility to the user


Document is here: https://docs.google.com/document/d/1GAnvNDHhwUKoCOeY5GIfnAaTqya9I_HOEdotQleExCY/edit


   
back to top

Week 7 Report

What did you get done this week?

  • Disscussed with the Mentor, Came up with a final document of "Creating an Euclidean Distance Raster in PostGIS" with the link:

http://trac.osgeo.org/postgis/wiki/PostGIS_Raster_SoC_Idea_2012/Distance_Analysis_Tools/document

What do you plan on doing next week?

  • Start to work on implementing a plpgsql prototype of the preferred solution in the document
Note: See TracWiki for help on using the wiki.