Opened 16 years ago
Closed 8 years ago
#2411 closed defect (fixed)
gdal_grid inefficiencies
Reported by: | pagameba | Owned by: | dron |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | Utilities | Version: | svn-trunk |
Severity: | normal | Keywords: | gdal_grid efficiency |
Cc: |
Description
This ticket is to track my experience with gdal_grid and offer some feedback on my particular use case, in which I was not able to get it to produce a DEM in a reasonable amount of time.
I am working with about 1.8 million points of XYZ data that I originally got in about 1800 separate files, each covering a small rectangular region. The data was captured at about 70m intervals or better in some cases, so I was hoping to produce a DEM at about 70m per pixel or better. For my target area, this resulted in a desired output size of about 6000 x 6000 pixels.
My initial attempt was to combine all the XYZ files into a single CSV file, set up a VRT file, and pass this to gdal_grid. When nothing happened for several hours, I started to wonder about what I had attempted. I tried alternate algorithms, each with similar results. Then I took the drastic step of looking at the code and realized that it does an exhaustive loop over all the input points for every cell of the output raster - I roughly calculated that this would take several years to run to completion on my machine, and started looking for an alternative approach.
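To put the brute-force cost in perspective, here is a back-of-the-envelope calculation using the raster size and point count from the ticket; the per-second throughput figure is purely an assumption for illustration:

```python
# Rough cost estimate for the exhaustive gdal_grid loop:
# every output cell visits every input point.
width, height = 6000, 6000        # target raster size from the ticket
n_points = 1_800_000              # input XYZ points from the ticket

ops = width * height * n_points   # point evaluations required
print(ops)                        # 64800000000000 evaluations

# Assuming ~10 million distance evaluations per second (a guess, not a
# measured figure), that is on the order of:
seconds = ops / 10_000_000
print(seconds / 86_400, "days")   # 75.0 days of pure compute
```

With a more expensive weighting function per point, or slower hardware, "several years" is entirely plausible.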
The best approach that I attempted was to work from the original data and try to use a 3 x 3 window approach feeding the XYZ values for the target file plus up to 8 surrounding files into gdal_grid to produce a 40x40 output file that would then be combined using gdal_merge into a full DEM. This ran quite a bit faster but still had not completed in 24 hours.
As a secondary issue, I took some of the produced tif files and tried to merge them. gdal_merge complained that the vertical size was negative, at which time I realized that the coordinate system of the output tifs was upside down (vertical only). I went over what I had done with Frank and all seemed reasonable - just vertically flipped output.
For my specific use case, then, gdal_grid could not do the job - fortunately I was able to do it using GRASS in only a few hours. But I believe this indicates that a somewhat different approach to gdal_grid would produce good results in a more reasonable amount of time.
The real issue, I believe, is that the algorithm uses every point from the input for every cell of the output. It seems unlikely that more than a few points around a cell will have a substantial impact on its final value. Perhaps gdal_grid could be reorganized to make a single pass over the input points, placing them into spatial bins that match small rectangular windows of the output grid - perhaps 10x10 - then execute the existing algorithms on those windows using each window plus up to 8 surrounding windows. If no data is found in (one or any of) the surrounding windows, the search region could be expanded. The size of the window could be configurable, or some heuristic could be used to calculate it from the number of input points and the size of the output raster (so that some points end up in each window while keeping the number of windows minimized to reduce memory overhead).
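The binning scheme proposed above can be sketched as follows; all names here are illustrative (mine, not from any patch on this ticket):

```python
from collections import defaultdict

def build_bins(points, x0, y0, cell_w, cell_h):
    """Single pass over the input: hash each (x, y, z) point into the
    rectangular window that contains it."""
    bins = defaultdict(list)
    for x, y, z in points:
        key = (int((x - x0) // cell_w), int((y - y0) // cell_h))
        bins[key].append((x, y, z))
    return bins

def candidates(bins, x, y, x0, y0, cell_w, cell_h):
    """Points in the window containing (x, y) plus its up-to-8
    neighbours - the search set for one output cell."""
    i = int((x - x0) // cell_w)
    j = int((y - y0) // cell_h)
    out = []
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            out.extend(bins.get((i + di, j + dj), ()))
    return out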
For your consideration.
Attachments (5)
Change History (19)
comment:1 by , 16 years ago
Cc: | added |
---|---|
Component: | default → Utilities |
Keywords: | gdal_grid added |
Version: | unspecified → 1.5.0 |
comment:2 by , 16 years ago
Cc: | removed |
---|---|
Owner: | changed from | to
comment:3 by , 16 years ago
follow-up: 5
comment:4 by , 16 years ago
Updated version of the patch based on the changes to quad tree done in r15067
by , 16 years ago
Attachment: | gdal_grid_performances_improvement.patch added |
---|
comment:5 by , 15 years ago
Keywords: | efficiency added |
---|---|
Type: | enhancement → defect |
Version: | 1.5.0 → svn-trunk |
Replying to rouault:
> Updated version of the patch based on the changes to quad tree done in r15067
I had problems getting this patch to work with CSV files (with an associated VRT), but reverting to the original Compute Grid Geometry block fixed the issue. Attached is a secondary patch which can be applied from within the gdal/apps directory to update gdal_grid.cpp with the new changes. This could probably use a second look by someone more familiar with the overall code, as I have only been focused on getting it to work, and there could be extraneous lines of code within this block.
by , 15 years ago
Attachment: | gdal_grid_improvements_patch2.patch added |
---|
Patch to be applied OVER patch(es) on this page.
comment:6 by , 15 years ago
Thumand,
After a bit of testing, I'm not sure why your patch would be necessary, but I ran into an issue when testing with VRT/CSV. In my initial attempt, the X and Y columns reported by ogrinfo on your VRT file were of string type, not real type. The consequence is that spatial filtering does not work properly: by default it relies on attribute filtering on the x and y columns of the source layer, but if the columns are of string type, the comparisons are made on strings rather than on numerical values... and "-79" > "-78" when you compare using lexicographic order.
There are many ways of fixing this:
- add useSpatialSubquery="false" at the end of the GeometryField line in the VRT, like : <GeometryField encoding="PointFromColumns" x="x" y="y" useSpatialSubquery="false"/>. This is OK as the CSV has no fast attribute select.
- or add a .CSVT file to indicate that the x and y columns are of type real
- or add a SrcSQL line in the VRT file, such as <SrcSQL>SELECT CAST(x AS float), CAST(y AS float), CAST(z as float) FROM the_name_of_the_csv_layer</SrcSQL>
Doing one of those did fix the issue I had. But your patch alone without one of the above adjustments didn't change anything.
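The lexicographic-ordering trap described above is easy to demonstrate (a standalone illustration, not code from any patch here):

```python
# When the x/y columns are typed as strings, the attribute filter
# compares character by character, not numerically:
assert "-79" > "-78"          # '-' == '-', '7' == '7', then '9' > '8'
assert -79 < -78              # numerically, of course, the reverse

# Casting to a numeric type restores the expected ordering, which is
# what the .CSVT or SrcSQL CAST(...) workarounds achieve:
assert float("-79") < float("-78")
```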
My understanding of your patch is that it is an alternate way of computing the spatial extent of the layer (and potentially much slower on formats that have fast queries for getting the spatial extent -> that's the reason why I used OGR_L_GetExtent). There's certainly something I'm missing in the way you're using gdal_grid. Could you post the exact command line you use, and ideally attach a small .csv and .vrt that demonstrate that your patch is necessary?
comment:7 by , 15 years ago
This is just a standard x,y,z CSV file with a VRT constructed as explained on the gdal_grid help page. I would argue that it is essential that this work correctly, as it is a common entry point to gdal_grid (and it worked previously). I will attach example files.
After your patch, the result from gdal_grid is:

Grid data type is "Float64"
Grid size = (256 256).
Corner coordinates = (0.000000 0.000000)-(0.000000 0.000000).
Grid cell size = (0.000000 0.000000).

Whereas the 'correct' result (from unaltered gdal_grid, or with my patch) is:

Grid data type is "Float64"
Grid size = (256 256).
Corner coordinates = (467793.173828 4786280.019531)-(471301.826172 4786269.980469).
Grid cell size = (13.652344 0.039062).
(In this case, the numbers are meaningless, as they are from this test file, which is a tiny subset of a much larger CSV file.)
I simply took out the OGR_L_GetExtent code and reinserted the 'old' gdal_grid code so that it worked for these files.
Maybe the best is to attempt OGR_L_GetExtent, and if it fails to get a result, to fall back on the original method?
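The try-then-fall-back idea suggested here can be sketched in Python; `CsvLayer`, `fast_extent`, and `iter_points` are hypothetical stand-ins of mine for an OGR layer, a non-forced OGR_L_GetExtent call, and a full feature scan:

```python
class CsvLayer:
    """Minimal stand-in for an OGR layer over a CSV/VRT source that
    cannot answer a fast extent query (illustrative only)."""
    def __init__(self, points):
        self._points = points
    def fast_extent(self):
        return None                       # no cheap answer, like CSV
    def iter_points(self):
        return iter(self._points)

def get_extent(layer):
    """Try the driver's fast extent first (analogous to calling
    OGR_L_GetExtent with bForce=FALSE); only scan every feature when
    the fast path gives nothing."""
    ext = layer.fast_extent()
    if ext is not None:
        return ext
    xs, ys = zip(*layer.iter_points())    # exhaustive fallback scan
    return (min(xs), min(ys), max(xs), max(ys))
```

(As it turned out in comment 9, the actual fix in GDAL was simpler - forcing the extent computation with bForce=TRUE - but the fallback structure above is what this comment proposes.)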
The overall performance improvement is fantastic with the patch.
comment:8 by , 15 years ago
I should add that the gdal_grid command line used for the examples above is: gdal_grid -l test -a nearest test.vrt test.tif
by , 15 years ago
Attachment: | gdal_grid_performances_improvement_v2.patch added |
---|
Patch v2 that takes into account the issue with OGR_L_GetExtent and VRT layers
comment:9 by , 15 years ago
Thumond,
thanks for your input. That helped me realize that I just had to change the bForce argument of OGR_L_GetExtent from FALSE to TRUE... I've updated my patch (v2) with only that change. So your patch is hopefully no longer necessary now.
However, to get the maximum performance out of the patch when working with an OGR file of several million points, I'd advise you to work with a spatially indexed shapefile:
ogr2ogr test.shp test.vrt
ogrinfo test.shp -sql "CREATE SPATIAL INDEX ON test"
gdal_grid -l test -a nearest test.shp test.tif
comment:10 by , 15 years ago
Arg, I lost the formatting of the commands :
ogr2ogr test.shp test.vrt
ogrinfo test.shp -sql "CREATE SPATIAL INDEX ON test"
gdal_grid -l test -a nearest test.shp test.tif
comment:12 by , 9 years ago
Even's patch has not been applied to trunk/gdal/alg/gdal_alg.h (GDAL 2.0, 2015-01-04).
comment:13 by , 9 years ago
I do not have the original data that I was working with when I first experienced this problem so I can't really comment on whether this provides an improvement in the case I was working on, but based on the patch it seems to me that it would help significantly.
comment:14 by , 8 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
Let's trust @pagameba: "based on the patch it seems to me that it would help significantly".
I'm attaching a patch that improves performance and basically implements pagameba's ideas. The patch must be applied on top of r15052, where CPLQuadTree was added.
I've tested it with a shapefile of 3601 * 3601 points to create a 3601 * 3601 pixel grid. I built the spatial index on the shapefile before running gdal_grid. I also asked for the raster to be created with tiles of dimension 256x256, as that improves performance when requesting the points. It ran on my PC in around 15 minutes.
A new option, -margin (default = 1), is added to gdal_grid. It means that for each destination pixel, we only apply the gridding algorithm to the points contained in the box [dfXPoint - nMargin * dfDeltaX, dfXPoint + nMargin * dfDeltaX] * [dfYPoint - nMargin * dfDeltaY, dfYPoint + nMargin * dfDeltaY].
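The search window described above is a simple computation; this sketch (function name mine, not from the patch) just restates the box formula:

```python
def search_box(df_x, df_y, df_delta_x, df_delta_y, n_margin=1):
    """Search window for one destination pixel under the -margin
    option: [x - m*dx, x + m*dx] x [y - m*dy, y + m*dy]."""
    return (df_x - n_margin * df_delta_x,
            df_x + n_margin * df_delta_x,
            df_y - n_margin * df_delta_y,
            df_y + n_margin * df_delta_y)
```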
The new algorithm probably requires less memory on huge OGR datasets, as we only load the features for each raster block, but the dataset must have fast spatial queries (shapefiles with .qix, PostGIS) for the algorithm to be efficient.