Opened 11 years ago

Closed 5 years ago

#5130 closed defect (wontfix)

[PATCH] useful speed boost + extra feature for gdal_calc.py

Reported by: gbell
Owned by: hobu
Priority: normal
Milestone: closed_because_of_github_migration
Component: SWIG (all bindings)
Version: 1.10.0
Severity: normal
Keywords: gdal_calc speed nodata
Cc: etourigny

Description

Hi all,

gdal_calc.py is a fantastic little utility. I'd like to offer a tiny patch that usefully improves it. A new option is provided, in the form of "gdal_calc.py --ManualNoData".

-- Why?

When working with gdal_calc recently, I wanted to handle nodata values in a very specific way in my calculation function depending on which input rasters have nodata. This wasn't possible with the normal gdal_calc.py, so I added a simple non-default alternative execution path and a 'ManualNoData' command line option.

-- When is it beneficial to use this option?

  • when you want to handle nodata manually, e.g. treating nodata explicitly in your calculation function or in post-processing.
  • when you know there aren't any nodata values in the data and you want better performance.
  • when you don't care about the treatment of nodata values and you want better performance.

-- Benefits?

  • e.g. up to 33% better execution times in the alternative execution path.
  • full manual control of nodata value handling.
  • totally optional - there should be no regressions or impact on existing users.

In my own use case, I've seen speed improvements around 10-15% generally, which saves me e.g. 10-15 minutes CPU time for the large geotiffs I'm working with. I also now have the ability to treat nodata explicitly in a specific way in my calculation function, which is important for me.

-- Example of potential gain, using http://download.osgeo.org/geotiff/samples/usgs/ilatlon.tif :

WITHOUT:
time ./gdal_calc.py -A ilatlon_float.tif --calc="A-1" --outfile test.tif
real 0m0.318s

WITH:
time ./gdal_calc.py -A ilatlon_float.tif --calc="A-1" --outfile test.tif --ManualNoData
real 0m0.212s

Gain: a 50% speedup, or 33% less time, depending on how you like your fractions.

The patch is quite straightforward, and is attached below. I hope you'll consider including it in the standard GDAL distribution of gdal_calc.py.

Graeme Bell.

p.s. Permission has been obtained from my employer to share the code under the MIT License. It was produced as part of a project for the Norwegian Forest and Landscape Institute.

PATCH

Index: swig/python/scripts/gdal_calc.py
===================================================================
--- swig/python/scripts/gdal_calc.py    (revision 26098)
+++ swig/python/scripts/gdal_calc.py    (working copy)
@@ -224,8 +224,9 @@
             myY=Y*myBlockSize[1]

             # create empty buffer to mark where nodata occurs
-            myNDVs=numpy.zeros(myBufSize)
-            myNDVs.shape=(nYValid,nXValid)
+            if not opts.ManualNoData:
+                myNDVs=numpy.zeros(myBufSize)
+                myNDVs.shape=(nYValid,nXValid)

             # fetch data for each input layer
             for i,Alpha in enumerate(myAlphaList):
@@ -236,7 +237,8 @@
                                           win_xsize=nXValid, win_ysize=nYValid)

                 # fill in nodata values
-                myNDVs=1*numpy.logical_or(myNDVs==1, myval==myNDV[i])
+                if not opts.ManualNoData:
+                    myNDVs=1*numpy.logical_or(myNDVs==1, myval==myNDV[i])

                 # create an array of values for this block
                 exec("%s=myval" %Alpha)
@@ -252,7 +254,8 @@
             # propogate nodata values
             # (set nodata cells to zero then add nodata value to these cells)
-            myResult = ((1*(myNDVs==0))*myResult) + (myOutNDV*myNDVs)
+            if not opts.ManualNoData:
+                myResult = ((1*(myNDVs==0))*myResult) + (myOutNDV*myNDVs)

             # write data block to the output file
             BandWriteArray(myOutB, myResult, xoff=myX, yoff=myY)
@@ -276,6 +279,7 @@
     parser.add_option("--outfile", dest="outF", default='gdal_calc.tif', help="output file to generate or fill.")
     parser.add_option("--NoDataValue", dest="NoDataValue", type=float, help="set output nodatavalue (Defaults to datatype specific values)")
+    parser.add_option("--ManualNoData", dest="ManualNoData", action="store_true", help="manage nodata values manually in your calculation formula, can provide a speed improvement")
     parser.add_option("--type", dest="type", help="set datatype must be one of %s" % list(DefaultNDVLookup.keys()))
     parser.add_option("--format", dest="format", default="GTiff", help="GDAL format for output file (default 'GTiff')")
     parser.add_option(

Attachments (2)

gbb_gdal_calc.patch (2.2 KB ) - added by gbell 11 years ago.
patchfile (got lost on original submission)
gbb_gdal_calc.diff (2.2 KB ) - added by gbell 11 years ago.
svn diff patchfile with correct suffix.


Change History (19)

by gbell, 11 years ago

Attachment: gbb_gdal_calc.patch added

patchfile (got lost on original submission)

by gbell, 11 years ago

Attachment: gbb_gdal_calc.diff added

svn diff patchfile with correct suffix.

comment:1 by gbell, 10 years ago

Type: enhancement → defect

I am changing this to a defect now because, although the patch is an enhancement (speed/flexibility), it also fixes a defective/quirky behaviour.

Today I encountered someone who had been using gdal_calc and could not understand its output. They had been planning to take two files (A) and (B) and then use B to mask A.

They typed A*(A>0) as the mask by mistake instead of A*(B>0), and were completely confused about why the output was being correctly masked by B when B wasn't mentioned in their buggy calculation formula. "How did it know what I meant to write?"

The reason for this strange output is that gdal_calc currently makes the output nodata automatically if *any* of the inputs has a nodata value, even if those inputs are not actually used in the calculation. This is not obvious from the documentation, and it took me 15 minutes to guess what was happening despite being the guy who wrote the patch for this very bug/quirk... in this case it was a fortunate quirk, but it could just as easily have been unfortunate.
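To make the behaviour concrete, here is a minimal numpy sketch of the accumulation the default handler performs (simplified from gdal_calc.py's internals; the -9999 nodata value and the tiny arrays are invented for illustration):

import numpy

# Nodata masks from ALL opened inputs are OR-ed together,
# whether or not the input appears in the --calc formula.
A = numpy.array([1.0, 2.0, 3.0])
B = numpy.array([4.0, -9999.0, 6.0])    # -9999 stands in for B's nodata value

myNDVs = numpy.zeros(A.shape)
for band, ndv in ((A, -9999.0), (B, -9999.0)):
    myNDVs = 1 * numpy.logical_or(myNDVs == 1, band == ndv)

myResult = A - 1                        # the formula mentions only A...
myResult = ((1 * (myNDVs == 0)) * myResult) + (-9999.0 * myNDVs)
print(myResult)                         # ...yet cell 1 is nodata because of B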

comment:2 by Even Rouault, 10 years ago

Cc: etourigny added

Etienne, perhaps you want to take care of this ?

comment:3 by etourigny, 10 years ago

sorry, I'm very busy these days/months...

comment:4 by etourigny, 10 years ago

although it seems the patch is simple, I'll give some thoughts on it.

The argument name is rather confusing; I would rather call it IgnoreNoData.

Also, I wonder what happens to nodata values when you use this flag; it seems they get ignored completely? What goes in place of the nodata values?

Also, how would you specify nodata values in the formula (as the name of the flag implies)? Just set certain values to a given value, then use the --NoDataValue to set that value as nodata?

It would be nice if you could provide a new test (in autotest/pyscripts/test_gdal_calc.py) to exercise this new functionality and make sure it doesn't add any regressions.

comment:5 by gbell, 10 years ago

> The argument name is rather confusing; I would rather call it IgnoreNoData.

I think that suggestion would be a really bad idea, as it would be very misleading to users. The flag does not allow the user to ignore nodata, nor does it cause gdal_calc to 'ignore' nodata cells; it simply requires and enables you to handle nodata combinations manually. The name was chosen to indicate this.

> Also, I wonder what happens to nodata values when you use this flag; it seems they get ignored completely? What goes in place of the nodata values?

gdal_calc has a pessimistic approach: it automatically writes out a nodata value if any of the input rasters contains a nodata value. This is super frustrating if you want to cover these cases (e.g. if there is nodata in raster C, I don't want to just give up, I want to base my calculation on A, B, D and E instead). That behaviour cannot be avoided without this patch. The standard behaviour actually introduces subtle bugs that users generally only detect when examining their data and trying to work out why patches are always missing (so far, all of our local users of gdal_calc have gone through this 'bug learning phase').

So I am noting that, for many common problems, gdal_calc's current behaviour is incredibly stupid in forcing a single response to the occurrence of nodata anywhere, and this patch corrects it by giving the user the option to specify manually what to do when nodata occurs, so they can replace the stupid behaviour with smart behaviour.

So it's not about ignoring nodata; it's about not silently carrying out a really annoying and unhelpful default action which ignores your gdal_calc formula.

> Also, how would you specify nodata values in the formula (as the name of the flag implies)? Just set certain values to a given value, then use the --NoDataValue to set that value as nodata?

The way we currently use it locally is to specify how to handle the nodata value in the function. This allows us to create different behaviours very easily when e.g. there is a nodata in field A, or field D, or in all fields, or two fields, etc.
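For illustration, a hypothetical invocation (the file names and the -9999 nodata value are made up) with the nodata branch written directly into the formula:

gdal_calc.py -A a.tif -B b.tif --ManualNoData --calc="(A==-9999)*B + (A!=-9999)*(A+B)" --NoDataValue=-9999 --outfile out.tif

Here the output uses B alone wherever A is nodata, and A+B everywhere else, instead of the default blanket nodata result.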

> It would be nice if you could provide a new test (in autotest/pyscripts/test_gdal_calc.py) to exercise this new functionality and make sure it doesn't add any regressions.

If I have time, I will look into this. However, considering that the present behaviour is undocumented and seemingly bizarre/inexplicable/frustrating for users (as per my example), I think you should look into addressing it asap.

The patch is in production use for national government dataset production here in Norway and has been working perfectly for us since submission 9 months ago.

Graeme.

comment:6 by gbell, 10 years ago

This patch has been waiting 14 months for review/inclusion. It's a change of only a couple of lines, and it fixes a pretty annoying/strange/poorly documented behaviour in gdal_calc.py (I've seen the bug/behaviour affect 2 projects locally already, and I doubt we're the only ones affected).

Does anyone have time to review it or could it be added to a milestone list so that it doesn't get lost in the void?

comment:7 by Even Rouault, 10 years ago

Summary: useful speed boost + extra feature for gdal_calc.py → [PATCH] useful speed boost + extra feature for gdal_calc.py

Graeme,

Etienne did review your patch, but you didn't take his suggestions into account. For what it's worth, I also support his suggestions:

  • to name the option IgnoreNoData (I'm not an expert on gdal_calc.py, but my understanding of your code is that it makes gdal_calc's internal algorithm effectively ignore nodata values).
  • to test the new option in gdal_calc.py

I also think it would be good if you could enhance the documentation of gdal_calc.dox to describe the behaviour without and with the option. The current behaviour looks reasonable to me as a default one. Seems a bit extreme to call it "stupid"...

comment:8 by gbell, 10 years ago

> Etienne did review your patch, but you didn't take his suggestions into account.

Hi rouault, thanks for replying. I did take into account his suggestions; the problem is that they're factually wrong... as is, unfortunately, this comment you've just made: "my understanding of your code is that it makes gdal_calc's internal algorithm effectively ignore nodata values".

*No*. That is the *opposite* of what it does.

Please read it more carefully after looking at the examples I give below.

The patch disables the *broken* default nodata handler in the algorithm. That does not mean that nodata is ignored. The new code means that complete or partial nodata in the inputs is now handled in the central part of the gdal_calc internal algorithm instead of being skipped/ignored by the (not well-implemented) default handler outside. That means that there is much more flexibility to treat combinations of data/nodata in the inputs in a more sensible way.

I'm going to take the time to explain exactly why it's broken (since it's broken in at least 4 different ways). I would be very grateful if one of you could take the time to read this and think about the examples below for a while before replying.

The central problem here is the concept of 'ignoring' and the concept of 'nodata' when you have a) multiple input rasters, some of which may or may not have nodata, b) a calculation function which may or may not refer to all the rasters taken as input, and optionally c) a need to produce output in different ways depending on how exactly nodata appears among the input rasters. For example: two maps, one with high quality information and partial coverage, the other with low quality information and complete coverage.

> For what it's worth, I also support his suggestions

I believe that is because you do not understand what the old or new code is actually doing in practice, or the consequences/possibilities. I also think you are both unaware of some types of ordinary GIS work which involves rasters. Please don't be offended by my saying that, because I'm unaware of plenty of things too. I believe you will both change your minds quickly if you review the examples below and think about them.

Traditionally, in a single raster, a nodata value means that that area of the map is nodata. End of story. This is what most people are used to when they think about rasters.

But when you have multiple input rasters, and a function naming some or all of those inputs as a mapping to an output raster, it's no longer obvious that one nodata value in one input means the whole cell in the output must therefore be nodata too.

A real example (1): let's say I supply 5 files that I'm considering using as input data (A,B,C,D,E). And let's say I'm prototyping the function, so I change it around a bit over time to see what the effects are. First I try A+B+C, then I try A+B+C/D, ... and so on.

The first problem is that if the input raster E has some nodata values, then I get missing patches in my output from the function A+B+C. Crucially, this happens *even though E plays no part in the function assigning how output should be produced*. This behaviour is undocumented and, frankly, chaotic. Imagine if A, B, C have 98% coverage but E has 99.75% coverage. Suddenly you have a really hard-to-spot case of unexpected data loss! This happened to us - twice!

If a user writes that they want the output to equal A+B+C, they will not expect the calculator to ignore their function and start adding hidden clauses that also vary the output according to the data in the unmentioned D and E. That's a disastrous behaviour by the current code.

Example 2: I have 5 files on my input, A,B,C,D,E. This time I want to specify that the raster output contains "2" where there is nodata in B, "3" where there is nodata in C, and "100" where there is nodata in D. In short, I'm classifying nodata!

Can't you see we're not 'ignoring' nodata here? The entire function in this example is using only nodata information. You can't do this in gdal_calc currently. You can only do it with the new code path.

Example 3, again a real world example from many of our projects. I have 5 files on my input, A,B,C,D,E. I write a function, which says that if I have nodata on my first raster (A), well, that's important so I want to put nodata in my output for that case. But let's say that nodata in rasters B or C isn't really a big problem, we only want to use those rasters to fine-tune the classification if the data is available.

Or, let's say that I have 2 rasters, A and B. A is high quality but only partial coverage. B is low quality but total coverage. I want my function to use A+B where it's available, or B otherwise. This is a *completely normal type of activity in GIS*: classifying data in different ways depending on where exactly nodata is present in the inputs, when you have inputs of varying levels of completeness. E.g. I have a 10m DEM that covers the whole country, but 1m information for only some regions. I have that situation right now in one project, and I had the same situation in another project last year. You cannot do this in the current gdal_calc. I need to give the user the ability to decide how complex or partial nodata should be treated.
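As a concrete sketch of that last case (hypothetical file names; -9999 is an assumed nodata value for A), the rule "use A+B where A is available, otherwise fall back to B" becomes:

gdal_calc.py -A high_quality_partial.tif -B low_quality_full.tif --ManualNoData --calc="(A!=-9999)*(A+B) + (A==-9999)*B" --outfile merged.tif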

By inspection you can see that the current nodata handler in gdal_calc produces strange/unpredictable and undocumented output (example 1) and is not suitable for many practical tasks, such as prototyping (example 1) or making decisions based on maps with varying coverage (examples 2/3).

All this patch does is bypass the (broken, unpredictable, undocumented, inflexible) default nodata handler and give the user the opportunity to do something more clearly defined and flexible.

So... a correct option name must be something like "DisableNodataCellSkipping" or "FlexibleNodataHandling" or "ManualNodata". These names all describe accurately what is actually happening and even hint to the user about what they should do next (e.g. 'enable something else', 'make use of flexibility' or 'do it manually'). "IgnoreNoData" is factually wrong and misleading, and furthermore it gives no indication to the user that they must take another step after enabling the option.

Now, if you don't like ManualNoData, that's fine, but can we at least agree on a name that is 100% accurate rather than -100% accurate, i.e. literally describing the opposite of what the code will do when the option is set?

"I also think it would be good if you could enhance the documentation of gdal_calc.dox to describe the behaviour without and with the option."

If the code patch is accepted for inclusion I agree to write documentation to cover this.

"The current behaviour looks reasonable to me as a default one."

Do you still hold this view after the 3 real-world examples I just gave?

" Seems a bit extreme to call it "stupid"..."

I think when you've lost days of paid programmer/GIS analyst time (not cheap!) on a piece of code that causes unpredictable and undocumented data loss, you will start to think of it differently. However, I retract my use of the word 'stupid' since I have no wish to create more discussion than necessary here.

I propose we describe the current nodata behaviour in gdal_calc as "unpredictable, undocumented, potentially hazardous and completely unsuitable for many types of ordinary GIS work" instead. These descriptions are grounded in the examples above, and they describe what literally happened to our group... twice.

As for testing: as mentioned, the code is in production use, and the fault is obvious by code inspection in light of the examples above.

However, I can try to find time to add tests, but it will require multiple small rasters to be added to the test directory, and the test code will have to be simple. I leave for several weeks of holiday tomorrow, and I don't have any budgeted time for this when I come back. I propose we put in the fix and the documentation, and raise a new issue about adding the tests. Alternatively, you may already see how to construct simple tests based on examples 1/2/3 above.

That all said: if you both continue to insist that the name must be 'IgnoreNoData' after reading these examples, then I will unfortunately have to withdraw this patch and leave the broken behaviour in place. The nodata handling in gdal_calc is already dangerous/unpredictable today in the distributed GDAL! I will not make a bad situation even worse by adding a new feature with a name that is 100% the opposite of what it actually does...

Anyway, thanks for your time in reading this. I do appreciate that you have made some effort here, and I am sorry if I come across a bit annoyed; I don't mean to be. I am not annoyed with you, only with waiting 14 months to do something about the horrible behaviour in the current GDAL, which has damaged two of our projects and cost us plenty of money, and which can be fixed in just 3-5 lines of code.

Oh, one last thing. Besides the other faults, the default handler is also slow. Even if the user chooses to handle nodata in exactly the same way, but does so inside the main user-defined function, performance improves substantially. A small benefit, but a very useful one when you're transforming national scale rasters.

comment:9 by Kyle Shannon, 10 years ago

I have a few questions regarding operators on no data. Say you have three rasters, A, B, C. Let's say for a given cell, C is no data. How are the following evaluated:

A + C - B

The operator to apply between A and B is ambiguous if C is no data. Is it A - B or A + B?

A + B / C

Does this become A + B or divide by 0 or something else?

Whatever is applied has to work for all cases, and I'm not sure this will. In my mind, A + B + unknown = unknown, not A + B.

in reply to: comment 9 | comment:10 by Kyle Shannon, 10 years ago

Replying to kyle:

> I have a few questions regarding operators on no data. Say you have three rasters, A, B, C. Let's say for a given cell, C is no data. How are the following evaluated:
>
> A + C - B
>
> The operator to apply between A and B is ambiguous if C is no data. Is it A - B or A + B?
>
> A + B / C
>
> Does this become A + B or divide by 0 or something else?

Sorry, to be correct I should have added parentheses:

(A + B) / C

> Whatever is applied has to work for all cases, and I'm not sure this will. In my mind, A + B + unknown = unknown, not A + B.

comment:11 by gbell, 10 years ago

Hi Kyle,

Thanks for your comment. Fortunately, there is no need to ever reason about unknowns when writing code to manually handle the treatment of nodata.

The way we treat it is to hide a logical branch in the numpy mathematics as follows.

result = (C==NODATAVALUE)*(formula involving A and B) + (C!=NODATAVALUE)*(formula involving A and B and C)

If C is a nodata value, then the first bracket evaluates to 1, the second to 0, and so the first formula is used for the result.

If C is not a nodata value, the opposite occurs.

It's also possible to use the any() and all() functions for the same purpose. We also use numpy's 'select' function to keep the code easy to read and avoid building up many 'if's - a bit like using a switch statement in C when you start to have a lot of cases.
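To make that concrete, a small self-contained sketch (the values and the -9999 nodata sentinel are invented for illustration):

import numpy

NDV = -9999.0                           # assumed nodata sentinel
A = numpy.array([10.0, NDV, 30.0])
B = numpy.array([1.0, 2.0, NDV])

# Arithmetic-mask form: for each cell exactly one bracket is 1 and the
# other is 0, so this reads "use B where A is nodata, else use A+B".
result = (A == NDV) * B + (A != NDV) * (A + B)

# numpy.select form of a richer rule set: conditions are checked in
# order, like cases in a C switch statement.
result = numpy.select(
    [numpy.logical_and(A == NDV, B == NDV),   # both missing -> output nodata
     A == NDV,                                # only A missing -> fall back to B
     B == NDV],                               # only B missing -> fall back to A
    [NDV, B, A],
    default=A + B)                            # both valid -> normal formula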

This even allows us to extend the idea of nodata to cover 'NODATA, BADDATA, GOODDATA' etc.

We only ever reason about knowns.

Also, the equations above evaluate extremely quickly in practice. We're able to process national maps at 10m scale in a few minutes even when applying a mega-function consisting of 30 different formulas that cover all the different situations and combinations of data/nodata that can arise in our source maps.

We also wrote a compiler which translates an excel spreadsheet directly into numpy algebra so that we can extend our numpy formulas to any degree of complexity without raising the intellectual workload of understanding them or modifying them to alter the output. It works extremely effectively. It's open source: https://github.com/gbb/ruleparser

Universally in our GIS work we have found that when nodata shows up, it needs only a single extra 'branch' in numpy mathematics to address how it should be treated. So this approach doesn't even suffer in terms of branching complexity in practice.

You can also easily characterise how to respond to multiple NODATA sources within a single branch e.g. ((C==NODATAVALUE)+(D==NODATAVALUE)>0)*(correct_output_formula).

There's no question that the ability to treat different cases of nodata manually and say exactly what should happen is an incredibly useful option; we couldn't do our work without it.

But, there's no need to require a flexible solution for people who deal with trivial problems. That's why the proposed patch makes manual nodata handling an extra option, not the default.

Graeme.

comment:12 by Even Rouault, 9 years ago

Note: --IgnoreNoData would be more consistent, for example, with gdal_contour, which has a -inodata option that does exactly the same as the proposed patch.

-inodata: Ignore any nodata value implied in the dataset - treat all values as valid.

comment:13 by gbell, 9 years ago

Thanks for replying.

I see your argument from the viewpoint of consistency.

However, reviewing the examples/arguments from before, I'd say that's a bug in the naming/documentation.

When you set that option in gdal_contour, you do not:

"ignore any nodata value implied in the dataset"

but you do:

"treat all values as valid"

One path to resolving this might be to reword things a bit in both cases. I think the key problem is the phrase 'nodata values', which to me (and others) refers to values in the data which are set to 'nodata'.

How about if we try this compromise in both cases? (two changes below)

"IgnoreNoData" - description: Ignore any nodata *settings* for the dataset - treat all values as *data*.

What do you think? This is more technically accurate and would allow consistent naming.

comment:14 by Even Rouault, 9 years ago

Your proposed modified description looks good to me.

comment:15 by gbell, 9 years ago

How do we proceed from here?

in reply to: comment 15 | comment:16 by Even Rouault, 9 years ago

Replying to gbell:

How do we proceed from here?

You can prepare an updated patch, including doc and autotest additions.

comment:17 by Even Rouault, 5 years ago

Milestone: closed_because_of_github_migration
Resolution: wontfix
Status: new → closed

This ticket has been automatically closed because Trac is no longer used for GDAL bug tracking, since the project has migrated to GitHub. If you believe this ticket is still valid, you may file it to https://github.com/OSGeo/gdal/issues if it is not already reported there.
