Opened 9 years ago

Closed 9 years ago

Last modified 9 years ago

#1161 closed defect (invalid)

g.region and r.info decimal issue when using grass python libs

Reported by: isaacullah Owned by: grass-dev@…
Priority: normal Milestone: 6.4.1
Component: Python Version: 6.4.0
Keywords: g.region, precision Cc: cmbarton
CPU: All Platform: All

Description

When running grass.region() and grass.raster_info(), values come out in double precision, which is NOT what you get if you run g.region or r.info, where they come out in single precision. This can cause region misalignments and problems with boolean comparisons in scripts.

>>> grass.run_command('r.info', map = 'BETA_elevation41', flags = 'g')
north=4293588.60267
south=4284488.78578
east=737425.161232
west=723395.443544
0
>>> grass.raster_info('BETA_elevation41')
{'north': 4293588.6026699999, 'timestamp': '"none"', 'min': 387.10700417163201, 'datatype': 'DCELL', 'max': 1339.129374287, 'ewres': 4.9998993900000004, 'vertical_datum': '', 'west': 723395.44354400004, 'units': '', 'title': ' (BETA_elevation41)', 'east': 737425.16123199998, 'nsres': 4.9998993900000004, 'south': 4284488.7857799996}
>>> grass.run_command('g.region', flags = 'gp')
n=4293588.60267
s=4284488.78578
w=723395.443544
e=737425.161232
nsres=4.99989939
ewres=4.99989939
rows=1820
cols=2806
cells=5106920
0
>>> grass.region()
{'rows': 1820, 'e': 737425.16123199998, 'cells': 5106920.0, 'cols': 2806, 'n': 4293588.6026699999, 's': 4284488.7857799996, 'w': 723395.44354400004, 'ewres': 4.9998993900000004, 'nsres': 4.9998993900000004}

Change History (14)

comment:1 Changed 9 years ago by hamish

see also #335: "export floats and doubles with correct precision"

comment:2 in reply to:  description ; Changed 9 years ago by glynn

Resolution: invalid
Status: newclosed

Replying to isaacullah:

When running grass.region() and grass.raster_info(), values come out in double precision, which is NOT what you get if you run g.region or r.info, where they come out in single precision. This can cause region misalignments and problems with boolean comparisons in scripts.

These functions simply parse the (decimal) output from g.region and r.info. Python has printf-like formatting operations if you wish to use them.

comment:3 in reply to:  2 Changed 9 years ago by cmbarton

Replying to glynn:

Replying to isaacullah:

When running grass.region() and grass.raster_info(), values come out in double precision, which is NOT what you get if you run g.region or r.info, where they come out in single precision. This can cause region misalignments and problems with boolean comparisons in scripts.

These functions simply parse the (decimal) output from g.region and r.info. Python has printf-like formatting operations if you wish to use them.

Actually this is not what seems to be happening. g.region and r.info produce single precision values, as expected. But the python library functions do not seem to be getting values from these--or are doing something strange with the values after the fact--in order to come up with double precision values. The result is that the values in the dictionary produced by grass.region() and grass.raster_info() are *different* from the values that come from g.region or r.info. Therein lies the problem.

A region set using g.region is different from a region set using grass.region(). The difference is not much but it is enough to cause problems if you are comparing regions in a boolean way or trying to overlay maps created with a setting in g.region and maps created with a setting from grass.region().

My only guess is that somehow grass.region() is populating its dictionary via a swig/ctype call instead of just parsing g.region. If this guess is wrong, then something else is happening to the values after they are generated by g.region and before they go into the python dictionary.

comment:4 Changed 9 years ago by glynn

Replying to cmbarton:

These functions simply parse the (decimal) output from g.region and r.info. Python has printf-like formatting operations if you wish to use them.

Actually this is not what seems to be happening.

Yes it is.

g.region and r.info produce single precision values, as expected.

g.region and r.info produce decimal values using G_format_{northing,easting}, which uses either %.15g or %.8f (except for lat/lon, which uses DMS). Both of these are better than IEEE single-precision which has between 6 and 7 decimal digits. %.15g uses 15 decimal digits (trailing zeros after the decimal pointer are omitted, as is the decimal point itself if it is not required); %.8f uses as many digits as are required before the decimal point and a further 8 digits after it.

But the python library functions do not seem to be getting values from these--

The Python functions are wrappers around "g.region -g" and "r.info -rgstmpud", which parse the output into a dictionary, with the strings parsed using float(), int() or float_or_dms() as appropriate.

or are doing something strange with the values after the fact

Yes; if by "strange" you mean converting them to (double precision) binary floating point values (which is a lossy operation; 10-n (for n >= 1) isn't exactly representable in binary).

OTOH, that isn't all that strange, given that the values started out as floating point before g.region/r.info converted them to decimal (which itself may be lossy; %.15g isn't quite enough for double precision, which has slightly better than 15 decimal digits of precision).

--in order to come up with double precision values. The result is that the values in the dictionary produced by grass.region() and grass.raster_info() are *different* from the values that come from g.region or r.info. Therein lies the problem.

The values which come from g.region or r.info are strings, each comprising a decimal representation of a number. Most of the things which you might want to do with that information will expect numbers rather than strings, so the Python functions convert them to numbers automatically.

We could use Python's "decimal" package, although that doesn't work with everything, still doesn't necessarily give you the original value, and serves no purpose other than to work around bugs in scripts which expect to be able to perform floating-point comparisons using "==" or (worse still) string comparison. But if someone is making that kind of mistake, they will have far bigger problems.

If you really need the exact output from g.region/r.info, use grass.parse_command() (which will parse key/value output into a dictionary but will leave the values as strings). But don't expect other commands to return identical strings for the same information; there is no one "correct" format string for coordinates.

A region set using g.region is different from a region set using grass.region(). The difference is not much

In the example give, it's around 10 microns. I'm not convinced that there is a single set of geospatial data in existence which genuinely has that accuracy.

but it is enough to cause problems if you are comparing regions in a boolean way

Which is a bug, and not one which will be solved by any changes to the Python library. Any program which parses the output from g.region or r.info will have exactly the same issues.

or trying to overlay maps created with a setting in g.region and maps created with a setting from grass.region().

Even on the largest map, the differences are nowhere near half a cell, which is what would be required to move the sample point into the next cell.

My only guess is that somehow grass.region() is populating its dictionary via a swig/ctype call instead of just parsing g.region.

It's just parsing the output from "g.region -g" via Python's float() operator:

http://trac.osgeo.org/grass/browser/grass/trunk/lib/python/core.py#L525

http://trac.osgeo.org/grass/browser/grass/trunk/lib/python/core.py#L485

If this guess is wrong, then something else is happening to the values after they are generated by g.region and before they go into the python dictionary.

The only "something else" is that g.region() parses the decimal string to a float, and "print" converts it back to a decimal string. Both of these operations are lossy. But then just about anything which you do with a floating-point value is lossy, including parsing the values from the WIND/cellhd file in the first place.

Parsing a decimal string to a floating-point value is inherently lossy. Converting a floating-point value to a decimal isn't inherently lossy but in practice you invariably use far fewer digits than are required for an exact representation, as the exact representation requires roughly 3 times as many digits as are necessary for a unique representation.

comment:5 Changed 9 years ago by cmbarton

Thanks for the explanation. The difference between the g.region representation to the shell and the information in the python dictionary does seem a result of lossy conversion (e.g., 4.9998993900000004).

The reason we have run into to this is due to using GRASS outside the normal (and now pythonized) GRASS environment. Our Java-based ABM of farming households is running GRASS modules like g.region to get and change region settings. However, it is also running Python-based GRASS scripts that run surface process models and related routines. The latter use the GRASS Python library functions.

Maybe there is no solution to this, but it seems that g.region and grass.region() *ought* to be returning exactly the same values. At least, without knowing what you have explained, the normal assumption is that these would match.

If we have Java (or perhaps our Python scripts) round grass.region() values to the maximum number of significant digits output to the shell, do you think we could be assured that the values would match exactly? If so, what about making this an optional argument (i.e, match g.region shell output) for grass.region(), grass.raster_info(), and other Python library functions that return region boundaries?

Michael

comment:6 in reply to:  5 ; Changed 9 years ago by glynn

Replying to cmbarton:

Thanks for the explanation. The difference between the g.region representation to the shell and the information in the python dictionary does seem a result of lossy conversion (e.g., 4.9998993900000004).

How did you determine what is really in the Python dictionary? Bear in mind that displaying the value in decimal typically performs another lossy conversion, and Python's decimal formatting behaviour is known to be suboptimal (i.e. it will typically use too many decimal places). E.g.:

> x = 5.1
> x
5.0999999999999996
> str(x)
'5.1'
> repr(x)
'5.0999999999999996'

Note that the longer version is probably closer to the exact decimal representation of the binary floating-point value, while the shorter version is probably the shortest decimal value which would convert to the given binary floating-point value.

Or, more relevant to the original report:

> north = 4293588.60267
> north
4293588.6026699999
> str(north)
'4293588.60267'
> repr(north)
'4293588.6026699999'
> '%f' % north
'4293588.602670'
> '%.5f' % north
'4293588.60267'

Maybe there is no solution to this, but it seems that g.region and grass.region() *ought* to be returning exactly the same values. At least, without knowing what you have explained, the normal assumption is that these would match.

g.region returns strings; grass.script.region() returns numbers (specifically, Python "float"s, which on any modern system will be IEEE double-precision floating-point values). If you're going to be using those values for anything, they will get converted to floating-point at some point, and it's a safe bet that you will get exactly the same result regardless of whatever performs the conversion.

If we have Java (or perhaps our Python scripts) round grass.region() values to the maximum number of significant digits output to the shell, do you think we could be assured that the values would match exactly? If so, what about making this an optional argument (i.e, match g.region shell output) for grass.region(), grass.raster_info(), and other Python library functions that return region boundaries?

Given that coordinates are limited by the circumference of the earth, both %.15g and %.8f should be converted without loss (i.e. adjacent values should have distinct floating-point representations), so if you format as decimal with the correct number of digits, you should get the original value. This has to be done when you convert to decimal, though; you can't "tag" floating-point values with formatting options.

If you're comparing floating-point values, you should be comparing to within some tolerance. I'd suggest ten microns to allow for %.6f (which is the default for %f if no precision is given).

On x86, it's impractical to perform exact comparisons for most floating point values, as the CPU uses 80 bits ("long double") internally but only 64 bits if the value is stored in a "double", which means that "x = y; if (x == y) ..." can be false.

comment:7 Changed 9 years ago by cmbarton

Thanks again. We can follow this to ensure the matches within Java. I just wondered if this would be better accomplished within grass.region(). This returns a dictionary (seen in our original post). I can't say exactly where the lossy conversion is happening--going into the dictionary, coming out of the dictionary, or at some other point. It's clear from your explanation that the representation of these values can change at various places. This all makes sense. But still most users would expect grass.run_command('g.region', flags='gp') and grass.region() to give the same results.

comment:8 in reply to:  6 ; Changed 9 years ago by hamish

Replying to glynn:

Given that coordinates are limited by the circumference of the earth, both %.15g and %.8f should be converted without loss (i.e. adjacent values should have distinct floating-point representations), so if you format as decimal with the correct number of digits, you should get the original value. This has to be done when you convert to decimal, though; you can't "tag" floating-point values with formatting options.

note that for lat/lon you'll need to go to %.12g or so.

I guess what I find weird is that g.region will never report with precision > .15g, which is exactly representable & will never saturate the double-prec bitspace, and so the python repr() example above is perhaps using %.16f internally?

 echo 5.1 | awk '{printf("%.16f\n", $1)}'

?, Hamish

comment:9 Changed 9 years ago by hamish

Summary: g.region and r.info decimel issue when using grass python libsg.region and r.info decimal issue when using grass python libs

aka printing the "lossy" version at %.15g might give you the "exact" version you'd expect, even if in practice it is just a human readability thing.

comment:10 in reply to:  8 ; Changed 9 years ago by hamish

Keywords: g.region precision added

Replying to hamish:

I guess what I find weird is that g.region will never report with precision > .15g, which is exactly representable

well, no, as in the "5.1" example. but reproducible anyhow..

& will never saturate the double-prec bitspace, and so the python repr() example above is perhaps using %.16f internally?

anyway, if it matters python's "print" gives it back to you in the form you were expecting,

>>> n = 4293588.60267
>>> n
4293588.6026699999
>>> print n
4293588.60267

Hamish

ps- do you mind if I move LandDyn/? in addons svn to raster/LandDyn?

comment:11 in reply to:  10 Changed 9 years ago by cmbarton

Cc: cmbarton added

Replying to hamish:

Replying to hamish:

I guess what I find weird is that g.region will never report with precision > .15g, which is exactly representable

well, no, as in the "5.1" example. but reproducible anyhow..

& will never saturate the double-prec bitspace, and so the python repr() example above is perhaps using %.16f internally?

anyway, if it matters python's "print" gives it back to you in the form you were expecting,

>>> n = 4293588.60267
>>> n
4293588.6026699999
>>> print n
4293588.60267

Hamish

ps- do you mind if I move LandDyn/? in addons svn to raster/LandDyn?

That's fine. But we need to copy Isaac, as he is maintaining this.

Michael

comment:12 in reply to:  7 ; Changed 9 years ago by glynn

Replying to cmbarton:

Thanks again. We can follow this to ensure the matches within Java. I just wondered if this would be better accomplished within grass.region(). This returns a dictionary (seen in our original post). I can't say exactly where the lossy conversion is happening--going into the dictionary, coming out of the dictionary, or at some other point.

To be honest, I think that you're getting confused by Python's repr() (which is what's used to display the result when you enter an expression at the interactive prompt) using excess precision. Converting from decimal to floating-point results in an approximation, then repr() shows the approximated value exactly.

It's clear from your explanation that the representation of these values can change at various places. This all makes sense. But still most users would expect grass.run_command('g.region', flags='gp') and grass.region() to give the same results.

They can't give the same results. One returns strings, the other numbers. These can never be the same thing:

> 4293588.60267 == '4293588.60267'
False

Even if you use you use run_command('g.region') to get the strings, as soon as you pass those strings to any GRASS module, they're going to be converted to floating-point.

comment:13 in reply to:  12 ; Changed 9 years ago by cmbarton

Replying to glynn:

Replying to cmbarton:

Thanks again. We can follow this to ensure the matches within Java. I just wondered if this would be better accomplished within grass.region(). This returns a dictionary (seen in our original post). I can't say exactly where the lossy conversion is happening--going into the dictionary, coming out of the dictionary, or at some other point.

To be honest, I think that you're getting confused by Python's repr() (which is what's used to display the result when you enter an expression at the interactive prompt) using excess precision. Converting from decimal to floating-point results in an approximation, then repr() shows the approximated value exactly.

It's clear from your explanation that the representation of these values can change at various places. This all makes sense. But still most users would expect grass.run_command('g.region', flags='gp') and grass.region() to give the same results.

They can't give the same results. One returns strings, the other numbers. These can never be the same thing:

> 4293588.60267 == '4293588.60267'
False

Even if you use you use run_command('g.region') to get the strings, as soon as you pass those strings to any GRASS module, they're going to be converted to floating-point.

What I was thinking about was whether it is worth it or not to have an option to do a string conversion of these within the python library function so that the resulting dictionary has values that match the results (also string) of g.region. Within the GRASS environment, including Python, it doesn't really make any difference, since it is clear that str(extent) or print(extent) will produce the same result as g.region.

The place where it could be helpful is in in a case like we have--where a Python script [using grass.region()] is being called and parsed by Java and g.region is being called and parsed by Java. Maybe the Java equivalent of the Python str() will also give the same answer or maybe it won't. This goes for other languages. I simply don't know--though we will run some tests to see. Perhaps this is a rare enough case, that it isn't worth the trouble to provide this option.

comment:14 in reply to:  13 Changed 9 years ago by glynn

Replying to cmbarton:

What I was thinking about was whether it is worth it or not to have an option to do a string conversion of these within the python library function so that the resulting dictionary has values that match the results (also string) of g.region.

Those functions are intended for the situation where you want to actually use the values (e.g. perform arithmetic on them, which you can't do with strings).

If you want strings, you may as well just use grass.parse_command('g.region', ...) etc. parse_command() parses key-value output into a dictionary, but doesn't parse the values further unless specifically instructed via the val_type argument.

The place where it could be helpful is in in a case like we have--where a Python script [using grass.region()] is being called and parsed by Java and g.region is being called and parsed by Java. Maybe the Java equivalent of the Python str() will also give the same answer or maybe it won't.

Why would the Java code be converting them to strings? Assuming that you pass it strings via the command line, I would expect it to just convert them to float/double. Any exact comparison is fragile, and an exact comparison using strings even more so (e.g. 1.2 and 1.20 should compare equal as floats but the strings "1.2" and "1.20" won't compare equal).

This goes for other languages. I simply don't know--though we will run some tests to see. Perhaps this is a rare enough case, that it isn't worth the trouble to provide this option.

It isn't worth the trouble in any case. Code which performs equality comparisons on coordinates should itself be fixed, rather than expecting other code to be modified in order to satisfy its (incorrect) assumptions. You should never perform equality comparisons on floats unless you know that the values are exactly representable. Even then, you need to be careful about optimisations such as -funsafe-math-optimizations (and broken compilers; Borland C always performs such optimisations, with no means to disable them).

You really shouldn't rely upon anything preserving floating-point values exactly, not just conversions to/from decimal. E.g. even without changing the WIND file, the values g.region gives for nsres/ewres can change depending upon the compiler and compilation switches used. Code which works fine on one architecture can break on a different architecture, e.g. due to the 80-bit internal precision of x86, default rounding mode, etc.

Note: See TracTickets for help on using tickets.