Opened 9 years ago

Closed 8 years ago

#1276 closed defect (fixed)

r.null after v.to.rast in winGRASS does not work properly

Reported by: helena Owned by: grass-dev@…
Priority: critical Milestone: 6.4.2
Component: Raster Version: 6.4.1 RCs
Keywords: r.null, v.to.rast, wingrass Cc:
CPU: Unspecified Platform: MSWindows 7

Description

in the following example, when run in WinGRASS in command console

g.region swwake_30m -p
v.to.rast streets_wake out=streets_speed use=attr col=SPEED
r.null streets_speed null=5

the null values get replaced by 0 instead of 5. The example works OK on Mac and linux. This was tried independently on several Windows 7 machines, always producing wrong result.

Helena

Change History (27)

comment:1 Changed 9 years ago by hamish

Keywords: r.null wingrass added; null removed

comment:2 Changed 8 years ago by hamish

Milestone: 6.4.16.4.2

bug still present in wingrass 6.4.svn r48373 (Sept 2011), tested on Windows XP.

Hamish

comment:3 Changed 8 years ago by hamish

the null values get replaced by 0 instead of 5.

bug still present in wingrass 6.4.2rc3, tested on Windows XP.

very weird.

Hamish

comment:4 Changed 8 years ago by martinl

Priority: normalcritical

comment:5 Changed 8 years ago by hamish

Hi,

a debug message I threw into devbr6 shows that the "5" was making it into the variable ok.

if you convert streets_speed to a DCELL map from a CELL map:

  r.mapcalc "streets_speed.DCELL = double(streets_speed)"

then r.null works as expected.

I strongly suspect the trouble is on this line of r.null/mask.c:

  if (change_null && G_is_null_value(rast, data_type))
	G_set_raster_value_d(rast, new_null, data_type);

new_null is always sent as DCELL, but data_type changes.

(n.b. for CELL maps the command line value for new_null is laundered into something CELL compatible by main.c when it is first read in, but only if you enter a non-integer for null=)

Hamish

comment:6 in reply to:  5 ; Changed 8 years ago by glynn

Replying to hamish:

I strongly suspect the trouble is on this line of r.null/mask.c:

  if (change_null && G_is_null_value(rast, data_type))
	G_set_raster_value_d(rast, new_null, data_type);

new_null is always sent as DCELL, but data_type changes.

That isn't a problem; as the _d suffix indicates, the source value is always DCELL; data_type indicates the type of the destination. The whole point of G_{get,set}_raster_value_{c,f,d} is to convert values between a type which is fixed at compile time and a type which is only known at run time.

My suspicion is that the null bitmap is being updated but the cell/fcell file isn't. Null values always have a zero stored in the cell/fcell file, so clearing the null bitmap without updating the cell/fcell file will replace nulls with zeros. If this is the case, it should be straightforward to confirm it by checking the timestamps on the files.

That this only happens on Windows suggests that it may be related to Windows' different filesystem semantics (e.g. not being able to rename/remove open files). Also, r.null is unusual in that it opens the same map for both reading and writing. It does close the input map before the output map.

If it only happens on Windows 7 and Vista (but not XP): is GISDBASE in a directory which is VirtualStore'd (e.g. Program Files)? That could potentially cause these exact symptoms. The null bitmap is modified in-place, but the cell/fcell files are generated in the temporary directory then rename()d over the original file when closed. This might result in the new cell/fcell file being silently redirected to the VirtualStore directory (%LOCALAPPDATA%\VirtualStore); although I don't know why it wouldn't then be read instead of the original file.

Also, I notice that close_new() (in lib/gis/closecell.c) doesn't examine the return codes from either close() or remove(); it may be worth checking those.

comment:7 Changed 8 years ago by hamish

when testing this (in XP, where it fails too), to reset the map-to-be-r.null'd I had been rerunning with the overwrite flag:

v.to.rast streets_wake out=streets_speed use=attr col=SPEED --o

it happened that I got an error that it could not move the created map from .tmp/ into the main mapset dir, presumably because some part of it was still an open file. I though to save that for another bug report as it seemed to only happen the 4th or 5th time overwriting it, but maybe it is relevant or a sign some more general problem.

Hamish (offline/in the field until next week or so)

comment:8 Changed 8 years ago by hamish

presumably because some part of it was still an open file.

ie I'd guess the map to be overwritten was still open, as the PID in the .tmp file changed but the error persisted once it started.

H

comment:9 in reply to:  7 Changed 8 years ago by glynn

Replying to hamish:

when testing this (in XP, where it fails too), to reset the map-to-be-r.null'd I had been rerunning with the overwrite flag:

v.to.rast streets_wake out=streets_speed use=attr col=SPEED --o

it happened that I got an error that it could not move the created map from .tmp/ into the main mapset dir, presumably because some part of it was still an open file.

That may be a separate issue. Vector commands often access a database for attributes, which means spawning a DBMI driver. There have been problems on Windows which might have been caused by the DBMI driver inheriting a descriptor from the parent, keeping the file open.

That isn't applicable to r.null, which doesn't spawn any child processes.

comment:10 in reply to:  6 ; Changed 8 years ago by hamish

Replying to glynn:

My suspicion is that the null bitmap is being updated but the cell/fcell file isn't. Null values always have a zero stored in the cell/fcell file, so clearing the null bitmap without updating the cell/fcell file will replace nulls with zeros. If this is the case, it should be straightforward to confirm it by checking the timestamps on the files.

From the timestamps (and r.info showing the correct range) I can confirm that cell_misc/$MAP/null and cell_misc/$MAP/range are updated when running r.null on Windows XP, but cell/$MAP's timestamp is not updated.

That this only happens on Windows suggests that it may be related to Windows' different filesystem semantics (e.g. not being able to rename/remove open files). Also, r.null is unusual in that it opens the same map for both reading and writing. It does close the input map before the output map.

...

Also, I notice that close_new() (in lib/gis/closecell.c) doesn't examine the return codes from either close() or remove(); it may be worth checking those.

done in r50404 in devbr6.

bug: close(fd) is called twice, (??)

https://trac.osgeo.org/grass/browser/grass/branches/develbranch_6/lib/gis/closecell.c#L287

Hamish

comment:11 in reply to:  10 Changed 8 years ago by glynn

Replying to hamish:

From the timestamps (and r.info showing the correct range) I can confirm that cell_misc/$MAP/null and cell_misc/$MAP/range are updated when running r.null on Windows XP, but cell/$MAP's timestamp is not updated.

AFAICT, the only way for G_close_cell() to not delete that file and not print a warning is if Gopen_null_write() fails at line 207. But that should result in the null file not being updated (contrary to my previous comment, the null file isn't updated in-place; it's handled the same way as the cell/fcell file).

Otherwise, it should attempt to rename the file, and any failure should generate a warning. There aren't any return statements in close_new() between renaming the null file and renaming the cell/fcell file, so I don't know how this situation can arise.

bug: close(fd) is called twice, (??)

The first one (line 288) should be removed.

comment:12 in reply to:  10 ; Changed 8 years ago by hamish

Replying to glynn:

Also, I notice that close_new() (in lib/gis/closecell.c) doesn't examine the return codes from either close() or remove(); it may be worth checking those.

Replying to hamish:

done in r50404 in devbr6.

hmmm, now r.to.vect (and r.circle, and probably many others) reports a bunch of problems, even on linux:

WARNING: Unable to delete prior null-cells file
WARNING: Unable to delete the temporary null-cells file
WARNING: Unable to delete prior 'fcell' file
WARNING: Unable to delete the f_format file
WARNING: Unable to delete the prior 'cell' file
WARNING: Unable to delete the temporary 'cell' file
WARNING: Unable to delete the f_quant file

but no such warnings appear when running r.null on the latest nightly build of 6.5svn on Windows XP. (& problem persists)

AFAICT, the only way for G_close_cell() to not delete that file and not print a warning is if G__open_null_write() fails at line 207.

aside:

            null_fd = G__open_null_write(fd);
            if (null_fd <= 0)
                return -1;
            if (null_fd < 1)
                return -1;

(G__open_null_write() returns -1 on failure otherwise a new fd for the null file)

more needless duplication. I'm getting the idea that there may be more sloppy code in this file..

bug: close(fd) is called twice, (??)

fixed in devbr6.

Otherwise, it should attempt to rename the file, and any failure should generate a warning. There aren't any return statements in close_new() between renaming the null file and renaming the cell/fcell file, so I don't know how this situation can arise.

maybe this part of the code isn't being entered:

if (ok && (fcb->temp_name != NULL)) {

? I've added some more debug messages, we'll see what they have to say.

Hamish

comment:13 in reply to:  12 ; Changed 8 years ago by glynn

Replying to hamish:

hmmm, now r.to.vect (and r.circle, and probably many others) reports a bunch of problems, even on linux:

WARNING: Unable to delete prior null-cells file
WARNING: Unable to delete prior 'fcell' file
WARNING: Unable to delete the prior 'cell' file
WARNING: Unable to delete the f_format file
WARNING: Unable to delete the f_quant file

These are likely a consequence of the original files not existing.

Any previous null, cell or fcell file is always removed prior to renaming the temporary file into place (on Windows, rename() fails if the destination exists). When writing a new map (rather than replacing an existing map), these files won't exist.

When writing an integer map, any existing fcell, f_format or f_quant file is removed. If it isn't replacing an existing map, or the existing map was an integer map, these files won't exist.

WARNING: Unable to delete the temporary null-cells file
WARNING: Unable to delete the temporary 'cell' file

These probably arise from trying to remove the temporary file after renaming has succeeded. Those remove() calls should either be removed, or moved to the other branch of the conditional (i.e. if rename() fails, remove() the file instead).

However: the first one can also occur if the temporary null file is empty (which indicates an error writing the raster data).

maybe this part of the code isn't being entered:

if (ok && (fcb->temp_name != NULL)) {

Not sure how that could happen. "ok" is true for G_close_cell() and false for G_unopen_cell(). r.null only calls G_unopen_cell() if the loop is terminated prematurely, in which case it prints a warning. fcb->tempname is set to the result of G_tempfile(), which is never null. The assignment is unconditional, and G__open_raster_new() never returns prematurely without generating a warning.

comment:14 in reply to:  13 ; Changed 8 years ago by hamish

Replying to glynn:

Any previous null, cell or fcell file is always removed prior to renaming the temporary file into place (on Windows, rename() fails if the destination exists). When writing a new map (rather than replacing an existing map), these files won't exist.

When writing an integer map, any existing fcell, f_format or f_quant file is removed. If it isn't replacing an existing map, or the existing map was an integer map, these files won't exist.

...

These probably arise from trying to remove the temporary file after renaming has succeeded. Those remove() calls should either be removed, or moved to the other branch of the conditional (i.e. if rename() fails, remove() the file instead).

However: the first one can also occur if the temporary null file is empty (which indicates an error writing the raster data).

rather than just adding a only-remove if it exists (using e.g. stat()) which may hide an error, it will be better to use the branch conditionals. (more work, but worth it)

ok, the annoyance of seeing those warnings all the time will be motivating for the task..

maybe this part of the code isn't being entered:

if (ok && (fcb->temp_name != NULL)) {

Not sure how that could happen.

the new debug message inside that conditional confirms that it is getting into there, so it's not that.

trying 'v.to.rast --overwrite' to replace the cell/ file still gives me problems:

...
WARNING: Unable to delete the f_format file
WARNING: Unable to delete prior 'cell' file   ## (!! but it exists)
WARNING: closecell: can't move [C:/Documents and Settings/User/My
  Documents/.../.tmp/1612.0] to cell file [.../cell/streets_speed]
WARNING: Unable to delete the f_quant file

I wonder if the reason it can't remove the prior 'cell' file and r.null failing are the same. (but why no error message from r.null in that case?)

... next step will be to add 'errno' reporting to the failed remove() prior 'cell' file message while rerunning v.to.rast with --overwrite.

Hamish

comment:15 in reply to:  14 ; Changed 8 years ago by hamish

Replying to hamish:

... next step will be to add 'errno' reporting to the failed remove() prior 'cell' file message while rerunning v.to.rast with --overwrite.

ok, with perror(path) in place I see .../cell/mapname: Permission denied. Checking in the Windows Task Manager I see that a leftover dbf.exe and cmd.exe are still running after v.to.rast has finished. If I kill the dbf.exe the cmd.exe goes away too, then r.null seems to work. I'm not sure why dbf.exe would have the cell file open, I guess it is actually the residual cmd.exe which is doing that. (?)

Interestingly with r.null I just get the perror() message but not the G_warning() from the failed remove(). !?

we saw this same zombie dbf.exe problem before in #1184. In that case Martin discovered db_close_database_shutdown_driver() wasn't called. It looks like v.to.rast has the same problem: hopefully r50487 (devbr6) fixes it.

Hamish

comment:16 in reply to:  15 Changed 8 years ago by hamish

Replying to hamish:

db_close_database_shutdown_driver() wasn't called.

are we going to see this same zombie dbf.exe problem on Windows whenever an exit via G_fatal_error() happens before closing the driver? e.g. because of a typo in the column name or the column exists but its type is wrong for the action? (or 1000 other ways things might fail)

Hamish

comment:17 in reply to:  14 ; Changed 8 years ago by glynn

Replying to hamish:

rather than just adding a only-remove if it exists (using e.g. stat()) which may hide an error, it will be better to use the branch conditionals. (more work, but worth it)

Er, what?

ok, the annoyance of seeing those warnings all the time will be motivating for the task..

The warnings shouldn't have been added in the first place. The remove() is only necessary on Windows; on Unix, rename() will replace the destination if it exists. On either platform, if the remove() fails, the subsequent rename() should also fail, so there's no particular need to check the result of remove() (other than for debugging this particular issue).

trying 'v.to.rast --overwrite' to replace the cell/ file still gives me problems:

This ticket relates to r.null. As previously mentioned, vector modules have their own issues, which belong on a separate ticket.

Replying to hamish:

ok, with perror(path) in place I see .../cell/mapname: Permission denied. Checking in the Windows Task Manager I see that a leftover dbf.exe and cmd.exe are still running after v.to.rast has finished. If I kill the dbf.exe the cmd.exe goes away too, then r.null seems to work.

So this is specifically about running r.null on a map created by a vector module? Does r.null work correctly on maps not created by vector modules? If so, the title should be updated to reflect that.

In that case, the problem is one of the DBMI driver failing to terminate along with its parent. I note that dbrecv_procnum() only returns DB_EOF if dbrecv() returns 0 (indicating EOF). If it returns an error, it will return DB_OK. Try r50496.

comment:18 in reply to:  17 ; Changed 8 years ago by hamish

Keywords: v.to.rast added
Summary: r.null in winGRASS does not work properlyr.null after v.to.rast in winGRASS does not work properly

Replying to glynn:

Er, what?

just that the warning messages need to be better targeted so that they only print in cases when they are meaningful. (and don't try to remove() a file which doesn't exist, etc)

So this is specifically about running r.null on a map created by a vector module? Does r.null work correctly on maps not created by vector modules? If so, the title should be updated to reflect that.

The failing r.null is the symptom, I would guess that g.remove rast for that map would also fail (untested) as the zombie v.to.rast process would still have the cell file open.

In that case, the problem is one of the DBMI driver failing to terminate along with its parent. I note that db__recv_procnum() only returns DB_EOF if db__recv() returns 0 (indicating EOF). If it returns an error, it will return DB_OK. Try r50496.

is r50497 "SF_CLOSE_DESCRIPTOR should actually close the descriptor" related? should it be backported?

Hamish

comment:19 in reply to:  17 ; Changed 8 years ago by hamish

Replying to glynn:

In that case, the problem is one of the DBMI driver failing to terminate along with its parent. I note that dbrecv_procnum() only returns DB_EOF if dbrecv() returns 0 (indicating EOF). If it returns an error, it will return DB_OK. Try r50496.

r50496 causes a dbmi: Protocol error, e.g. with v.info -c mapname.

Hamish

comment:20 in reply to:  18 Changed 8 years ago by glynn

Replying to hamish:

is r50497 "SF_CLOSE_DESCRIPTOR should actually close the descriptor" related? should it be backported?

Maybe. It's possible that the DBMI driver is inheriting the client ends of the pipes (i.e. the write end of the driver's stdin and the read end of its stdout). If that is the case, it would prevent the driver from receiving EOF on stdin when the client dies.

comment:21 in reply to:  19 Changed 8 years ago by glynn

Replying to hamish:

r50496 causes a dbmi: Protocol error, e.g. with v.info -c mapname.

Reverted.

For some reason, I thought that db__recv() was returning the size or error code directly from readn() etc, when it's a boolean status.

comment:22 Changed 8 years ago by hamish

probably unrelated, but fyi,

spawn.c: In function 'parse_argvec':
spawn.c:739: warning: cast from pointer to integer of different size
spawn.c:742: warning: cast from pointer to integer of different size
spawn.c:748: warning: cast from pointer to integer of different size
spawn.c:749: warning: cast from pointer to integer of different size
spawn.c:755: warning: cast from pointer to integer of different size
spawn.c:762: warning: cast from pointer to integer of different size
spawn.c:763: warning: cast from pointer to integer of different size
spawn.c:764: warning: cast from pointer to integer of different size

comment:23 in reply to:  22 Changed 8 years ago by glynn

Replying to hamish:

probably unrelated, but fyi,

spawn.c: In function 'parse_argvec':
spawn.c:739: warning: cast from pointer to integer of different size

Known issue, not actually a problem, no obvious solution.

Casting from pointer to a larger integer type to int would eliminate the warning, but there's no obvious candidate for the larger integer type. C89 has neither intptr_t nor "long long"; in fact, there's no guarantee that a given C89 has any integral type capable of holding a pointer.

In the cases where the cast occurs, the pointer is required to be an "int" which has been cast to a pointer, so casting back to "int" isn't a problem.

comment:24 Changed 8 years ago by hamish

Replying to hamish:

is r50497 "SF_CLOSE_DESCRIPTOR should actually close the descriptor" related? should it be backported?

Replying to glynn:

Maybe. It's possible that the DBMI driver is inheriting the client ends of the pipes (i.e. the write end of the driver's stdin and the read end of its stdout). If that is the case, it would prevent the driver from receiving EOF on stdin when the client dies.

backported to devbr6 in r50529, but no luck.

the driver now being closed (or not opened) the 6.4 and 6.5 nightly wingrass builds allows r.null to work as it should; and the primary subject of this ticket is resolved.

but if the module exits with a G_fatal_error (e.g. if you typo col=speed2 which doesn't exist) the dbf.exe and cmd.exe are left running, even when you exit grass.

Hamish

comment:25 in reply to:  24 ; Changed 8 years ago by glynn

Replying to hamish:

backported to devbr6 in r50529, but no luck.

Can you try r50563?

comment:26 in reply to:  25 Changed 8 years ago by hamish

Replying to glynn:

Replying to hamish:

backported to devbr6 in r50529, but no luck.

Can you try r50563?

backported to devbr6 in r50573, and tested with 6.5svn nightly build r50577, but still no luck. passing v.to.rast a bad column name leading to a G_fatal_error() still leaves dbf.exe and cmd.exe zombie processes behind.

try, try again, Hamish

comment:27 Changed 8 years ago by hamish

Resolution: fixed
Status: newclosed

missing db_close_database_shutdown_driver() added in v.to.rast, allowing r.null to work. closing the bug.

zombie dbf.exe on error problem moved into ticket #1579

Hamish

Note: See TracTickets for help on using tickets.