Opened 11 years ago

Closed 7 years ago

#520 closed defect (fixed)

r.terraflow/iostream does not respect STREAM_DIR option

Reported by: mmetz Owned by: grass-dev@…
Priority: major Milestone: 6.4.3
Component: Raster Version: svn-releasebranch64
Keywords: r.terraflow, tmp Cc:
CPU: All Platform: All

Description

In r.terraflow, the default directory for temporary files is set to /var/tmp, contrary to GRASS policy. Further on, specifying a temporary directory with option STREAM_DIR is ignored, /var/tmp is used anyway.

For the analysis I want to run, r.terraflow wants "at least 25.12G" (quote from r.terraflow output) of disk space. That much space is not available on the partition where my /var/tmp is. For such cases I have a separate hard drive with several 100 GB of free space and tuned for speed. At least for me, r.terraflow does not work on massive grids as long as it does not use the given STREAM_DIR.

devbr_6 r36206 Linux 64bit

BTW, r.watershed does the job for me in about 15 minutes, peak memory consumption is about 7GB during A* Search, after that less.

Attachments (1)

r.terraflow.diff (1.9 KB) - added by neteler 10 years ago.
Updated again patch for STREAM_DIR parameter usage

Download all attachments as: .zip

Change History (24)

comment:1 Changed 11 years ago by neteler

I got trapped by the same problem yesterday and suggest to use the .tmp in the mapset as default. Then it is in the HOME directory which is typically easier to manage.

comment:2 Changed 10 years ago by neteler

attached patch uses dirname(G_tempfile()) to create tmp dir in mapset instead of /var/tmp/.

Markus

comment:3 in reply to:  2 Changed 10 years ago by glynn

Replying to neteler:

attached patch uses dirname(G_tempfile()) to create tmp dir in mapset instead of /var/tmp/.

It should use either the directory specified by the stream_dir= option or $TMPDIR, so that the user can choose a directory on a partition which has enough space.

comment:4 Changed 10 years ago by neteler

Fixed patch uploaded, please test/comment.

comment:5 Changed 10 years ago by hamish

the bugfix part of it looks good:

-flowStream = new AMI_STREAM<waterWindowBaseType>("/var/tmp/flowStream"); 
+flowStream = new AMI_STREAM<waterWindowBaseType>(streamdir->answer);

For the change of the default placement from /var/tmp/ to $MAPSET/.tmp/, I don't think using G_tempfile() in the option definition section is allowed (ie before G_gisinit() has run), as it could end in G_fatal_error() which requires that. Also the call as written would leave behind an empty tempfile each time you ran the module, even if just --help or --html-description.

if it is changed away from /var/tmp/, docs would need updating too, but I'm not sure if it is a good idea to change the default behaviour in the stable branch.

It has been left in a state "contrary to GRASS policy" due to the large file sizes. there was previous discussion on the mailing list about that some years ago.

tip: if you increase the memory= option to (e.g.) 75% of your system RAM, the needed tempfile space goes down a lot.

Hamish

comment:6 in reply to:  4 Changed 10 years ago by mmetz

Replying to neteler:

Fixed patch uploaded, please test/comment.

r.terraflow is now using grass tmp dir and works, but still not respecting STREAM_DIR option.

comment:7 Changed 10 years ago by neteler

New patch uploaded with bugfix only (so back to /var/tmp/ default). The bug is AFAIK here:

grass/branches/develbranch_6/raster/r.terraflow/main.cc#L179

The assignment to opt seems to fail:

  opt->mem = atoi(mem->answer);
  opt->streamdir = streamdir->answer;
  G_debug(0,"DEBUG streamdir %s",streamdir->answer);
  G_debug(0,"DEBUG opt %s",opt->streamdir);

leads to

GRASS 6.5.svn (spearfish60):> r.terraflow elev=elevation.10m \
    filled=elevation10m.filled     dir=elevation10m.mfdir \
    swatershed=elevation10m.watershed     accumulation=elevation10m.accu \
    tci=elevation10m.tci d8cut=500 memory=800  stats=elevation10mstats.txt\
    STREAM_DIR=/tmp  --o
...
D0/0: DEBUG streamdir /var/tmp/
D0/0: DEBUG opt /var/tmp/
...

Confusing... Interestingly, if I use (note the d typo):

r.terraflow elev= ... STREAM_dDIR=/tmp

the module does not fail. Apparently parse_args() fails?

comment:8 Changed 10 years ago by neteler

Interestingly, when I change

@@ -128,6 +128,6 @@
   /* temporary STREAM path */
   struct Option *streamdir;
   streamdir = G_define_option() ;
-  streamdir->key        = "STREAM_DIR";
+  streamdir->key        = "stream_dir";
   streamdir->type       = TYPE_STRING;

it works!

Question: change to lowercase which breaks the user interface (but uppercase fails)?

comment:9 Changed 10 years ago by neteler

Fixed in GRASS 7, r36632 (stream_dir is already lowercase there). Remains open for GRASS 6.

comment:10 in reply to:  9 ; Changed 10 years ago by mmetz

Replying to neteler:

Fixed in GRASS 7, r36632 (stream_dir is already lowercase there). Remains open for GRASS 6.

Nice!

Just a few comments:

/var/tmp does not exist on Windows, so this is not really a good choice for default temp directory.

It is ok to call G_tempfile() in

streamdir->answer     = G_store(dirname(G_tempfile()));

because this happens in r.terraflow after G_gisinit() is called, not before.

AFAIKT, G_tempfile() only returns a filename, a file is not created (the description of G_tempfile() says so).

Therefore I would opt to change to standard grass policy and use dirname(G_tempfile()) as suggested first by Markus N, because that directory exists always, independent of the OS.

Increasing the memory option to e.g. 75% doesn't really help for massive grids (the main aim of r.terraflow) unless there are much more than 8GB of RAM available (e.g. a cluster with 64GB RAM). Most users don't use a cluster and probably have systems with something between 2GB and 8GB RAM; in these cases the stream_dir is needed, with enough disk space.

Last comment: any reason why STREAMTMP_DIR is set system-wide and not with G_putenv()? Code in question is here:

grass/branches/develbranch_6/raster/r.terraflow/main.cc#L503

Markus M

comment:11 in reply to:  5 ; Changed 10 years ago by adanner

Replying to hamish:

the bugfix part of it looks good:

-flowStream = new AMI_STREAM<waterWindowBaseType>("/var/tmp/flowStream"); 
+flowStream = new AMI_STREAM<waterWindowBaseType>(streamdir->answer);

Hmm. /var/tmp/flowStream is a file but streamdir->answer is supposed to be a directory. "flowStream" should be appended to streamdir->answer. Can you use strcat or is there a G_? function for that?

comment:12 in reply to:  11 Changed 10 years ago by glynn

Replying to adanner:

Hmm. /var/tmp/flowStream is a file but streamdir->answer is supposed to be a directory. "flowStream" should be appended to streamdir->answer. Can you use strcat or is there a G_? function for that?

We normally use sprintf(), e.g.:

    char path[GPATH_MAX];
    ...
    sprintf(path, "%s/flowStream", streamdir->answer);

Changed 10 years ago by neteler

Attachment: r.terraflow.diff added

Updated again patch for STREAM_DIR parameter usage

comment:13 Changed 10 years ago by neteler

Patch updated once more.

comment:14 in reply to:  10 ; Changed 10 years ago by hamish

Replying to neteler:

Question: change to lowercase which breaks the user interface (but uppercase fails)?

that's very weird.

Fixed in GRASS 7, r36632 (stream_dir is already lowercase there). Remains open for GRASS 6.

since it was always broken in the past, maybe it doesn't hurt. On the other hand, changing it will break scripts (which will now work as expected instead of nooping).

Replying to mmetz:

/var/tmp does not exist on Windows, so this is not really a good choice for default temp directory.

that is a problem. If it were decided to keep it out of $MAPSET/.tmp/ we could use a compiler macro for C:\Temp\Grass\, like we do for /dev/null and NUL:.

ISTR the output files are of some interest after the module has ended. is this the case? if so, putting them deep in a hidden .tmp/ dir isn't very nice.

otherwise the lack of /var/tmp/ on MS-win makes a compelling case for moving it to $M/.tmp/ and short-circuiting this discussion.

It is ok to call G_tempfile() in

streamdir->answer     = G_store(dirname(G_tempfile()));

because this happens in r.terraflow after G_gisinit() is called, not before.

sorry, that should have read "before G_parser() has run", not G_gisinit().

see: http://trac.osgeo.org/grass/browser/grass/trunk/lib/gis/parser.c#L724

"The only functions which can legitimately be called before G_parser() are: ..."

AFAIKT, G_tempfile() only returns a filename, a file is not created (the description of G_tempfile() says so).

don't trust the comments, test it. from past experience I think it creates a file. (at least the g.tempfile module does).

Increasing the memory option to e.g. 75% doesn't really help for massive grids (the main aim of r.terraflow) unless there are much more than 8GB of RAM available

I am not sure, but I have the memory that it was not a 1:1 trade off, a bit more memory meant much less disk space? (untested) what I do remember is that when I had do do a real world run with a massive grid, I had to crank that up until I didn't hit the LFS limit of my 32bit system, which in turn made it run faster. I might have also had to compromise on the resolution that time, but I don't think so as I don't like doing that & so would have remembered :)

Last comment: any reason why STREAMTMP_DIR is set system-wide and not with G_putenv()? Code in question is here:

http://trac.osgeo.org/grass/browser/grass/branches/develbranch_6/raster/r.terraflow/main.cc#L503

an environmental variable can not escape its process to the rest of the system (can not export to parent or siblings, only self and children). anyway, the code at lib/gis/putenv.c shows that G_putenv() is just a wrapper incase putenv() doesn't exist on that system.

Hamish

comment:15 in reply to:  14 Changed 10 years ago by mmetz

Replying to hamish:

Replying to mmetz:

/var/tmp does not exist on Windows, so this is not really a good choice for default temp directory.

that is a problem. If it were decided to keep it out of $MAPSET/.tmp/ we could use a compiler macro for C:\Temp\Grass\, like we do for /dev/null and NUL:.

ISTR the output files are of some interest after the module has ended. is this the case? if so, putting them deep in a hidden .tmp/ dir isn't very nice.

All temporary files are deleted upon successful termination, I checked. There is also a warning to remove any temp files in case of error:

THESE INTERMEDIATE STREAMS WILL NOT BE DELETED IN CASE OF ABNORMAL TERMINATION OF THE PROGRAM. TO SAVE SPACE PLEASE DELETE THESE FILES MANUALLY!

otherwise the lack of /var/tmp/ on MS-win makes a compelling case for moving it to $M/.tmp/ and short-circuiting this discussion.

It is ok to call G_tempfile() ...

sorry, that should have read "before G_parser() has run", not G_gisinit().

see: http://trac.osgeo.org/grass/browser/grass/trunk/lib/gis/parser.c#L724

OK, then leave it blank and after G_parser() use G_tempfile() or something else that works on all platforms if no answer was given?

AFAIKT, G_tempfile() only returns a filename, a file is not created (the description of G_tempfile() says so).

don't trust the comments, test it. from past experience I think it creates a file. (at least the g.tempfile module does).

"don't trust the comments", uff. Checking every library function if it behaves as described in the comments is a bit uncomfortable when using these functions in some code. In this case, g.tempfile calls the same function like G_tempfile() and then, after that, actually creates a temp file. So I haven't tested it, but according to the code it looks like G_tempfile() is indeed only returning a file name without creating a file.

Increasing the memory option to e.g. 75% doesn't really help for massive grids (the main aim of r.terraflow) unless there are much more than 8GB of RAM available

I am not sure, but I have the memory that it was not a 1:1 trade off, a bit more memory meant much less disk space? (untested) what I do remember is that when I had do do a real world run with a massive grid, I had to crank that up until I didn't hit the LFS limit of my 32bit system, which in turn made it run faster. I might have also had to compromise on the resolution that time, but I don't think so as I don't like doing that & so would have remembered :)

The manual says that 2x80 bytes per cell are needed as disk space, no mentioning of more memory meaning less disk space needed. It would help if one of the authors of r.terraflow and/or the iostream library could comment on that.

Last comment: any reason why STREAMTMP_DIR is set system-wide and not with G_putenv()?

an environmental variable can not escape its process to the rest of the system (can not export to parent or siblings, only self and children). anyway, the code at lib/gis/putenv.c shows that G_putenv() is just a wrapper incase putenv() doesn't exist on that system.

That is cosmetics anyway, and I think I meant G_getenv/G_setenv, not G_putenv. And I don't know much about environment variables, it was essentially about why not using lib/gis functions for cross-platform compatibility.

Markus M

comment:16 in reply to:  14 Changed 10 years ago by glynn

Replying to hamish:

/var/tmp does not exist on Windows, so this is not really a good choice for default temp directory.

that is a problem. If it were decided to keep it out of $MAPSET/.tmp/ we could use a compiler macro for C:\Temp\Grass\, like we do for /dev/null and NUL:.

A function, e.g. G_temp_dir() would be more useful.

And we really ought to separate out the intended use of G_tempfile() (atomic updates) from its common use (arbitrary temporary files).

ISTR the output files are of some interest after the module has ended. is this the case? if so, putting them deep in a hidden .tmp/ dir isn't very nice.

otherwise the lack of /var/tmp/ on MS-win makes a compelling case for moving it to $M/.tmp/ and short-circuiting this discussion.

Note that the temporary directory also includes the hostname, in case the database is shared via NFS. PIDs are only unique per-host, so a PID alone doesn't guarantee uniqueness for networked filesystems.

It is ok to call G_tempfile() in

streamdir->answer     = G_store(dirname(G_tempfile()));

because this happens in r.terraflow after G_gisinit() is called, not before.

sorry, that should have read "before G_parser() has run", not G_gisinit().

see: http://trac.osgeo.org/grass/browser/grass/trunk/lib/gis/parser.c#L724

"The only functions which can legitimately be called before G_parser() are: ..."

AFAIKT, G_tempfile() only returns a filename, a file is not created (the description of G_tempfile() says so).

Ideally, you shouldn't assume that the "GRASS environment" ($GISRC, VAR, WIND, etc) is valid until after G_parser() has returned. If you want context-dependent defaults, leave ->answer as NULL then assign the value after G_parser() has returned.

Lots of code violates this rule (e.g. modules which use the DBMI typically fill in default values for the database options), but we should be eliminating this rather than adding to it.

comment:17 Changed 10 years ago by hamish

Keywords: r.terraflow tmp added

maybe part of this bug was fixed by r37295 (mid May). From July 2008 G_parser() in 6.4 and 6.5 was silently doing nasty things to option settings in memory for options with upper case letters in them. This went mostly unnoticed in r.terraflow (although some weirdness was noted) as that option is seldom used and it is the only module with an upper case option. a number of addon module users got bitten by this in their own scripts to worse effect.

Hamish

comment:18 in reply to:  17 ; Changed 10 years ago by mmetz

Replying to hamish:

maybe part of this bug was fixed by r37295 (mid May). From July 2008 G_parser() in 6.4 and 6.5 was silently doing nasty things to option settings in memory for options with upper case letters in them.

Not respecting a user-specified STREAM_DIR option was apparently a bug of G_parser(), not r.terraflow. The default answer of /var/tmp or another system TEMP folder is IMHO not a good idea because of lack of space. Why not leaving it blank, if no answer is given, use the GRASS temp directory? That directory should always exist and I guess there is most of the time enough space.

Markus M

comment:19 in reply to:  18 ; Changed 10 years ago by hamish

Replying to mmetz:

Not respecting a user-specified STREAM_DIR option was apparently a bug of G_parser(), not r.terraflow.

actually it was hardcoded, MN just fixed it: r38256, L584

the G_parser() memory bug possibly made the earlier debugging effort very confusing because if the option was left blank the last given value was used anyway(?)

The default answer of /var/tmp or another system TEMP folder is IMHO not a good idea because of lack of space. Why not leaving it blank, if no answer is given, use the GRASS temp directory? That directory should always exist and I guess there is most of the time enough space.

this has been discussed before, check the archives. ISTR that the Terraflow team (Andrew? Laura?) explained why they used that a long time ago (I now forget exact reason), but in the more recent discussion using /home/mapset/.tmp/ had some popular support. Note that these temp files may be >2gig (use the memory= option to lower those needs), and they may be potentially useful(?) for inspection/archive/debugging/reuse later, so keeping them completely hidden may not be desired.

The current %TEMP% solution on ms-windows seems reasonable to me. I do not feel too strongly about the choice of /tmp/, /var/tmp/, or G_tempfile() for GRASS 7 on unix.

Just that if we use a G_tempfile() solution we would be best to have a new G_tempdir() function instead of using possibly non-portable filesystem tricks for that.

you can read some of my earlier bitching and moaning about that here: http://thread.gmane.org/gmane.comp.gis.grass.devel/17750/focus=17762

regards, Hamish

comment:20 in reply to:  19 ; Changed 10 years ago by mmetz

Replying to hamish:

Replying to mmetz:

The default answer of /var/tmp or another system TEMP folder is IMHO not a good idea because of lack of space. Why not leaving it blank, if no answer is given, use the GRASS temp directory? That directory should always exist and I guess there is most of the time enough space.

this has been discussed before, check the archives. ISTR that the Terraflow team (Andrew? Laura?) explained why they used that a long time ago (I now forget exact reason),

Picking a directory whose contents are deleted automatically if I remember right, which would be the case for $MAPSET/.tmp/$MACHINE.

Whatever directory is used as default in r.terraflow, it should at least exist on all supported platforms.

Note that these temp files may be >2gig

I get regularly >20GB, having to work on somewhat larger datasets. Thinking about Doug's datasets, it would go way above 100GB.

(use the memory= option to lower those needs), and they may be potentially useful(?) for inspection/archive/debugging/reuse later, so keeping them completely hidden may not be desired.

These files are deleted upon successful completion of the module using the iostream lib, nothing left to inspect. They are probably only useful for the iostream lib developers for debugging. Same for the segment lib, intermediate files are probably only useful for developers working on the segment lib. I don't see a reason why files created by the iostream library should be treated differently to files created by the segment library, or vice versa, when it comes to their storage location.

This discussion is now moving away from r.terraflow towards where to store large temporary files, maybe this ticket can be closed and the discussion be continued somewhere else if needed?

Markus M

comment:21 in reply to:  20 Changed 10 years ago by neteler

Replying to mmetz:

Replying to hamish:

Replying to mmetz: ISTR that the Terraflow team (Andrew? Laura?) explained why they used that a long time ago (I now forget exact reason),

Most likely because r.terraflow wasn't originally written for GRASS/was written also as stand-alone version.

Picking a directory whose contents are deleted automatically if I remember right, which would be the case for $MAPSET/.tmp/$MACHINE.

Perfect. I assume that 99% of the users will never inspect any tmp file of r.terraflow. So that is a perfect default. If a specialist wants to inspect, then define a different directory +in that case*.

Note that these temp files may be >2gig

I get regularly >20GB, having to work on somewhat larger datasets. Thinking about Doug's datasets, it would go way above 100GB.

Yes, and I always forget to look in /var/tmp/ for undeleted rubbish (but check instead my HOME for delete candidates). Having auto-deletion of tmp stuff left by interrupted/crashed r.terraflow runs when leaving GRASS would be a major advantage for most users.

(use the memory= option to lower those needs), and they may be potentially useful(?) for inspection/archive/debugging/reuse later, so keeping them completely hidden may not be desired.

As said, who wants to analyse these files is also able to change a default setting.

These files are deleted upon successful completion of the module using the iostream lib, nothing left to inspect. They are probably only useful for the iostream lib developers for debugging. Same for the segment lib, intermediate files are probably only useful for developers working on the segment lib. I don't see a reason why files created by the iostream library should be treated differently to files created by the segment library, or vice versa, when it comes to their storage location.

+1

This discussion is now moving away from r.terraflow towards where to store large temporary files, maybe this ticket can be closed and the discussion be continued somewhere else if needed?

No, still /var/tmp/ is the default also in GRASS 7. This should be changed.

MarkusN

Markus M

comment:22 Changed 7 years ago by neteler

Milestone: 6.4.06.4.3
Version: svn-develbranch6svn-releasebranch64

We still have /var/tmp/ as default answer instead of $MAPSET/.tmp/$MACHINE. This should be updated (some g.parser magic will be needed).

comment:23 in reply to:  22 Changed 7 years ago by mmetz

Resolution: fixed
Status: newclosed

Replying to neteler:

We still have /var/tmp/ as default answer instead of $MAPSET/.tmp/$MACHINE. This should be updated (some g.parser magic will be needed).

Fixed in r52749/50.

Markus M

Note: See TracTickets for help on using tickets.