Opened 8 years ago

Last modified 5 years ago

#3166 new enhancement

Parallelization with tiling for grass.script

Reported by: wenzeslaus Owned by: grass-dev@…
Priority: normal Milestone: 7.8.3
Component: Python Version: unspecified
Keywords: script, parallel Cc:
CPU: Unspecified Platform: Unspecified

Description

At the same time as r69507, I was working on a simpler approach based on grass.script. At this point, it can do tiling and patching of rasters and, partially, 3D rasters. A series of commands is executed on each tile. Both this and the patching run in parallel. The syntax is very similar to grass.script, with a convenience function for tile naming (which is partially the user's responsibility).

Here is a tiling example:

# this is the control object
tiled_workflow = TiledWorkflow(nprocs=4, width=500, height=500, overlap=10)
for namer, workflow in tiled_workflow:
    slope = namer.name('raster', 'slope')
    aspect = namer.name('raster', 'aspect')

    # now do everything as usual; workflow is the equivalent of `grass.script`
    workflow.run_command('r.slope.aspect', elevation='fractal_surface',
                         slope=slope)
    workflow.run_command('r.slope.aspect', elevation='fractal_surface',
                         aspect=aspect)
    workflow.parse_command('g.region', flags='pg')

# nothing was actually done till now
# do the parallel processing and patching
results = tiled_workflow.execute()

# iterate over the results (here from g.region)
for result in results:
    for key, value in result.items():
        print(key, value)

Here is an example using a much smaller portion of the API. It creates a list of module calls which are then executed in the background. When the import of the parallel module fails, grass.script is used instead without any changes in the main part of the code.

import grass.script as gs

try:
    from grass.script.parallel import ModuleCallList, execute_by_module
    call = ModuleCallList()
    parallel = True
except ImportError:
    call = gs  # fall back to grass.script
    parallel = False
for i in range(map_min, map_max + 1):
    call.mapcalc(expr, num=i)
if parallel:
    execute_by_module(call, nprocs=4)
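The idea of recording calls first and executing them later can be sketched roughly as follows. This is only an illustration, not the code from parallel.py: `CallList`, `run_one`, and `execute_all` are hypothetical names, plain Python subprocesses stand in for GRASS modules, and a thread pool from concurrent.futures is used to keep the sketch short.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


class CallList:
    """Hypothetical stand-in for ModuleCallList: records calls, runs nothing."""

    def __init__(self):
        self.calls = []

    def run_command(self, program, *args):
        # Only remember what should be run; execution happens later.
        self.calls.append([program, *args])


def run_one(command):
    """Run one recorded command and return its stripped standard output."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.stdout.strip()


def execute_all(call_list, nprocs=4):
    """Run the recorded commands, at most nprocs at a time, in recorded order."""
    with ThreadPoolExecutor(max_workers=nprocs) as pool:
        return list(pool.map(run_one, call_list.calls))


# The loop body looks like ordinary sequential code, but nothing runs yet.
calls = CallList()
for i in range(3):
    calls.run_command(sys.executable, "-c", f"print({i} * {i})")

# Only now are the commands executed, two at a time.
outputs = execute_all(calls, nprocs=2)
```

The point of the design is that the loop stays the same whether the object records the calls or runs them immediately, which is what makes the grass.script fallback above possible.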

The current code uses r.mapcalc for patching and PyGRASS code for computing the tiles. One of the main issues with the current code is that it does not finish when there is an error in the executed module.
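For illustration, computing tiles with an overlap could look roughly like the sketch below. It is not the PyGRASS code mentioned above: `tile_bounds` is a hypothetical helper that works in raster cells (rows and columns) rather than in region coordinates, and it assumes each tile is simply grown by the overlap on every side and clipped to the raster extent.

```python
def tile_bounds(rows, cols, height, width, overlap=0):
    """Yield (row_start, row_end, col_start, col_end) for each tile.

    Ends are exclusive. Each tile is grown by `overlap` cells on every side
    and clipped to the raster extent.
    """
    for row in range(0, rows, height):
        for col in range(0, cols, width):
            yield (
                max(row - overlap, 0),
                min(row + height + overlap, rows),
                max(col - overlap, 0),
                min(col + width + overlap, cols),
            )


# A 1000 x 1200 cell raster split into 500 x 500 tiles with a 10 cell overlap.
tiles = list(tile_bounds(rows=1000, cols=1200, height=500, width=500, overlap=10))
```

The last column of tiles is narrower than the requested width because 1200 is not a multiple of 500. A real implementation would also need to drop the overlap again when patching.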

Attachments (1)

parallel.py (12.2 KB ) - added by wenzeslaus 8 years ago.
Prototype of parallelization with tiling for grass.script

Change History (19)

by wenzeslaus, 8 years ago

Attachment: parallel.py added

Prototype of parallelization with tiling for grass.script

comment:1 by mlennert, 8 years ago

Wow, this looks great !

Could you just explain the relation / difference between this and the GridModule in pygrass ?

comment:2 by huhabla, 8 years ago

I think we can adapt your implementation to use the MultiModule/ParallelModuleQueue approach. The MultiModule class supports the execution of a stack of any GRASS modules in a temporary region environment. Hence instead of implementing a different module executor, you can use pygrass Module and MultiModule to define the processing. Use the ParallelModuleQueue to run the stacks in parallel. You have access to all executed modules and can investigate errors, stdout, stderr and input/output options.

Hence the TiledWorkflow class would accept MultiModule objects and would use the ParallelModuleQueue internally to run the module stacks in parallel. What do you think?

in reply to:  1 comment:3 by wenzeslaus, 8 years ago

Replying to mlennert:

Wow, this looks great !

Note also what is in PyGRASS. Especially after r69507.

Could you just explain the relation

Partial duplication. As in the case of grass.script and grass.pygrass.modules.

difference between this and the GridModule in pygrass ?

  • This supports a list of modules (a workflow) which are executed sequentially on the given tile.
  • The user needs to prepare the list in a for loop (as opposed to not using any for loop). This is because it is in fact derived from the non-tiled parallelization API, which is more general, so the user in fact loops over what needs to be done (with or without parallelization).
  • Related to that, the API for simple parallel processing, parallel processing of a series of maps, and tiled parallel processing of a series of maps is the same.
  • The user needs to "help" the functions and objects by providing them with some metadata, i.e. types and names of the maps to patch (for patching), because no module run metadata are available in grass.script.
  • Some of the execution details are lost, e.g. only the last command's textual output is preserved.
  • GridModule uses separate mapsets for individual tiles, this uses WIND_OVERRIDE.
  • GridModule uses PyGRASS ctypes wrappers for patching, this uses a (potentially huge) expression for r.mapcalc and r3.mapcalc.
  • The interface is like grass.script, not like grass.pygrass.modules.
  • It is not complete.
  • The implementation is now 300 lines. MultiModule alone has 200.
  • It uses the Pool.map_async function from multiprocessing (which may be the cause of the problem with interrupting).

comment:4 by huhabla, 8 years ago

IMHO, the for-loop to setup the processing commands for the TiledWorkflow can be avoided when using the PyGRASS Module and MultiModule approach. PyGRASS Module objects allow altering the input and output settings before running, so that the TiledWorkflow class could take care of the tile names, altering the user pre-configured Module objects. The user simply initiates the Modules that should be used with the original raster names. The PyGRASS Module supports a deep copy operation to clone existing Module objects, hence the TiledWorkflow can create any number of copies and replace the raster names with tile names.

Please have a look at the PyGRASS Module initialization in t.rast.neighbors: https://trac.osgeo.org/grass/browser/grass/trunk/temporal/t.rast.neighbors/t.rast.neighbors.py#L135

Cloning and adding to the parallel queue: https://trac.osgeo.org/grass/browser/grass/trunk/temporal/t.rast.neighbors/t.rast.neighbors.py#L168
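The clone-and-rename idea can be sketched independently of GRASS roughly as follows. Note that `ModuleSpec` and `clone_for_tile` are hypothetical stand-ins written for this illustration, not the PyGRASS Module class or its API, and the `_tile_N` suffix is an assumed naming scheme.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ModuleSpec:
    """Hypothetical stand-in for a pre-configured module (not the PyGRASS class)."""

    name: str
    inputs: dict = field(default_factory=dict)
    outputs: dict = field(default_factory=dict)


def clone_for_tile(template, tile_id):
    """Deep-copy the template and give its output maps tile-specific names."""
    clone = copy.deepcopy(template)
    clone.outputs = {
        option: f"{map_name}_tile_{tile_id}"
        for option, map_name in clone.outputs.items()
    }
    return clone


# The user configures the module once, with the original raster names.
template = ModuleSpec(
    "r.slope.aspect",
    inputs={"elevation": "fractal_surface"},
    outputs={"slope": "slope"},
)

# The workflow then creates one renamed copy per tile.
clones = [clone_for_tile(template, tile_id) for tile_id in range(3)]
```

Because each clone is a deep copy, changing its names does not affect the template or the other clones.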

Cite:

The implementation is now 300 lines. MultiModule alone has 200

Well it is not much "Code". The doctests and the description of MultiModule are more than 100 lines. ;)

in reply to:  4 comment:5 by wenzeslaus, 8 years ago

Yes, I would like to reconcile the two APIs or implementations (or both). At this point, I still see too many differences.

Replying to huhabla:

IMHO, the for-loop to setup the processing commands for the TiledWorkflow can be avoided when using the PyGRASS Module and MultiModule approach.

The API with for-loop is actually based on the case where the user wants the for loop like this one:

for i in range(0, 5):
    gs.run_command('r.module', num=i)
    gs.mapcalc(expr, num=i)

I had code like this and I wanted to parallelize the individual loop runs, which are independent. So I came up with the following API, which does not change much in the main part of the code:

workflow = SeriesWorkflow()  # currently called ModuleCallList
for i in range(0, 5):
    workflow.run_command('r.module', num=i)
    workflow.mapcalc(expr, num=i)
workflow.execute()

The Python functions I used in the background have some problems with interrupting and failed subprocesses, but they handle a pool of subprocesses well, so that the given number of processes is always running (there can be one really slow process while the others keep running in the meantime).
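One possible way to keep a fixed-size pool busy while still noticing failed subprocesses is sketched below. This is an assumption-laden illustration rather than what parallel.py does: it uses a thread pool from concurrent.futures instead of Pool.map_async, `run_all` is a hypothetical helper, and plain Python subprocesses stand in for GRASS modules.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_all(commands, nprocs=4):
    """Run commands in a pool; return (outputs, failures) keyed by command index."""
    outputs, failures = {}, {}
    with ThreadPoolExecutor(max_workers=nprocs) as pool:
        futures = {
            pool.submit(
                subprocess.run, cmd, capture_output=True, text=True, check=True
            ): index
            for index, cmd in enumerate(commands)
        }
        # Results are handled as they finish, so a slow command blocks nobody.
        for future in as_completed(futures):
            index = futures[future]
            try:
                outputs[index] = future.result().stdout.strip()
            except subprocess.CalledProcessError as error:
                # A failed command is recorded instead of stalling the run.
                failures[index] = error.returncode
    return outputs, failures


commands = [
    [sys.executable, "-c", "print('ok')"],
    [sys.executable, "-c", "import sys; sys.exit(3)"],
    [sys.executable, "-c", "print('also ok')"],
]
outputs, failures = run_all(commands, nprocs=2)
```

With `check=True`, a non-zero exit status raises an exception in the worker, and that exception is re-raised by `future.result()` in the main thread, where it can be reported.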

Then I had a different case, where I didn't have any loop but I needed the tiling. The following API emerged from that:

tiled_workflow = TiledWorkflow(width=100, height=100)
for namer, workflow in tiled_workflow:
    name = namer.name('rast', 'output')  # 'output' is a placeholder base name
    workflow.run_command('r.module', num=name)
    workflow.mapcalc(expr, num=name)
tiled_workflow.execute()

This was of course before r69507, but the reasons for a similar API are still there because the non-tiled workflow just has the loop anyway (if desired). One argument against the current TiledWorkflow would actually be that we want the API to be different from the case where the loop is actually desired by the user.

PyGRASS Module objects allow altering the input and output settings before running, so that the TiledWorkflow class could take care of the tile names, altering the user pre-configured Module objects. The user simply initiates the Modules that should be used with the original raster names.

The user (at least me) uses variables anyway. In the SeriesWorkflow case, the user names the outputs as needed because all are preserved. With TiledWorkflow, the variables need to be assigned with the help of the TiledWorkflow, so some work is required but not that much.

The PyGRASS Module supports a deep copy operation to clone existing Module objects, hence the TiledWorkflow can create any number of copies and replace the raster names with tile names.

I don't think it is as simple as replacing the names, which is of course possible only with PyGRASS, not grass.script. The naming step in TiledWorkflow simply adds maps for patching. This has the potential to handle r.mapcalc expressions as well as some basename usages like those in r.texture. I don't have this implemented, but the user could also exclude some outputs from patching and mark them for removal instead.

The implementation is now 300 lines. MultiModule alone has 200

Well it is not much "Code". The doctests and the description of MultiModule are more than 100 lines. ;)

Right. I guess my point is that parallel.py mostly relies on higher-level functions from Python multiprocessing and on grass.script, which is itself simple. Furthermore, parallel.py is more than just TiledWorkflow, although that's the longest and most complicated part. The design of parallel.py is to cover as many cases as possible with minimal code, and the cost is that the user needs to do something special from time to time, like the naming step for TiledWorkflow or the use of what are essentially wrapper functions instead of the real ones (applies to both SeriesWorkflow and TiledWorkflow). However, I think that MultiModule and others are much more robust at this point. parallel.py's only hope for being robust is that it is simple enough to become robust one day.

I hope this clarifies a little bit more where I'm coming from. I know I was not specific in that private email a week ago.

comment:6 by mlennert, 8 years ago

I think there is definitely a place for parallel processing functions in grass.script and yours look really great !

[alert: simplification] In my limited observations and own experience I have the feeling that grass.script caters well to the casual, generally functional, scientific programmer who just wants to glue together a specific workflow, whereas pygrass is much more pythonic and, therefore, caters more to those that have a pythonic, more object-oriented, way of thinking.

I guess the question is whether there might be enough common basis between the two implementations to not duplicate, but rather use one as the backend of the other, with the long-term idea of code maintainability ?

comment:7 by mlennert, 8 years ago

What is the status of this ? Is there any documentation outside this ticket ?

in reply to:  7 comment:8 by wenzeslaus, 8 years ago

Replying to mlennert:

What is the status of this ? Is there any documentation outside this ticket ?

For the parallel grass.pygrass API, there are the comments in the code which go to:

For the grass.script.parallel, it's just this ticket, example in the attached code, and usage in r3.count.categories. I haven't had a chance to finalize it or write formal tests, so I don't want to commit the code yet without that.

comment:9 by neteler, 7 years ago

Milestone: 7.4.0 → 7.4.1

Ticket retargeted after milestone closed

comment:10 by neteler, 7 years ago

Milestone: 7.4.1 → 7.4.2

comment:11 by martinl, 6 years ago

Milestone: 7.4.2 → 7.6.0

All enhancement tickets should be assigned to 7.6 milestone.

comment:12 by martinl, 6 years ago

Milestone: 7.6.0 → 7.6.1

Ticket retargeted after milestone closed

comment:13 by martinl, 6 years ago

Milestone: 7.6.1 → 7.6.2

Ticket retargeted after milestone closed

comment:14 by martinl, 6 years ago

Milestone: 7.6.2 → 7.8.0

It seems that grass.script.parallel is not even part of trunk.

grass7_trunk/lib/python/script$ ls *.py
array.py  core.py  db.py  __init__.py  raster3d.py  raster.py  setup.py  task.py  utils.py  vector.py

comment:15 by neteler, 5 years ago

Milestone: 7.8.0 → 7.8.1

Ticket retargeted after milestone closed

comment:16 by neteler, 5 years ago

Milestone: 7.8.1 → 7.8.2

Ticket retargeted after milestone closed

comment:17 by neteler, 5 years ago

Milestone: 7.8.2

Ticket retargeted after milestone closed

comment:18 by neteler, 5 years ago

Milestone: 7.8.3