Opened 6 years ago

Closed 3 years ago

Last modified 3 years ago

#2045 closed enhancement (fixed)

r.to.vect: use less memory

Reported by: mlennert
Owned by: grass-dev@…
Priority: normal
Milestone: 7.0.5
Component: Raster
Version: svn-trunk
Keywords: r.to.vect
Cc:
CPU: Unspecified
Platform: Unspecified

Description

Trying to convert a raster file to vector areas on a machine with 10GB, the process was killed after having used up all memory and swap.

The CELL raster in question is an output of i.segment, is quite large and has many small segments, i.e. many different raster values:

Rows: 53216
Columns: 49184
Total Cells:  2617375744

Although many of these cells are null:

total null cells: 2061717280
non-null cells: 555658464

It would be nice if r.to.vect could handle such large files without using up so much memory.
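
As a quick sanity check on the figures above, the following sketch recomputes the cell counts and the share of non-null cells. The 4-byte size per CELL value is an assumption about the decoded in-memory representation; nothing here claims that r.to.vect loads the whole raster at once.

```python
# Figures reported in the ticket description.
ROWS = 53216
COLS = 49184
NULL_CELLS = 2061717280
NON_NULL_CELLS = 555658464

# Assumption: a decoded CELL value occupies 4 bytes in memory.
BYTES_PER_CELL = 4

total_cells = ROWS * COLS
non_null_share = NON_NULL_CELLS / total_cells
decoded_gib = total_cells * BYTES_PER_CELL / 2**30

print(f"total cells:    {total_cells}")
print(f"non-null share: {non_null_share:.1%}")
print(f"decoded size:   {decoded_gib:.2f} GiB")
```

Only about a fifth of the cells carry values, so most of the cost is likely driven by the number of distinct segments rather than by the raw raster size.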

Change History (26)

comment:1 in reply to:  description ; Changed 6 years ago by mmetz

Replying to mlennert:

Trying to convert a raster file to vector areas on a machine with 10GB, the process was killed after having used up all memory and swap.

There was a memory leak in r.to.vect, fixed in trunk r57281.

comment:2 in reply to:  1 ; Changed 6 years ago by mlennert

Replying to mmetz:

Replying to mlennert:

Trying to convert a raster file to vector areas on a machine with 10GB, the process was killed after having used up all memory and swap.

There was a memory leak in r.to.vect, fixed in trunk r57281.

(and r57281)

Thanks! I now get through the "Extracting areas..." part without the process being killed. I still see a continuous increase in memory usage, though, up to 88.5% at the end of the extraction stage. Is this normal?

Now it's busy writing areas...

comment:3 in reply to:  2 ; Changed 6 years ago by mlennert

Replying to mlennert:

Replying to mmetz:

Replying to mlennert:

Trying to convert a raster file to vector areas on a machine with 10GB, the process was killed after having used up all memory and swap.

There was a memory leak in r.to.vect, fixed in trunk r57281.

(and r57281)

Thanks! I now get through the "Extracting areas..." part without the process being killed. I still see a continuous increase in memory usage, though, up to 88.5% at the end of the extraction stage. Is this normal?

Now it's busy writing areas...

And memory usage continued to increase. The process was killed during the "Registering primitives..." step.

comment:4 in reply to:  3 ; Changed 6 years ago by mmetz

Replying to mlennert:

Replying to mlennert:

Replying to mmetz:

Replying to mlennert:

Trying to convert a raster file to vector areas on a machine with 10GB, the process was killed after having used up all memory and swap.

There was a memory leak in r.to.vect, fixed in trunk r57281.

(and r57281)

Thanks! I now get through the "Extracting areas..." part without the process being killed. I still see a continuous increase in memory usage, though, up to 88.5% at the end of the extraction stage. Is this normal?

Now it's busy writing areas...

And memory usage continued to increase. The process was killed during the "Registering primitives..." step.

You can try to set the environment variable GRASS_VECTOR_LOWMEM before running r.to.vect. GRASS_VECTOR_LOWMEM reduces the amount of memory used by vector topology structures (the spatial index is built on disk).

IOW, r.to.vect might use quite a bit of memory which is difficult to change, and the vector topology structures also need memory (in g7 much less than in g6).
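
A minimal sketch of the suggestion above: set GRASS_VECTOR_LOWMEM in the environment and then call r.to.vect. The map names `segments` and `segments_vect` are hypothetical, and the module is only executed when the script runs inside an active GRASS session (detected here by the GISBASE variable); otherwise the command is just printed.

```python
import os
import subprocess


def lowmem_env(base_env=None):
    """Return a copy of the environment with GRASS_VECTOR_LOWMEM set."""
    env = dict(os.environ if base_env is None else base_env)
    env["GRASS_VECTOR_LOWMEM"] = "1"
    return env


def r_to_vect_cmd(raster, vector, feature_type="area"):
    """Build the r.to.vect command line for the given map names."""
    return ["r.to.vect", f"input={raster}", f"output={vector}", f"type={feature_type}"]


# Hypothetical map names, for illustration only.
cmd = r_to_vect_cmd("segments", "segments_vect")
env = lowmem_env()
print(" ".join(cmd))

# Only run the module inside an active GRASS session.
if "GISBASE" in os.environ:
    subprocess.run(cmd, env=env, check=True)
```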

comment:5 in reply to:  4 Changed 6 years ago by mlennert

Replying to mmetz:

Replying to mlennert:

Replying to mlennert:

Replying to mmetz:

Replying to mlennert:

Trying to convert a raster file to vector areas on a machine with 10GB, the process was killed after having used up all memory and swap.

There was a memory leak in r.to.vect, fixed in trunk r57281.

(and r57281)

Thanks! I now get through the "Extracting areas..." part without the process being killed. I still see a continuous increase in memory usage, though, up to 88.5% at the end of the extraction stage. Is this normal?

Now it's busy writing areas...

And memory usage continued to increase. The process was killed during the "Registering primitives..." step.

You can try to set the environment variable GRASS_VECTOR_LOWMEM before running r.to.vect. GRASS_VECTOR_LOWMEM reduces the amount of memory used by vector topology structures (the spatial index is built on disk).

IOW, r.to.vect might use quite a bit of memory which is difficult to change, and the vector topology structures also need memory (in g7 much less than in g6).

So you think the continuous increase up to 88.5% during the area extraction is normal ? I'll try the GRASS_VECTOR_LOWMEM option on Monday.

Just brainstorming:

I'm not sure I completely understand the module's workings, but IIUC, it systematically goes through rows and columns and checks whether pixel values change. But does that mean it reads all areas into memory before writing them? Would it be feasible / of interest to write each area immediately after its boundaries have been identified and then free the memory again?

I could even imagine a solution where the module works with segments of the map, writes out the areas identified in a segment, then goes on to the next segment, and at the end uses a v.dissolve equivalent to merge all neighboring areas with identical values. Unless the v.dissolve process takes a lot of memory, I could imagine that at least the first part could work with low memory consumption, couldn't it?

comment:6 Changed 5 years ago by neteler

Keywords: memory removed

Another leak was fixed in r.to.vect (r58795). Can you please retry?

comment:7 Changed 5 years ago by Madi

Hi,

reported also here: http://lists.osgeo.org/pipermail/grass-user/2015-April/072315.html

I tried to vectorize (r.to.vect) a very large raster: Total Cells: 4817298170

The option GRASS_VECTOR_LOWMEM slowed down the computation, however eventually it yielded Killed0000. The topology was not built correctly.

Not sure if this is a bug report rather than enhancement wish?

comment:8 Changed 5 years ago by martin

Hah, I tried to vectorize all of NLCD2011, which is 16832104560 cells in size, in one go and found this to be a highly impracticable approach. The process (GRASS 7.1 trunk of late 2014) ran out of memory on a machine with more than 100 GB of main memory, and GRASS_VECTOR_LOWMEM didn't make any substantial difference.
Finally I decided to cut the job into many small tiles of approx. 33M cells each (which represents approx. 2x2 degrees in North American AEA), and this turned out to work pretty well, not only in the vectorization but also in the subsequent vector post-processing stage.

See here for a sample of the result:

http://mapserver.flightgear.org/map/?lon=-99.93544&lat=34.62565&zoom=6
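
The tiling idea described in this comment can be sketched as follows. This is not the poster's actual script: the raster dimensions are borrowed from the ticket description, and the tile edge of 5750 cells is an assumption chosen so that a full tile holds roughly 33 million cells.

```python
def tile_windows(rows, cols, tile_rows, tile_cols):
    """Yield (row_start, row_end, col_start, col_end) windows covering the raster.

    End indices are exclusive; edge tiles are smaller when the tile size
    does not divide the raster dimensions evenly.
    """
    for r0 in range(0, rows, tile_rows):
        for c0 in range(0, cols, tile_cols):
            yield r0, min(r0 + tile_rows, rows), c0, min(c0 + tile_cols, cols)


# Dimensions from the ticket description; tile edge is an assumption.
ROWS, COLS = 53216, 49184
TILE_EDGE = 5750  # 5750 * 5750 = 33,062,500 cells per full tile

tiles = list(tile_windows(ROWS, COLS, TILE_EDGE, TILE_EDGE))
covered = sum((r1 - r0) * (c1 - c0) for r0, r1, c0, c1 in tiles)
print(f"{len(tiles)} tiles, {covered} cells covered")
```

Each window could then be used to set the computational region before vectorizing that tile; how the per-tile results are merged afterwards is left open here.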

comment:9 Changed 4 years ago by martinl

Milestone: 7.0.0 → 7.0.5

comment:10 Changed 3 years ago by mlennert

A colleague who was trying to vectorize roughly 13 million segments (output of i.segment) just ran into the same problem of memory saturation during the topology creation phase.

The tiling approach is interesting, we will look into that, but it would be nice if a way could be found to make the whole r.to.vect process more efficient. I have no idea how, though...

comment:11 in reply to:  10 ; Changed 3 years ago by mmetz

Replying to mlennert:

A colleague who was trying to vectorize roughly 13 million segments (output of i.segment) just ran into the same problem of memory saturation during the topology creation phase.

As a workaround, you could use r.to.vect -b and then build topology separately with v.build.

The tiling approach is interesting, we will look into that,

See also the GRASS addon r.to.vect.tiled [0]

but it would be nice if a way could be found to make the whole r.to.vect process more efficient. I have no idea how, though...

The module already writes out a boundary as soon as it has been completed and attempts to free memory used to track this boundary. I suspect more serious memory leaks in r.to.vect during the extraction phase. Even though the module tries to free all memory used during extraction, it seems that some allocated memory is no longer accessible. I have an idea where to look for the memory leak (update_list()), but it will not be a quick fix.

[0] https://grass.osgeo.org/grass70/manuals/addons/r.to.vect.tiled.html
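
The workaround mentioned in this comment (extract without topology, then build topology in a separate step) might look like the sketch below. The map names are hypothetical, and the commands are only executed inside an active GRASS session; otherwise they are just printed.

```python
import os
import subprocess


def two_step_commands(raster, vector):
    """Return the extraction command (no topology) and the separate v.build command."""
    extract = ["r.to.vect", "-b", f"input={raster}", f"output={vector}", "type=area"]
    build = ["v.build", f"map={vector}"]
    return [extract, build]


# Hypothetical map names, for illustration only.
steps = two_step_commands("segments", "seg_full")
for step in steps:
    print(" ".join(step))

# Only run the modules inside an active GRASS session.
if "GISBASE" in os.environ:
    for step in steps:
        subprocess.run(step, check=True)
```

Splitting the work this way means the memory used during extraction is released when r.to.vect exits, before v.build starts.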

comment:12 in reply to:  11 ; Changed 3 years ago by mmetz

Replying to mmetz:

The module already writes out a boundary as soon as it has been completed and attempts to free memory used to track this boundary.

The attempt was unsuccessful, also for lines, fixed in r68716. According to valgrind, I do not see any more memory leaks in r.to.vect, and all memory used in the extraction part is freed before topology building begins. Please test, preferably with r.to.vect -b.

comment:13 in reply to:  12 ; Changed 3 years ago by mlennert

Replying to mmetz:

Replying to mmetz:

The module already writes out a boundary as soon as it has been completed and attempts to free memory used to track this boundary.

The attempt was unsuccessful, also for lines, fixed in r68716.

That helps a lot !

Without (i.e. grass70) my machine became unusable for a while and then the module execution stopped after a bit more than 5 minutes because of too little memory. With the fix (grass73) it went all the way to 100% of writing areas (17 minutes on my machine), but then I got an error "Category index is not up to date".

I used r.to.vect -vb as I want to preserve category values of the raster file.

However, my colleague's problem was not in the vectorization stage, but in the topology building stage. I'm currently testing with -vt and GRASS_VECTOR_LOWMEM=1. It's been running for about 1 h. I'll see tomorrow what the result is...

I'll also try with -vb, followed by v.build.

comment:14 in reply to:  13 ; Changed 3 years ago by mmetz

Replying to mlennert:

Without (i.e. grass70) my machine became unusable for a while and then the module execution stopped after a bit more than 5 minutes because of too little memory. With the fix (grass73) it went all the way to 100% of writing areas (17 minutes on my machine), but then I got an error "Category index is not up to date".

The -v and -b flags are mutually exclusive. The -v flag is disabled for certain input/flag combinations, but this test was missing. Added in trunk r68720.

However, my colleague's problem was not in the vectorization stage, but in the topology building stage. I'm currently testing with -vt and GRASS_VECTOR_LOWMEM=1. It's been running for about 1 h. I'll see tomorrow what the result is...

The topology building stage might need a lot of memory. If a lot of memory is still in use because of a memory leak in the extraction stage, the topology building stage is more likely to fail because of an out-of-memory error. There is no guarantee that topology building will succeed, but the chances for success are higher now that the (hopefully last) memory leak in the extraction stage has been fixed. If topology building still fails with an out-of-memory error, try again with the environment variable GRASS_VECTOR_LOWMEM set.

comment:15 in reply to:  14 ; Changed 3 years ago by mlennert

Replying to mmetz:

Replying to mlennert:

Without (i.e. grass70) my machine became unusable for a while and then the module execution stopped after a bit more than 5 minutes because of too little memory. With the fix (grass73) it went all the way to 100% of writing areas (17 minutes on my machine), but then I got an error "Category index is not up to date".

The -v and -b flags are mutually exclusive. The -v flag is disabled for certain input/flag combinations, but this test was missing. Added in trunk r68720.

Ah, ok. Why is topology necessary for category values ? Is it because areas are only defined via topological connection between centroids and boundaries ?

This is a pity for us: if you want to use, for example, i.segment results both as raster and vector, you need to keep the link between the two via the segment ids. As attribute table handling really slows things down, -vt is the preferred way to create a vector map with polygons that have the same category as the segment ids in the raster maps...

However, my colleague's problem was not in the vectorization stage, but in the topology building stage. I'm currently testing with -vt and GRASS_VECTOR_LOWMEM=1. It's been running for about 1 h. I'll see tomorrow what the result is...

The topology building stage might need a lot of memory. If a lot of memory is still in use because of a memory leak in the extraction stage, the topology building stage is more likely to fail because of an out-of-memory error. There is no guarantee that topology building will succeed, but the chances for success are higher now that the (hopefully last) memory leak in the extraction stage has been fixed. If topology building still fails with an out-of-memory error, try again with the environment variable GRASS_VECTOR_LOWMEM set.

I did use GRASS_VECTOR_LOWMEM=1, but the process was stopped. No explicit memory error, but AFAIK such stopping of a process happens in case of memory errors:

Registering primitives...
63282475 primitives registered
397657832 vertices registered
Building areas...
Killed

real    106m12.306s
user    28m2.664s
sys     68m42.260s

Next try will be to use v.build on the vector created by r.to.vect -b...

comment:16 in reply to:  15 ; Changed 3 years ago by mlennert

Replying to mlennert:

Replying to mmetz:

Replying to mlennert:

Without (i.e. grass70) my machine became unusable for a while and then the module execution stopped after a bit more than 5 minutes because of too little memory. With the fix (grass73) it went all the way to 100% of writing areas (17 minutes on my machine), but then I got an error "Category index is not up to date".

The -v and -b flags are mutually exclusive. The -v flag is disabled for certain input/flag combinations, but this test was missing. Added in trunk r68720.

Ah, ok. Why is topology necessary for category values ? Is it because areas are only defined via topological connection between centroids and boundaries ?

This is a pity for us: if you want to use, for example, i.segment results both as raster and vector, you need to keep the link between the two via the segment ids. As attribute table handling really slows things down, -vt is the preferred way to create a vector map with polygons that have the same category as the segment ids in the raster maps...

However, my colleague's problem was not in the vectorization stage, but in the topology building stage. I'm currently testing with -vt and GRASS_VECTOR_LOWMEM=1. It's been running for about 1 h. I'll see tomorrow what the result is...

The topology building stage might need a lot of memory. If a lot of memory is still in use because of a memory leak in the extraction stage, the topology building stage is more likely to fail because of an out-of-memory error. There is no guarantee that topology building will succeed, but the chances for success are higher now that the (hopefully last) memory leak in the extraction stage has been fixed. If topology building still fails with an out-of-memory error, try again with the environment variable GRASS_VECTOR_LOWMEM set.

I did use GRASS_VECTOR_LOWMEM=1, but the process was stopped. No explicit memory error, but AFAIK such stopping of a process happens in case of memory errors:

Registering primitives...
63282475 primitives registered
397657832 vertices registered
Building areas...
Killed

real    106m12.306s
user    28m2.664s
sys     68m42.260s

Next try will be to use v.build on the vector created by r.to.vect -b...

> export GRASS_VECTOR_LOWMEM=1
> time v.build seg_full
Building topology for vector map
<seg_full@rtovecttest>...
Registering primitives...
63282475 primitives registered
397657832 vertices registered
Building areas...
Killed

real    84m57.309s
user    19m8.872s
sys     54m37.868s

:-(

I, therefore, do not have any means to vectorize this file directly in one piece on my computer. Memory usage goes up steadily through v.build.

I'll try r.to.vect.tiled.

I'll also try to run v.build through valgrind to see if anything shows up.

comment:17 in reply to:  15 Changed 3 years ago by mmetz

Replying to mlennert:

Replying to mmetz:

Replying to mlennert:

Without (i.e. grass70) my machine became unusable for a while and then the module execution stopped after a bit more than 5 minutes because of too little memory. With the fix (grass73) it went all the way to 100% of writing areas (17 minutes on my machine), but then I got an error "Category index is not up to date".

The -v and -b flags are mutually exclusive. The -v flag is disabled for certain input/flag combinations, but this test was missing. Added in trunk r68720.

Ah, ok. Why is topology necessary for category values ? Is it because areas are only defined via topological connection between centroids and boundaries ?

The error is that the "Category index is not up to date". This category index is built together with vector topology and allows fast searching for feature categories. I have changed r.to.vect in trunk r68740 such that this category index is no longer required in order to populate the attribute table.

This is a pity for us: if you want to use, for example, i.segment results both as raster and vector, you need to keep the link between the two via the segment ids. As attribute table handling really slows things down, -vt is the preferred way to create a vector map with polygons that have the same category as the segment ids in the raster maps...

This is now working. If you only want areas to have the same category as the input regions / clusters, you do not need an attribute table and can use r.to.vect -vt.

However, my colleague's problem was not in the vectorization stage, but in the topology building stage. I'm currently testing with -vt and GRASS_VECTOR_LOWMEM=1. It's been running for about 1 h. I'll see tomorrow what the result is...

The topology building stage might need a lot of memory. If a lot of memory is still in use because of a memory leak in the extraction stage, the topology building stage is more likely to fail because of an out-of-memory error. There is no guarantee that topology building will succeed, but the chances for success are higher now that the (hopefully last) memory leak in the extraction stage has been fixed. If topology building still fails with an out-of-memory error, try again with the environment variable GRASS_VECTOR_LOWMEM set.

I did use GRASS_VECTOR_LOWMEM=1, but the process was stopped. No explicit memory error, but AFAIK such stopping of a process happens in case of memory errors:

Registering primitives...
63282475 primitives registered
397657832 vertices registered
Building areas...
Killed

real    106m12.306s
user    28m2.664s
sys     68m42.260s

That could also happen if there is "no space left on device" (disk full error).

comment:18 in reply to:  16 ; Changed 3 years ago by mmetz

Replying to mlennert:

Replying to mlennert:

I did use GRASS_VECTOR_LOWMEM=1, but the process was stopped. No explicit memory error, but AFAIK such stopping of a process happens in case of memory errors:

Registering primitives...
63282475 primitives registered
397657832 vertices registered
Building areas...
Killed

real    106m12.306s
user    28m2.664s
sys     68m42.260s

Next try will be to use v.build on the vector created by r.to.vect -b...

> export GRASS_VECTOR_LOWMEM=1
> time v.build seg_full
Building topology for vector map
<seg_full@rtovecttest>...
Registering primitives...
63282475 primitives registered
397657832 vertices registered
Building areas...
Killed

real    84m57.309s
user    19m8.872s
sys     54m37.868s

Strange. Topology building should be much slower with GRASS_VECTOR_LOWMEM, therefore I would expect v.build with GRASS_VECTOR_LOWMEM to fail later, not earlier. Can you provide data and commands for testing?

:-(

I, therefore, do not have any means to vectorize this file directly in one piece on my computer. Memory usage goes up steadily through v.build.

I'll try r.to.vect.tiled.

This should also fail if you want r.to.vect.tiled to patch the tiles together.

comment:19 in reply to:  18 ; Changed 3 years ago by mlennert

Replying to mmetz:

Replying to mlennert:

> export GRASS_VECTOR_LOWMEM=1
> time v.build seg_full
Building topology for vector map
<seg_full@rtovecttest>...
Registering primitives...
63282475 primitives registered
397657832 vertices registered
Building areas...
Killed

real    84m57.309s
user    19m8.872s
sys     54m37.868s

Strange. Topology building should be much slower with GRASS_VECTOR_LOWMEM, therefore I would expect v.build with GRASS_VECTOR_LOWMEM to fail later, not earlier. Can you provide data and commands for testing?

Data and info provided via private mail.

Previous posters have also reported that setting GRASS_VECTOR_LOWMEM didn't make much of a difference, so maybe the problem is there ?

comment:20 in reply to:  19 ; Changed 3 years ago by mmetz

Replying to mlennert:

Data and info provided via private mail.

I have received the data, thanks!

Previous posters have also reported that setting GRASS_VECTOR_LOWMEM didn't make much of a difference, so maybe the problem is there ?

I don't think so. Without GRASS_VECTOR_LOWMEM, v.build needs about 26 GB of RAM. With GRASS_VECTOR_LOWMEM, v.build needs about 11 GB of RAM but is about 4.5 times slower, which is expected because some temporary data are kept on disk and not in memory. File sizes are 6.4 GB for coor (the actual vector features), 3.1 GB for topo (main vector topology), 11 GB for sidx (spatial index) and 922 MB for cidx (category index).
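
To put the reported figures side by side, the sketch below sums the file sizes and computes how much RAM the low-memory mode saves. The numbers are taken from the comment as reported (in GB); the variable names are illustrative only.

```python
# Figures quoted in the comment, in GB as reported.
FILE_SIZES_GB = {"coor": 6.4, "topo": 3.1, "sidx": 11.0, "cidx": 0.922}
RAM_DEFAULT_GB = 26.0
RAM_LOWMEM_GB = 11.0

total_on_disk_gb = sum(FILE_SIZES_GB.values())
ram_saved_gb = RAM_DEFAULT_GB - RAM_LOWMEM_GB
ram_saved_share = ram_saved_gb / RAM_DEFAULT_GB

print(f"vector files on disk: {total_on_disk_gb:.3f} GB")
print(f"RAM saved by LOWMEM:  {ram_saved_gb:.1f} GB ({ram_saved_share:.0%})")
```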

comment:21 in reply to:  20 ; Changed 3 years ago by mlennert

Replying to mmetz:

Replying to mlennert:

Data and info provided via private mail.

I have received the data, thanks!

Previous posters have also reported that setting GRASS_VECTOR_LOWMEM didn't make much of a difference, so maybe the problem is there ?

I don't think so. Without GRASS_VECTOR_LOWMEM, v.build needs about 26 GB of RAM. With GRASS_VECTOR_LOWMEM, v.build needs about 11 GB of RAM but is about 4.5 times slower, which is expected because some temporary data are kept on disk and not in memory.

Why does it still need 11GB of RAM when working with disk-based data ? Is this really unavoidable ?

GRASS has always had a tradition of making it possible to work with limited resources. Even though RAM is cheap, I would still consider > 11GB as rather high-end in many offices. I wouldn't mind if it makes the command much slower, but it would be nice if it could get the job done with less RAM usage...

But I have nothing to propose, so I'll leave it at that.

comment:22 in reply to:  21 Changed 3 years ago by mmetz

Replying to mlennert:

Replying to mmetz:

Replying to mlennert:

Data and info provided via private mail.

I have received the data, thanks!

Previous posters have also reported that setting GRASS_VECTOR_LOWMEM didn't make much of a difference, so maybe the problem is there ?

I don't think so. Without GRASS_VECTOR_LOWMEM, v.build needs about 26 GB of RAM. With GRASS_VECTOR_LOWMEM, v.build needs about 11 GB of RAM but is about 4.5 times slower, which is expected because some temporary data are kept on disk and not in memory.

Why does it still need 11GB of RAM when working with disk-based data ? Is this really unavoidable ?

Technically, this is not unavoidable. Main topology and the category index could also be managed on disk, but this is not easy to implement. You could ask for enhancements for GRASS 8.

GRASS has always had a tradition of making it possible to work with limited resources.

With the exception of vector topology which requires a lot of resources. RAM consumption for vector topology has been substantially reduced in GRASS 7 compared to GRASS 6, independent of the GRASS_VECTOR_LOWMEM switch.

Even though RAM is cheap, I would still consider > 11GB as rather high-end in many offices. I wouldn't mind if it makes the command much slower, but it would be nice if it could get the job done with less RAM usage...

Something for GRASS 8+. Also, a vector with >17 million areas is rather unusual. For example, this vector can not be exported as a shapefile because "The Shapefile format explicitly uses 32bit offsets" and I estimate the *.shp file to be >10GB.
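
The shapefile limitation can be illustrated with a short calculation. This sketch assumes that offsets are signed 32-bit integers counting 16-bit words, as described in the ESRI shapefile specification; some tools apply a stricter 2 GB limit in practice. The 10 GB figure is the estimate from the comment, taken here as 10 * 10**9 bytes.

```python
WORD_BYTES = 2          # offsets and lengths are counted in 16-bit words
MAX_INT32 = 2**31 - 1   # largest signed 32-bit value

# Upper bound on the .shp size that such an offset can address.
MAX_SHP_BYTES = MAX_INT32 * WORD_BYTES


def fits_in_shp(size_bytes):
    """Return True if a .shp file of this size stays within the 32-bit offset range."""
    return size_bytes <= MAX_SHP_BYTES


# ">10GB" from the comment, taken as 10 * 10**9 bytes for illustration.
estimated_shp_bytes = 10 * 10**9
print(f"addressable:   {MAX_SHP_BYTES / 2**30:.2f} GiB")
print(f"estimate fits: {fits_in_shp(estimated_shp_bytes)}")
```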

comment:23 Changed 3 years ago by mmetz

Resolution: fixed
Status: new → closed

In 68786:

r.to.vect: sync to trunk, fix #2045

comment:24 Changed 3 years ago by mmetz

In 68787:

r.to.vect: sync to trunk, fix #2045

comment:25 in reply to:  20 ; Changed 3 years ago by mlennert

Replying to mmetz:

Replying to mlennert:

Data and info provided via private mail.

I have received the data, thanks!

Previous posters have also reported that setting GRASS_VECTOR_LOWMEM didn't make much of a difference, so maybe the problem is there ?

I don't think so. Without GRASS_VECTOR_LOWMEM, v.build needs about 26 GB of RAM. With GRASS_VECTOR_LOWMEM, v.build needs about 11 GB of RAM

A question that came up in a discussion with colleagues: could this memory need be determined before building topology and possibly check for enough available memory / warn the user ?

comment:26 in reply to:  25 Changed 3 years ago by mmetz

Replying to mlennert:

Replying to mmetz:

Replying to mlennert:

Data and info provided via private mail.

I have received the data, thanks!

Previous posters have also reported that setting GRASS_VECTOR_LOWMEM didn't make much of a difference, so maybe the problem is there ?

I don't think so. Without GRASS_VECTOR_LOWMEM, v.build needs about 26 GB of RAM. With GRASS_VECTOR_LOWMEM, v.build needs about 11 GB of RAM

A question that came up in a discussion with colleagues: could this memory need be determined before building topology and possibly check for enough available memory / warn the user ?

Unfortunately not, because the number of features is unknown when topology building starts. What could be done is to count primitives without building topology and then estimate memory requirements just for these primitives. The number of areas cannot be estimated.
