Opened 7 years ago

Closed 7 years ago

#3198 closed defect (fixed)

r.stats.quantile: hardcoded max number of categories in base map

Reported by: mlennert Owned by: grass-dev@…
Priority: normal Milestone: 7.2.1
Component: Raster Version: unspecified
Keywords: r.stats.quantile MAX_CATS Cc:
CPU: Unspecified Platform: Unspecified

Description

r.stats.quantile limits the number of categories the base map can have to 1000 through a MAX_CATS variable.

Is there any specific reason for this? I would like to use r.stats.quantile in i.segment.stats to calculate percentiles per segment, but the number of segments can be much higher than 1000.

Classifying this as a "bug" for now...

Change History (4)

in reply to:  description ; comment:1 by glynn, 7 years ago

Replying to mlennert:

Is there any specific reason for this? I would like to use r.stats.quantile in i.segment.stats to calculate percentiles per segment, but the number of segments can be much higher than 1000.

The limit was added so that if someone tries to use a base map with a million categories, it just fails quickly, rather than attempting something which will either exhaust memory or take days to run.

For each category in the base map, it allocates a basecat structure, each of which references several dynamically-allocated arrays. The .slots and .slot_bins arrays are sized based upon the bins= option; the .values array is sized to hold all of the values falling into any bin containing a quantile; the .quants and .bins arrays are sized according to the number of quantiles.
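To make the scaling concrete, here is a rough back-of-the-envelope estimate of that per-category allocation. It is only a sketch: the function name and all element sizes (8-byte slot counters, 4-byte slot-to-bin indices, 8-byte values, 32-byte bin records) are assumptions chosen for illustration, not figures taken from the module's source.

```python
def estimate_basecat_bytes(num_cats, num_slots, num_quants, values_per_cat,
                           slot_bytes=8, slot_bin_bytes=4, value_bytes=8,
                           quant_bytes=8, bin_bytes=32):
    """Rough memory estimate; every element size here is an assumption."""
    per_cat = (
        num_slots * slot_bytes          # .slots: sized by the bins= option
        + num_slots * slot_bin_bytes    # .slot_bins: sized by the bins= option
        + values_per_cat * value_bytes  # .values: cells in quantile-bearing bins
        + num_quants * quant_bytes      # .quants: one entry per quantile
        + num_quants * bin_bytes        # .bins: one record per quantile
    )
    return num_cats * per_cat


# 1000 categories with bins=1000 and a single quantile (values ignored)
small = estimate_basecat_bytes(1000, 1000, 1, 0)
# the same settings with a million categories
large = estimate_basecat_bytes(1_000_000, 1000, 1, 0)
print(small, large)
```

Under these assumed sizes the fixed per-category cost is dominated by the two arrays sized by bins=, so the total grows linearly with the number of categories.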

As well as the memory consumption, almost all processing is per-category.

Having said that, more categories will tend to result in less data per category. However, there are some non-trivial per-category overheads. On the other hand, sorting the bins containing quantiles should be faster overall with more bins but proportionally less data in each bin.

There's no fundamental reason why the limit can't be raised; or even abolished, if you don't mind an unsuitable choice of base map resulting in "unable to allocate" errors, or just taking forever. Consider putting a limit on num_cats*num_slots; a map with many categories should presumably require fewer bins (assuming that the data isn't concentrated into a handful of categories).
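A check on num_cats*num_slots could look roughly like the sketch below. The function name, the exception type and the budget of 100 million slots are all hypothetical; the actual threshold would have to be chosen by the module maintainers.

```python
def check_slot_budget(num_cats, num_slots, max_total_slots=100_000_000):
    """Fail fast if categories x bins exceeds a budget (hypothetical value)."""
    total = num_cats * num_slots
    if total > max_total_slots:
        raise ValueError(
            f"{num_cats} categories x {num_slots} bins = {total} slots "
            f"exceeds the limit of {max_total_slots}; use fewer bins"
        )
    return total


# 1000 categories with 1000 bins stays within the assumed budget
print(check_slot_budget(1000, 1000))

# a million categories is still fine if the number of bins is reduced
print(check_slot_budget(1_000_000, 100))
```

Compared with a fixed cap on categories, this keeps the fail-fast behaviour while still allowing a base map with many categories, provided the number of bins is lowered accordingly.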

in reply to:  1 ; comment:2 by mlennert, 7 years ago

Replying to glynn:

Replying to mlennert:

Is there any specific reason for this? I would like to use r.stats.quantile in i.segment.stats to calculate percentiles per segment, but the number of segments can be much higher than 1000.

The limit was added so that if someone tries to use a base map with a million categories, it just fails quickly, rather than attempting something which will either exhaust memory or take days to run.

For each category in the base map, it allocates a basecat structure, each of which references several dynamically-allocated arrays. The .slots and .slot_bins arrays are sized based upon the bins= option; the .values array is sized to hold all of the values falling into any bin containing a quantile; the .quants and .bins arrays are sized according to the number of quantiles.

As well as the memory consumption, almost all processing is per-category.

Having said that, more categories will tend to result in less data per category. However, there are some non-trivial per-category overheads. On the other hand, sorting the bins containing quantiles should be faster overall with more bins but proportionally less data in each bin.

There's no fundamental reason why the limit can't be raised; or even abolished, if you don't mind an unsuitable choice of base map resulting in "unable to allocate" errors, or just taking forever.

A warning was maintained. At least the user is made aware and can stop the module.

Consider putting a limit on num_cats*num_slots; a map with many categories should presumably require fewer bins (assuming that the data isn't concentrated into a handful of categories).

In r69776 MarkusM introduced dynamic bins, although I don't really understand what this means ;-).

More generally: the man page of r.stats.quantile lacks information about its parameters, notably the bins= option. A short paragraph explaining how the module works would be useful.

in reply to:  2 comment:3 by mmetz, 7 years ago

Replying to mlennert:

Replying to glynn:

[...]

There's no fundamental reason why the limit can't be raised; or even abolished, if you don't mind an unsuitable choice of base map resulting in "unable to allocate" errors, or just taking forever.

A warning was maintained. At least the user is made aware and can stop the module.

FWIW, I tested with more than a million categories in the base map and the module finished within 19 seconds (on an old laptop).

Consider putting a limit on num_cats*num_slots; a map with many categories should presumably require fewer bins (assuming that the data isn't concentrated into a handful of categories).

In r69776 MarkusM introduced dynamic bins, although I don't really understand what this means ;-).

For example, if there are only 10 cells for a given basemap category, it does not make sense to allocate 1000 bins for that category; a single bin is sufficient. With many basemap categories and only a few values for each category, memory consumption can be reduced by 90%, i.e. to 10% of what the previous version of r.stats.quantile needed. Still, with many basemap categories and many cells per category, the module will be slow and will need a lot of memory.
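The idea can be illustrated with a small sketch that picks the number of bins from the number of cells in a category. The rule used here (one bin per 10 cells, capped at the bins= value) and the function name are assumptions made for the example; this is not necessarily the logic implemented in r69776.

```python
def bins_for_category(cell_count, max_bins=1000, cells_per_bin=10):
    """Pick a bin count for one category (illustrative rule, an assumption)."""
    # at least one bin, at most max_bins, roughly one bin per cells_per_bin cells
    return max(1, min(max_bins, cell_count // cells_per_bin))


# 10 cells: a single bin instead of 1000
print(bins_for_category(10))
# 250 cells: 25 bins
print(bins_for_category(250))
# a million cells: capped at the bins= value
print(bins_for_category(1_000_000))
```

With a rule of this kind, a base map made of many small categories allocates only a few bins per category, which is where the memory savings come from.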

comment:4 by mlennert, 7 years ago

Resolution: fixed
Status: new → closed

Closing this as the original issue seems to be fixed. As the solution is not a simple bug fix, it should probably not go into 7.2.
