Opened 9 years ago

Closed 9 years ago

#985 closed task (fixed)

[raster] ST_Count

Reported by: Bborie Park Owned by: Bborie Park
Priority: medium Milestone: PostGIS 2.0.0
Component: raster Version: master
Keywords: history Cc:

Description

This function calls ST_SummaryStats and only returns the count from that function.

  1. ST_Count(rast raster, nband int, hasnodata boolean) -> integer

returns the count as an integer

nband: index of band

hasnodata: if FALSE, any pixel who's value is nodata is ignored

ST_Count(rast, 1, FALSE)
  1. ST_Count(rast raster, nband int) -> integer

assumes hasnodata = TRUE

ST_Count(rast, 2)
  1. ST_Count(rast raster, hasnodata boolean) -> integer

assumes nband = 1

ST_Count(rast, TRUE)
  1. ST_Count(rast raster) -> integer

assumes nband = 1 and hasnodata = TRUE

ST_Count(rast)

The set of ST_ApproxCount functions are:

  1. ST_ApproxCount(rast raster, nband int, hasnodata boolean, sample_percent double precision) -> integer

sample_percent: a value between 0 and 1 indicating the percentage of the raster band's pixels to consider

ST_ApproxCount(rast, 3, FALSE, 0.1)

ST_ApproxCount(rast, 1, TRUE, 0.5)
  1. ST_ApproxCount(rast raster, nband int, sample_percent double precision) -> integer

assumes that nband = 1

ST_ApproxCount(rast, 2 0.01)

ST_ApproxCount(rast, 4, 0.025)
  1. ST_ApproxCount(rast raster, hasnodata boolean, sample_percent double precision) -> integer

assumes that nband = 1

ST_ApproxCount(rast, FALSE, 0.01)

ST_ApproxCount(rast, TRUE, 0.025)
  1. ST_ApproxCount(rast raster, sample_percent double precision) -> integer

assumes that nband = 1 and hasnodata = TRUE

ST_ApproxCount(rast, 0.25)
  1. ST_ApproxCount(rast raster) -> integer

assumes that nband = 1, hasnodata = TRUE and sample_percent = 0.1

ST_ApproxCount(rast)

The following functions are provided for coverage tables.

  1. ST_Count(rastertable text, rastercolumn text, nband int, hasnodata boolean) -> integer

rastertable: name of table with raster column

rastercolumn: name of column of data type raster

ST_Count('tmax_2010', 'rast', 1, FALSE)

ST_Count('precip_2011', 'rast', 1, TRUE)
  1. ST_Count(rastertable text, rastercolumn text, nband int) -> integer

hasnodata = TRUE

ST_Count('tmax_2010', 'rast', 1)
  1. ST_Count(rastertable text, rastercolumn text, hasnodata boolean) -> integer

nband = 1

ST_Count('precip_2011', 'rast', TRUE)
  1. ST_Count(rastertable text, rastercolumn text) -> integer

nband = 1 and hasnodata = TRUE

ST_Count('tmin_2009', 'rast')

Variations for ST_ApproxCount are:

  1. ST_ApproxCount(rastertable text, rastercolumn text, nband int, hasnodata boolean, sample_percent double precision) -> integer
ST_ApproxCount('tmax_2010', 'rast', 1, FALSE, 0.5)

ST_ApproxCount('precip_2011', 'rast', 1, TRUE, 0.2)
  1. ST_ApproxCount(rastertable text, rastercolumn text, nband int, sample_percent double precision) -> integer

hasnodata = TRUE

ST_ApproxCount('tmax_2010', 'rast', 1, 0.5)

ST_ApproxCount('precip_2011', 'rast', 1, 0.2)
  1. ST_ApproxCount(rastertable text, rastercolumn text, hasnodata boolean, sample_percent double precision) -> integer

nband = 1

ST_ApproxCount('tmax_2010', 'rast', FALSE, 0.5)

ST_ApproxCount('precip_2011', 'rast', TRUE, 0.2)
  1. ST_ApproxCount(rastertable text, rastercolumn text, sample_percent double precision) -> integer

nband = 1 and hasnodata = TRUE

ST_ApproxCount('tmax_2010', 'rast', 0.5)

ST_ApproxCount('precip_2011', 'rast', 0.2)
  1. ST_ApproxCount(rastertable text, rastercolumn text) -> integer

nband = 1, hasnodata = TRUE and sample_percent = 0.1

ST_ApproxCount('tmax_2010', 'rast')

ST_ApproxCount('precip_2011', 'rast')

Change History (29)

comment:1 Changed 9 years ago by Bborie Park

Owner: changed from pracine to Bborie Park
Status: newassigned

Added in r7246.

comment:2 Changed 9 years ago by Bborie Park

Resolution: fixed
Status: assignedclosed

comment:3 Changed 9 years ago by robe

Keywords: history added
Milestone: PostGIS Raster FuturePostGIS 2.0.0

Guys isn't the semantics of this function inconsistent with semantics of ST_Value.

Shouldn't hasnodata = true mean you want to also count pixels that have no data?

comment:4 Changed 9 years ago by robe

Resolution: fixed
Status: closedreopened

I was also thinking this terminology is mighty confusing. I cna't tell if I'm coming or going. Can't we just call the second argument include_nodata (if its true then nodata values get included and if false then nodata values do not get included)

comment:5 Changed 9 years ago by Bborie Park

Actually, the information in this ticket's description for hasnodata is incorrect. The correct description for hasnodata should be "If false, pixels with nodata values are included in the count". Pierre explained the meaning of hasnodata to me in an email...

"The doc is correct and consequent with ST_Intersects. If "hasnodata" is set to TRUE it means we consider the raster HAS nodata and we take them into account. Maybe here is the source of your confusion: Taking nodata value into account means IGNORING them (that's really confusing) since they are no data. Not taking them into account means "We do like if there was NO nodata values" and hence we do not take them into account in the computations.

There is a difference between a) "Take the nodata value set into account" and b) "Take the pixel nodata values into account.

When a) is TRUE b) is FALSE...

We are generally speaking about a) in our discussions and this is the meaning of the hasnodata parameter.

Confusing enough?

If you think of a better way to handle this, let us know."

I do like "include_nodata" as the argument name is easy to understand in the context of the function.

comment:6 Changed 9 years ago by Bborie Park

Sigh. It looked like the description of "hasnodata" is wrong in various spots in the working specs. I've fixed that too. Sorry for the stale and misleading information.

comment:7 in reply to:  5 Changed 9 years ago by pracine

I do like "include_nodata" as the argument name is easy to understand in the context of the function.

I would like "include_nodata" too if it would also mean something in the context of other functions like ST_Intersects(). But no...

To make it brief: "hasdodata = TRUE means the raster HAS nodata values, so the function MUST DEAL with those nodata values in the proper way (ignore them). If hasdodata is FALSE, is should consider nodata values as any normal values."

Up to now this is still the most simple way i have found to specify this which work everywhere.

comment:8 Changed 9 years ago by robe

Yah I prefer include_nodata. I think the hasnodata is too confusing and used in one too many ways.

As I recall I think there was a bit flag that would tell you if a raster had nodata so that was hasnodata (then we changed this to just check if bandnodata is null) and if you knew that you would know you could use the envelope for intersect checks instead of the convex hull and then we started to use it for more than a fast check too which is when it became really confusing. So by saying hasnodata you were overriding the behavior of what the default raster told you it had and if you didn't pass it in, it meant read it from the meta data (which ends up being the same thing if you think about it)

So the default meant read the hasnodata property from the raster meta data and setting it to anything meant override whatever it said to (hasnodata = true and hasnodata = false). Anyrate after Pierre's explanations I still don't get it and its so confusing to be worse than useless.

I say we go with include_nodata -- if you set to true then it doesn't matter if the raster has a nodata value or does not have a nodata value cause you'll be counting all or excluding it if it does.

For ST_ValueCount -- I would simply get rid of that flag entirely unless it results in speed improvement. If I'm going to pass a set of pixel values, does it really matter the hasnodata. Am I really saving anything? If one of the pixels I pass in the array happens to be the nodata value, then of course I want to count it and if it isn't, then I wouldn't be counting it anyway.

comment:9 Changed 9 years ago by Bborie Park

I think the ideal solution (at least for all the functions I've written) is to eliminate the "hasnodata" function argument. If a band is flagged as having nodata values and the nodata value is set, just the function run with it. This assumes that the creator/maintainer of the raster knows what they're doing so the function goes with what it is given.

There are no speed improvements for the "hasnodata" flag, just a filtering one.

comment:10 Changed 9 years ago by pracine

I would tend to agree. If there are some nodata values and I want to count them I can easily do:

SELECT ST_ValueCount(ST_SetBandNodataValue(rast, NULL))

Otherwise:

SELECT ST_ValueCount(rast, ...)

Would not count them. But what if I include the nodata value in the list of value to count?

comment:11 Changed 9 years ago by Bborie Park

If the nodata value was in the list of values to count, I would think it wouldn't get counted (or more specifically, returns zero count). Your example of unsetting the nodata flag before passing into ST_ValueCount would be the correct way to do it and would not affect performance.

comment:12 Changed 9 years ago by Bborie Park

I'm easy so I'll do whatever folks agree on or Pierre dictates.

comment:13 Changed 9 years ago by pracine

So this should apply to: st_summarystats, st_approxsummarystats, st_count, st_approxcount, st_sum, st_approxsum, st_mean, st_approxmean, st_stddev, st_approxstddev, st_minmax (still exist??), st_approxminmax, _st_histogram, st_histogram, st_approxhistogram, _st_quantile, st_quantile, st_approxquantile, _st_valuecount and st_valuecount. Right?

and why not: st_value ?

One question:

Is

SELECT ST_ValueCount(ST_SetBandNodataValue(rast, NULL))

as fast as

SELECT ST_ValueCount(rast, FALSE, ...)

?

comment:14 Changed 9 years ago by Bborie Park

Yes, all the functions you mentioned could be changed.

case 1) SELECT ST_ValueCount(ST_SetBandNodataValue(rast, NULL))

The above is NOT as fast as

case 2) SELECT ST_ValueCount(rast, FALSE, ...)

This isn't as significant for small rasters (100 x 100) versus large rasters (10000 x 10000) due to the time required for:

  1. For ST_SetBandNodataValue, deserialize raster
  1. set hasnodata flag
  1. serialize raster
  1. For ST_ValueCount, deserialize raster
  1. process pixels
  1. return counts

For case 2, on steps 4, 5 and 6 would occur vs steps 1 through 6 for the two step. So, maybe we should keep the "hasnodata" and just give it a better name?

comment:15 Changed 9 years ago by pracine

I think we should keep it then...

I think this name should be consistent and involve the same behavior for all functions (ST_Value and ST_Intersects included).

What about "disablebandnodata"? Which means the opposite of hasnodata... and should be FALSE by default.

I think Regina should come with a consistent name that we can apply everywhere. I was happy with hasnodata. Snif ;-( I already miss it...

comment:16 Changed 9 years ago by Bborie Park

I've renamed the function parameter "hasnodata" to "exclude_nodata_value" in r7294 for ST_SummaryStats, ST_Count, ST_Sum, ST_Mean, ST_StdDev, ST_MinMax, ST_Quantile, ST_Histogram, ST_ValueCount and ST_ValuePercent.

comment:17 Changed 9 years ago by pracine

Could not ST_Count be optimized to "ST_Height() * ST_Width()" when ST_BandNodataValue() = NULL in cases when we do not care about the other stats generated by st_summarystats()?

A side question I have been wondering since a while: Right now, does:

  SELECT ST_Sum(rast), ST_Mean(rast) FROM mytable

takes twice the time of (?):

  SELECT (ss).sum, (ss).mean FROM (SELECT ST_SummaryStat(rast) ss FROM mytable) foo

Another one: Do we still want to conserve ST_MinMax instead of ST_Min and ST_Max?

comment:18 Changed 9 years ago by Bborie Park

Yes, ST_Count can be optimized for the situation where ST_BandNodataValue() = NULL.

  SELECT ST_Sum(rast), ST_Mean(rast) FROM mytable

In your example above, that would take twice the time as each function accesses the raster independently. This could be resolved if we cached the raster's stats within the raster.

The faster version of your example would be

SELECT sum, mean FROM ST_SummaryStats('mytable', 'rast')

Or some variation there-in.

Maybe Regina could chime in about keeping ST_MinMax versus breaking it apart to ST_Min and ST_Max. Personally, I'm indifferent except for the growing set of functions.

comment:19 Changed 9 years ago by pracine

You won't like that and it's a late comment but should we not NOT provide those ST_SummaryStat(rast) wrappers since they will always end-up being slower than the above constructs?

comment:20 Changed 9 years ago by Bborie Park

I don't know if it is possible to provide some sort of plpgsql wrapper that would do

SELECT ST_Sum(rast), ST_Mean(rast) FROM mytable}}}

as one ST_SummaryStats(rast) under the hood. I suppose there might be a way to have the backend cache the stats for only the duration of the transaction but this may get ugly when dealing with a large coverage table as each tile's stats would be cached.

comment:21 Changed 9 years ago by robe

Sorry missed this one. Yes I would remove those wrappers. That's why I've been slow commenting them. I think having too many functions makes it difficult to find the important ones. These are as far as I am concerned redundant.

comment:22 Changed 9 years ago by pracine

Really sorry about that Bborie... I should have seen this way before.

comment:23 Changed 9 years ago by Bborie Park

So, are you two proposing the elimination of the following wrapper functions:

ST_Count ST_Sum ST_Mean ST_Stddev ST_MinMax

and the Approx versions of the above? I'm just confirming as my vote is "yes" to eliminating them! Granted, this would eliminate the shortcut that Pierre mentioned earlier this week and I implemented for ST_Count.

On a side note, I really don't like the fact the inefficiency of getting a raster's width and height using ST_Width and ST_Height. As I was writing the ST_Histogram function for coverage tables, it gets really inefficient calling both functions for the dimensions instead of calling a unified ST_Dimensions function.

comment:24 Changed 9 years ago by pracine

I see two things in that:

1) We should not have to deserialize the whole raster to get the raster georeferencing properties. We should have a rt_raster_deserialize_header function. That would certainly make all all the getter faster. It is possible to deserialize only part of the raster? It seems to me that just copying rt_raster_deserialize until the first memcpy would make a working rt_raster_deserialize_header function.

2) We can not push the preceding logic too far and remove all the metadata getters. On the other side we could have a more efficient C implementation of ST_Metadata and ST_BandMetadata.

Would that fill your needs?

comment:25 in reply to:  24 Changed 9 years ago by Bborie Park

Replying to pracine:

1) We should not have to deserialize the whole raster to get the raster georeferencing properties. We should have a rt_raster_deserialize_header function. That would certainly make all all the getter faster. It is possible to deserialize only part of the raster? It seems to me that just copying rt_raster_deserialize until the first memcpy would make a working rt_raster_deserialize_header function.

We could just extend rt_raster_deserialize by adding a parameter flag "header_only" to shortcut the function. That would reduce the potential performance hit. If Jorge is reading this, I'd like him to chime in before I make changes to that function and any calls to that function.

2) We can not push the preceding logic too far and remove all the metadata getters. On the other side we could have a more efficient C implementation of ST_Metadata and ST_BandMetadata.

Would that fill your needs?

Extending rt_raster_deserialize to only deserialize the header should minimize some bottlenecks.

comment:26 Changed 9 years ago by strk

Yes it is possible to only get a slice of the toasted data. grep SLICE postgis/*.c

comment:27 Changed 9 years ago by Bborie Park

Fascinating. I think we'd want to combine PG_DETOAST_DATUM_SLICE as strk suggested with an enhanced rt_raster_deserialize function due to the raster->numBands parameter.

comment:28 Changed 9 years ago by Bborie Park

Raster deserialization has been changed to allow only header information to be extracted in r7376.

comment:29 Changed 9 years ago by Bborie Park

Resolution: fixed
Status: reopenedclosed
Note: See TracTickets for help on using tickets.