Opened 13 years ago

Closed 13 years ago

#1005 closed defect (fixed)

[raster] Problem with ST_SummaryStats(text, text) results

Reported by: pracine Owned by: dustymugs
Priority: medium Milestone: PostGIS 2.0.0
Component: raster Version: master
Keywords: Cc:

Description

I can observe some problems with the numbers returned by the coverage variant of ST_SummaryStats

1) The sum is the sum of the last tile only. Not of the whole coverage.

2) For a raster coverage of two rasters created like this:

--First test raster
CREATE OR REPLACE FUNCTION ST_TestRaster1() 
    RETURNS raster AS 
    $$
    DECLARE
        newrast raster := ST_AddBand(ST_MakeEmptyRaster(2, 2, 0, 0, 1, 1, 0, 0, -1), '32BUI'::text, 0, -1);
    BEGIN
        newrast := ST_SetValue(newrast, 1, 1, 1);
        newrast := ST_SetValue(newrast, 1, 2, 3);
        newrast := ST_SetValue(newrast, 2, 1, 3);
        newrast := ST_SetValue(newrast, 2, 2, 1);
        RETURN newrast;
    END;
    $$
    LANGUAGE 'plpgsql';

--Second test raster
CREATE OR REPLACE FUNCTION ST_TestRaster2() 
    RETURNS raster AS 
    $$
    DECLARE
        newrast raster := ST_AddBand(ST_MakeEmptyRaster(2, 2, 2, 2, 1, 1, 0, 0, -1), '32BUI'::text, 0, -1);
    BEGIN
        newrast := ST_SetValue(newrast, 1, 1, 1);
        newrast := ST_SetValue(newrast, 1, 2, 2);
        newrast := ST_SetValue(newrast, 2, 1, 3);
        newrast := ST_SetValue(newrast, 2, 2, 4);
        RETURN newrast;
    END;
    $$
    LANGUAGE 'plpgsql';

--Create table
DROP TABLE sumstattest;
CREATE TABLE sumstattest AS
 SELECT ST_TestRaster2() rast
 UNION ALL
 SELECT ST_TestRaster1() rast;

SELECT * FROM st_summarystats('sumstattest', 'rast');

should return 1.089724736 not 1.06066017770915.

Result verified with the stdev.p function in Excel.

Change History (27)

comment:1 by pracine, 13 years ago

Result from:

SELECT (ss).* 
FROM (SELECT st_summarystats(rast) ss
      FROM sumstattest
      ) foo;

are correct.

comment:2 by dustymugs, 13 years ago

Thanks for catching the error with the sum as I forgot to collect the sum for coverages. My next revision will fix that.

As for the standard deviation of a coverage, I do confirm the issue. To do this correctly will require going over the coverage table twice, once for the coverage's mean and the second for the mean's standard deviation. So, do we want to be accurate at the expense of speed or be close at the expense of accuracy? Maybe we need to add an option flagging whether or not to get the accurate standard deviation for a coverage?

comment:3 by dustymugs, 13 years ago

Fixed the ST_SummaryStats for coverage tables bug regarding the value for sum in r7340.

in reply to:  2 comment:4 by pracine, 13 years ago

Replying to dustymugs:

As for the standard deviation of a coverage, I do confirm the issue. To do this correctly will require going over the coverage table twice, once for the coverage's mean and the second for the mean's standard deviation. So, do we want to be accurate at the expense of speed or be close at the expense of accuracy? Maybe we need to add an option flagging whether or not to get the accurate standard deviation for a coverage?

1) We must be clear about that in the doc and we should provide (a link to?) the method used to get the approximative value

2) I agree with the option. Default should be the approximate value.

I thought the method you found (one pass stddev) provided accurate result. I thought we added the sum to be able to get an accurate stddev. How does it work for one tile? Is the stddev accurate for one tile?

comment:5 by dustymugs, 13 years ago

The one pass standard deviation is accurate for any single tile. I'm going to do some refactoring today to see if I can adapt the one pass standard deviation to process a coverage table though. If I can get it working, we won't need flagging or anything of the sort.

The sum was added to get the true mean instead of a weighted mean.

comment:6 by dustymugs, 13 years ago

Owner: changed from pracine to dustymugs
Status: newassigned

comment:7 by pracine, 13 years ago

Interesting that:

SELECT (ss).* 
FROM (SELECT st_summarystats(rast) ss
      FROM srtm_22_03_tiled_100x100
      ) foo

takes 32 seconds on my machine and:

SELECT * FROM st_summarystats('srtm_22_03_tiled_100x100', 'rast');

takes only 6 seconds. How is this possible? We are counting the same number of pixels here.

comment:8 by robe, 13 years ago

Pierre,

Are you running on 8.4. I think things might have changed in 9.0. check out my article - at the end see my compare between 8.4/9.0 — but I never got a chance to further test to confirm.

http://www.postgresonline.com/journal/archives/201-returns-table.html

comment:9 by robe, 13 years ago

I suspect in your first one, the planner has decided not to materialize the subselect.

Interesting to see what would happen if you did something like:

SELECT (ss).* 
FROM (SELECT st_summarystats(rast) ss
      FROM srtm_22_03_tiled_100x100
      OFFSET 0) foo

comment:10 by pracine, 13 years ago

6 seconds…

in reply to:  8 comment:11 by pracine, 13 years ago

Replying to robe:

Pierre,

Are you running on 8.4. I think things might have changed in 9.0. check out my article - at the end see my compare between 8.4/9.0 — but I never got a chance to further test to confirm.

http://www.postgresonline.com/journal/archives/201-returns-table.html

Ok,

SELECT (ST_SummaryStats(rast)).* FROM srtm_22_03_tiled_100x100

still takes 32 seconds and strangely:

SELECT (st_summarystats('srtm_22_03_tiled_100x100', 'rast')).*

now takes also 32 seconds. So the fastest/shorter query is still

SELECT * FROM st_summarystats('srtm_22_03_tiled_100x100', 'rast');

That's on my old 8.4

comment:12 by robe, 13 years ago

how big is your dataset. of course as you mentioned before its calling the function for each each column in the bad case.

comment:13 by pracine, 13 years ago

It's one 6000x6000 SRTM elevation file tiled into 100x100 tiles (3600 tiles). It's my favorite test dataset.

comment:14 by dustymugs, 13 years ago

Fixed the standard deviation component of ST_SummaryStats for coverage tables bug in r7342.

comment:15 by pracine, 13 years ago

So we have an accurate coverage stddev?

comment:16 by dustymugs, 13 years ago

Yes. It uses the same one-pass standard deviation calculation method that is used for each individual tile. Your example above comes to the correct value.

comment:17 by robe, 13 years ago

Pierre,

Okay I'm seeing similar behavior on PostgreSQL 9.1 for ST_SummaryStats and ST_Histogram. So I think it has to do with if you materialize vs. not.

So here is a single tile raster I have with 1,517,411 pixels.

-- This takes 1,147ms --
SELECT (f).* 
FROM (SELECT (ST_SummaryStats(rast,1)) As f
from philly_map) As foo;

—but this takes 214 ms

SELECT (f).* 
FROM (SELECT (ST_SummaryStats(rast,1)) As f
from philly_map OFFSET 0) As foo;

NOTE in the second — I used the OFFSET 0 hack to force materialization of the subselect as described here — http://www.postgresonline.com/journal/archives/113-How-to-force-PostgreSQL-to-use-a-pre-calculated-value.html

— The histogram case is tempermental if it materializes — or not since it is dependent on number of rows

-- Takes 294 ms returns 22 rows (with the offset takes 306ms -- so about the same
SELECT (f).*
FROM (SELECT ST_Histogram(rast,1) As f
from philly_map) As foo;

So I guess short answer — if we are going to be doing wrapper stuff, we better put in an OFFSET 0 in there just to be safe. Haven't tested on 8.4, but as I recall way back, its the same story.

comment:18 by pracine, 13 years ago

So I understand that Bborie should try inserting OFFSET 0 in the wrappers before getting rid of them?

comment:19 by dustymugs, 13 years ago

I'd rather get rid of the ST_SummaryStats wrappers:

ST_Count ST_Sum ST_Mean ST_Stddev ST_MinMax

I don't think adding OFFSET 0 to any of the functions will make any difference as the OFFSET 0 applies to the query calling ST_SummaryStats rather than within ST_SummaryStats.

comment:20 by robe, 13 years ago

I kind of like ST_Count since its a basic measure of how big your tile is. The others I think if people get to that level of granularity they should just run summary stats.

So I vote to keep ST_Count but get rid of the other wrappers. Besides ST_Count has a short-circuit.

comment:21 by dustymugs, 13 years ago

Sounds good to me. Pierre, thoughts? If you say yes, I'll eliminate the following functions and keep ST_Count:

ST_Sum ST_Mean ST_Stddev ST_MinMax

comment:22 by pracine, 13 years ago

I would remove ST_Count as well. In the case mentioned by robe, it is very easy to do ST_Height() * ST_Width() instead. The only remaining interest for ST_Count() is that is tells you of many pixel have a true value and in this case we are already at the lower level of granularity.

comment:23 by robe, 13 years ago

I disagree — I find myself using count a lot (more so than summarystats) and hate to part with it. I can see myself typing ST_Height()*ST_Width() a lot which is too verbose and if I have tiled rasters even more annoying as I'd have to do sum(st_height*st_width) etc or resort to ST_SummaryStats. Most of the time I just care about the count of nodata vs. with nodata. The fact it returns a simple number is sweeter than ST_SummaryStats and easier to register in the mind.

comment:24 by pracine, 13 years ago

Then let there be ST_Count()!

comment:25 by dustymugs, 13 years ago

I've removed the following wrappers in r7375.

ST_Sum
ST_Mean
ST_Stddev
ST_MinMax

comment:26 by dustymugs, 13 years ago

Regina and/or Pierre, if you both feel that this ticket has been resolved, please close this ticket. Thanks!

comment:27 by pracine, 13 years ago

Resolution: fixed
Status: assignedclosed
Note: See TracTickets for help on using tickets.