Opened 8 years ago

Closed 8 years ago

Last modified 8 years ago

#3184 closed defect (fixed)

v.vect.stats: errors in counts and statistics

Reported by: veroandreo Owned by: grass-dev@…
Priority: normal Milestone: 7.2.0
Component: Vector Version: svn-trunk
Keywords: v.vect.stats Cc:
CPU: x86-64 Platform: Linux

Description

I am working with cases of a disease in Greece. So, I have points (coordinates) and I need to get a count of those cases per municipality and higher administrative units, i.e: NUTS. However, v.vect.stats gives wrong counts. In, my case, I should have 615 cases and if I sum the cases counted by v.vect.stats (using v.db.select mymap column=cases), I get 767. In some polygons, it counts ok, in some it misses points, in others it "invents" extra points.

I guess the whole problem is maybe related to the fact that Greece has so many islands split in a much lower number of categories (i.e.: belonging to a reduced number of administrative units), but I don't know.

Anyway, I attach a vector of polygons and a vector of points as an example. They were created in NC demo location. Just unpack and run:

v.vect.stats -p points=points_test areas=areas_test method=average points_column=value count_column=count stats_column=average

This is the output:

area_cat|count|average
1|3|5.5
2|0|null
3|6|11.3333333333333
3|0|null
6|2|10.5

Printing to stout gives two lines for cat 3, one with the right count, one with 0. However, G7:v.db.select gives a different output:

v.db.select areas_test
cat|count|average
1|3|5.5
2|0|
3|0|
6|2|10.5

If you query the map from the GUI or display the attribute table, polygons with cat 3 have 0 count (see attached screenshot).

Not only the count is wrong, but the average seems also weird. The three points in the polygons with cat 1 have a value of 5, why would average be 5.5?

I'm running GRASS trunk r69699 under Fedora 24.

Attachments (6)

areas_test.pack (3.1 KB ) - added by veroandreo 8 years ago.
vector of areas to perform count
points_test.pack (1.7 KB ) - added by veroandreo 8 years ago.
vector of points to perform counts and statistics
v_vect_stats_errors.png (231.4 KB ) - added by veroandreo 8 years ago.
screenshot showing areas and points
v_vect_stats_unique_cats.diff (1.2 KB ) - added by mlennert 8 years ago.
patch (against trunk) to correctly uniquify list of cats
v_vect_stats_unique_cats.2.diff (1.6 KB ) - added by mlennert 8 years ago.
v_vect_stats_unique_cats_no_half.diff (4.2 KB ) - added by mmetz 8 years ago.
patch against trunk to remove duplicate cats and to eliminate half

Download all attachments as: .zip

Change History (26)

by veroandreo, 8 years ago

Attachment: areas_test.pack added

vector of areas to perform count

by veroandreo, 8 years ago

Attachment: points_test.pack added

vector of points to perform counts and statistics

by veroandreo, 8 years ago

Attachment: v_vect_stats_errors.png added

screenshot showing areas and points

comment:1 by mlennert, 8 years ago

Here's another example with internally generated data:

g.region res=1 n=3 s=0 w=0 e=3 -p
v.mkgrid grid=3,3 map=grid
v.extract grid type=centroid out=gc
v.category grid type=centroid out=grid_no_cats cat=-1 op=del
g.copy vect=grid_no_cats,grid_one_cat
v.edit grid_one_cat type=centroid tool=catadd cat=1 bbox=0,0,3,3

v.vect.stats -p points=gc type=centroid areas=grid points_col=col countcol=count statscol=sum method=sum
Selecting points for each area...
 100%
area_cat|count|sum
1|1|1
2|1|2
3|1|3
4|1|1
5|1|2
6|1|3
7|1|1
8|1|2
9|1|3

v.vect.stats -p points=gc type=centroid areas=grid_one_cat points_col=col countcol=count statscol=sum method=sum
Selecting points for each area...
 100%
area_cat|count|sum
1|0|null
1|0|null
1|9|18
1|0|null
1|0|null

IOW, with all grid cells having the same cat value, results become nonsense except for one case. However, this one case is not the one uploaded to the vector attribute table when one doesn't use the '-p' flag.

comment:2 by veroandreo, 8 years ago

Summary: v.ect.stats: errors in counts and statisticsv.vect.stats: errors in counts and statistics

comment:3 by mlennert, 8 years ago

I think I found the issue.

When running with g.gisenv set=DEBUG=1, I get:

v.vect.stats -p points=gc type=centroid areas=grid_one_cat points_col=col countcol=count statscol=sum method=sum
[...]
D1/1: 9 cats loaded from vector (including duplicates)
D1/1: 5 cats loaded from vector (unique)
[...]

There should be only 1 unique cat.

The attached patch seems to solve this for me, but it needs testing and review.

by mlennert, 8 years ago

patch (against trunk) to correctly uniquify list of cats

comment:4 by veroandreo, 8 years ago

I applied the patch and tested.

  • For my data (615 points, 3600 areas split in 325 categories), I got Segmentation fault.
  • For the data generated with the commands you just posted, the result looks fine when running v.vect.stats with -p flag.
GRASS 7.3.svn (nc_spm_08_grass7):~ > v.vect.stats -p points=gc type=centroid areas=grid_one_cat points_col=col countcol=count statscol=sum method=sum
Selecting points for each area...
 100%
area_cat|count|sum
1|9|18

However, if you remove that flag, to actually update the attribute table (which for this case it is supposed to have only one category), I get:

GRASS 7.3.svn (nc_spm_08_grass7):~ > v.db.select grid_one_cat
cat|row|col|rown|coln|count|sum
1|3|1|C|A|9|18
2|3|2|C|B||
3|3|3|C|C||
4|2|1|B|A||
5|2|2|B|B||
6|2|3|B|C||
7|1|1|A|A||
8|1|2|A|B||
9|1|3|A|C||

Is there something wrong with v.db.select or with v.edit? I followed all your steps. But even if I do v.db.select grid_one_cat before v.vect.stats (and after the v.edit step), I get all the old categories:

GRASS 7.3.svn (nc_spm_08_grass7):~ > v.db.select grid_one_cat
cat|row|col|rown|coln
1|3|1|C|A
2|3|2|C|B
3|3|3|C|C
4|2|1|B|A
5|2|2|B|B
6|2|3|B|C
7|1|1|A|A
8|1|2|A|B
9|1|3|A|C

I'm lost...

in reply to:  4 ; comment:5 by mlennert, 8 years ago

Replying to veroandreo:

I applied the patch and tested.

  • For my data (615 points, 3600 areas split in 325 categories), I got Segmentation fault.

Yes, i just noticed this for the original test data (test_areas, test_points) as well...

  • For the data generated with the commands you just posted, the result looks fine when running v.vect.stats with -p flag.
GRASS 7.3.svn (nc_spm_08_grass7):~ > v.vect.stats -p points=gc type=centroid areas=grid_one_cat points_col=col countcol=count statscol=sum method=sum
Selecting points for each area...
 100%
area_cat|count|sum
1|9|18

However, if you remove that flag, to actually update the attribute table (which for this case it is supposed to have only one category), I get:

GRASS 7.3.svn (nc_spm_08_grass7):~ > v.db.select grid_one_cat
cat|row|col|rown|coln|count|sum
1|3|1|C|A|9|18
2|3|2|C|B||
3|3|3|C|C||
4|2|1|B|A||
5|2|2|B|B||
6|2|3|B|C||
7|1|1|A|A||
8|1|2|A|B||
9|1|3|A|C||

Is there something wrong with v.db.select or with v.edit? I followed all your steps. But even if I do v.db.select grid_one_cat before v.vect.stats (and after the v.edit step), I get all the old categories:

GRASS 7.3.svn (nc_spm_08_grass7):~ > v.db.select grid_one_cat
cat|row|col|rown|coln
1|3|1|C|A
2|3|2|C|B
3|3|3|C|C
4|2|1|B|A
5|2|2|B|B
6|2|3|B|C
7|1|1|A|A
8|1|2|A|B
9|1|3|A|C

I'm lost...

This is normal: the table is not changed by the v.category commands, only the cat values. So all the old entries are still there. To make it really clean, one would have to run v.db.connect -d, db.droptable, v.db.addtable after the v.category command.

in reply to:  5 comment:6 by mlennert, 8 years ago

Replying to mlennert:

Replying to veroandreo:

I applied the patch and tested.

  • For my data (615 points, 3600 areas split in 325 categories), I got Segmentation fault.

Yes, i just noticed this for the original test data (test_areas, test_points) as well...

I can get rid of the segfault, but results are still not good. I have the feeling the issue is more profound than what I initially thought, so it would be better if MarkusM had a look at it.

by mlennert, 8 years ago

comment:7 by mlennert, 8 years ago

Ah, when those itches itch...

Please try the second patch I just uploaded. It works for my two test cases.

I also modified the behavior concerning calculating averages. IIUC, the idea is that if original values are integers, the resulting average should also be integer. However, again IIUC, there was a floor() missing to actually make it integer after the 0.5. So I added this floor().

Please test if the results seem ok. We can then discuss the +0.5 issue...

comment:8 by veroandreo, 8 years ago

When I apply this second patch (having the first already applied), I get:

[veroandreo@192 grass-7.3.svn]$ patch -p0 < v_vect_stats_unique_cats.2.diff 
patching file vector/v.vect.stats/main.c
Hunk #2 FAILED at 100.
Hunk #3 FAILED at 461.
Hunk #4 succeeded at 660 (offset 2 lines).
2 out of 4 hunks FAILED — saving rejects to file vector/v.vect.stats/main.c.rej

If anyway I try to use v.vect.stats with the first test data (areas_test and points_test), I get Segmentation fault. But maybe that's because the patch is not fully/correctly applied? I'm using latest trunk revision (r69713M).

ps: thanks for clarification regarding v.category, v.db.update and so on

in reply to:  8 comment:9 by neteler, 8 years ago

Replying to veroandreo:

When I apply this second patch (having the first already applied),

You likely need to revert the first patch if the second was generated against SVN.

svn revert file(s)
# verify
svn status

Then apply the second alias new patch (I didn't try myself but this would be the common procedure).

comment:10 by veroandreo, 8 years ago

Ahhh! Thanks for the explanation :)

So, I applied the second patch "instead" of the first one, and it works as expected, at least for the counts of points in polygons. I tested both with the test data previously attached and with data of my own. Thanks so much! I will test more and report back :)

Regarding the statistics... I'm not sure how to, but even if the attribute data in the points is integer, wouldn't that be better to have the "real" average, standard deviation, etc., and not a truncated version of them??

in reply to:  10 comment:11 by mmetz, 8 years ago

Replying to veroandreo:

Ahhh! Thanks for the explanation :)

So, I applied the second patch "instead" of the first one, and it works as expected, at least for the counts of points in polygons. I tested both with the test data previously attached and with data of my own. Thanks so much! I will test more and report back :)

The patch to remove duplicate area categories seems correct to me, but does not account for the case when no area categories are found. This test is missing in the code anyway and should be done before removing duplicate area categories.

Regarding the statistics... I'm not sure how to, but even if the attribute data in the points is integer, wouldn't that be better to have the "real" average, standard deviation, etc., and not a truncated version of them??

Right. The stats that can be integer are "category number of lowest value", "category number of highest value", and "number of different values", all else should be floating point. Count is treated differently (not a method option but triggered with the count_column option). This "half" addition of 0.5 comes from r.neighbors in G6 which is ignored by r.neighbors in G7.

by mmetz, 8 years ago

patch against trunk to remove duplicate cats and to eliminate half

comment:12 by mmetz, 8 years ago

I have uploaded a patch (v_vect_stats_unique_cats_no_half.diff​) that correctly removes duplicate cats (similar to v_vect_stats_unique_cats.2.diff) and additionally eliminates the usage of half (adding 0.5 to the result). This half makes only sense if the result is to be converted to integer and if the result is a positive number. The result is always floating point, therefore adding 0.5 causes wrong results.

in reply to:  12 ; comment:13 by veroandreo, 8 years ago

Replying to mmetz:

I have uploaded a patch (v_vect_stats_unique_cats_no_half.diff​) that correctly removes duplicate cats (similar to v_vect_stats_unique_cats.2.diff) and additionally eliminates the usage of half (adding 0.5 to the result). This half makes only sense if the result is to be converted to integer and if the result is a positive number. The result is always floating point, therefore adding 0.5 causes wrong results.

Thanks Markus! I applied the patch and tested with Moritz's test data and examples, and also with my own data. Counts are fine and also stats :)

in reply to:  13 ; comment:14 by mmetz, 8 years ago

Replying to veroandreo:

Replying to mmetz:

I have uploaded a patch (v_vect_stats_unique_cats_no_half.diff​) that correctly removes duplicate cats (similar to v_vect_stats_unique_cats.2.diff) and additionally eliminates the usage of half (adding 0.5 to the result). This half makes only sense if the result is to be converted to integer and if the result is a positive number. The result is always floating point, therefore adding 0.5 causes wrong results.

Thanks Markus! I applied the patch and tested with Moritz's test data and examples, and also with my own data. Counts are fine and also stats :)

This is a bugfix and should be backported to the G7 release branches. Too late for 7.2?

in reply to:  14 ; comment:15 by mlennert, 8 years ago

Replying to mmetz:

Replying to veroandreo:

Replying to mmetz:

I have uploaded a patch (v_vect_stats_unique_cats_no_half.diff​) that correctly removes duplicate cats (similar to v_vect_stats_unique_cats.2.diff) and additionally eliminates the usage of half (adding 0.5 to the result). This half makes only sense if the result is to be converted to integer and if the result is a positive number. The result is always floating point, therefore adding 0.5 causes wrong results.

Thanks Markus! I applied the patch and tested with Moritz's test data and examples, and also with my own data. Counts are fine and also stats :)

This is a bugfix and should be backported to the G7 release branches. Too late for 7.2?

I would say no. We're just about to release RC1, so bugfixing is more than welcome. In addition, the fix will not change anything else outside of v.vect.stats and seems important enough to include.

I guess we would need some basic automatic tests to make sure that it doesn't modify other expected behavior ? ;-)

comment:16 by mmetz, 8 years ago

Resolution: fixed
Status: newclosed

In 69722:

v.vect.stats: fix #3184, also remove the erroneous 'half' parameter

comment:17 by mmetz, 8 years ago

In 69723:

v.vect.stats: fix #3184, also remove the erroneous 'half' parameter (backport from trunk r69722)

comment:18 by mmetz, 8 years ago

In 69724:

v.vect.stats: fix #3184, also remove the erroneous 'half' parameter (backport from trunk r69722)

in reply to:  15 ; comment:19 by mmetz, 8 years ago

Replying to mlennert:

Replying to mmetz:

Replying to veroandreo:

Replying to mmetz:

I have uploaded a patch (v_vect_stats_unique_cats_no_half.diff​) that correctly removes duplicate cats (similar to v_vect_stats_unique_cats.2.diff) and additionally eliminates the usage of half (adding 0.5 to the result). This half makes only sense if the result is to be converted to integer and if the result is a positive number. The result is always floating point, therefore adding 0.5 causes wrong results.

Thanks Markus! I applied the patch and tested with Moritz's test data and examples, and also with my own data. Counts are fine and also stats :)

This is a bugfix and should be backported to the G7 release branches. Too late for 7.2?

I would say no. We're just about to release RC1, so bugfixing is more than welcome. In addition, the fix will not change anything else outside of v.vect.stats and seems important enough to include.

OK, applied to trunk, relbr72, and relbr70 in r69722-4.

I guess we would need some basic automatic tests to make sure that it doesn't modify other expected behavior ? ;-)

I would wait for any new errors in the test suite. If other expected behaviour is no longer met, that expectation was wrong.

in reply to:  19 comment:20 by mlennert, 8 years ago

Replying to mmetz:

Replying to mlennert:

I guess we would need some basic automatic tests to make sure that it doesn't modify other expected behavior ? ;-)

I would wait for any new errors in the test suite. If other expected behaviour is no longer met, that expectation was wrong.

Well, there are no tests for v.vect.stats. That's what I meant by "we would need some basic automatic tests"....

Note: See TracTickets for help on using tickets.