#3184 closed defect (fixed)
v.vect.stats: errors in counts and statistics
Reported by: | veroandreo | Owned by: | |
---|---|---|---|
Priority: | normal | Milestone: | 7.2.0 |
Component: | Vector | Version: | svn-trunk |
Keywords: | v.vect.stats | Cc: | |
CPU: | x86-64 | Platform: | Linux |
Description
I am working with cases of a disease in Greece. So, I have points (coordinates) and I need to get a count of those cases per municipality and higher administrative units, i.e: NUTS. However, v.vect.stats gives wrong counts. In, my case, I should have 615 cases and if I sum the cases counted by v.vect.stats (using v.db.select mymap column=cases), I get 767. In some polygons, it counts ok, in some it misses points, in others it "invents" extra points.
I guess the whole problem is maybe related to the fact that Greece has so many islands split in a much lower number of categories (i.e.: belonging to a reduced number of administrative units), but I don't know.
Anyway, I attach a vector of polygons and a vector of points as an example. They were created in NC demo location. Just unpack and run:
v.vect.stats -p points=points_test areas=areas_test method=average points_column=value count_column=count stats_column=average
This is the output:
area_cat|count|average 1|3|5.5 2|0|null 3|6|11.3333333333333 3|0|null 6|2|10.5
Printing to stout gives two lines for cat 3, one with the right count, one with 0. However, G7:v.db.select gives a different output:
v.db.select areas_test cat|count|average 1|3|5.5 2|0| 3|0| 6|2|10.5
If you query the map from the GUI or display the attribute table, polygons with cat 3 have 0 count (see attached screenshot).
Not only the count is wrong, but the average seems also weird. The three points in the polygons with cat 1 have a value of 5, why would average be 5.5?
I'm running GRASS trunk r69699 under Fedora 24.
Attachments (6)
Change History (26)
Changed 4 years ago by
Attachment: | areas_test.pack added |
---|
Changed 4 years ago by
Attachment: | points_test.pack added |
---|
vector of points to perform counts and statistics
Changed 4 years ago by
Attachment: | v_vect_stats_errors.png added |
---|
screenshot showing areas and points
comment:1 Changed 4 years ago by
Here's another example with internally generated data:
g.region res=1 n=3 s=0 w=0 e=3 -p v.mkgrid grid=3,3 map=grid v.extract grid type=centroid out=gc v.category grid type=centroid out=grid_no_cats cat=-1 op=del g.copy vect=grid_no_cats,grid_one_cat v.edit grid_one_cat type=centroid tool=catadd cat=1 bbox=0,0,3,3 v.vect.stats -p points=gc type=centroid areas=grid points_col=col countcol=count statscol=sum method=sum Selecting points for each area... 100% area_cat|count|sum 1|1|1 2|1|2 3|1|3 4|1|1 5|1|2 6|1|3 7|1|1 8|1|2 9|1|3 v.vect.stats -p points=gc type=centroid areas=grid_one_cat points_col=col countcol=count statscol=sum method=sum Selecting points for each area... 100% area_cat|count|sum 1|0|null 1|0|null 1|9|18 1|0|null 1|0|null
IOW, with all grid cells having the same cat value, results become nonsense except for one case. However, this one case is not the one uploaded to the vector attribute table when one doesn't use the '-p' flag.
comment:2 Changed 4 years ago by
Summary: | v.ect.stats: errors in counts and statistics → v.vect.stats: errors in counts and statistics |
---|
comment:3 Changed 4 years ago by
I think I found the issue.
When running with g.gisenv set=DEBUG=1, I get:
v.vect.stats -p points=gc type=centroid areas=grid_one_cat points_col=col countcol=count statscol=sum method=sum [...] D1/1: 9 cats loaded from vector (including duplicates) D1/1: 5 cats loaded from vector (unique) [...]
There should be only 1 unique cat.
The attached patch seems to solve this for me, but it needs testing and review.
Changed 4 years ago by
Attachment: | v_vect_stats_unique_cats.diff added |
---|
patch (against trunk) to correctly uniquify list of cats
comment:4 follow-up: 5 Changed 4 years ago by
I applied the patch and tested.
- For my data (615 points, 3600 areas split in 325 categories), I got Segmentation fault.
- For the data generated with the commands you just posted, the result looks fine when running v.vect.stats with -p flag.
GRASS 7.3.svn (nc_spm_08_grass7):~ > v.vect.stats -p points=gc type=centroid areas=grid_one_cat points_col=col countcol=count statscol=sum method=sum Selecting points for each area... 100% area_cat|count|sum 1|9|18
However, if you remove that flag, to actually update the attribute table (which for this case it is supposed to have only one category), I get:
GRASS 7.3.svn (nc_spm_08_grass7):~ > v.db.select grid_one_cat cat|row|col|rown|coln|count|sum 1|3|1|C|A|9|18 2|3|2|C|B|| 3|3|3|C|C|| 4|2|1|B|A|| 5|2|2|B|B|| 6|2|3|B|C|| 7|1|1|A|A|| 8|1|2|A|B|| 9|1|3|A|C||
Is there something wrong with v.db.select or with v.edit? I followed all your steps. But even if I do v.db.select grid_one_cat
before v.vect.stats (and after the v.edit step), I get all the old categories:
GRASS 7.3.svn (nc_spm_08_grass7):~ > v.db.select grid_one_cat cat|row|col|rown|coln 1|3|1|C|A 2|3|2|C|B 3|3|3|C|C 4|2|1|B|A 5|2|2|B|B 6|2|3|B|C 7|1|1|A|A 8|1|2|A|B 9|1|3|A|C
I'm lost...
comment:5 follow-up: 6 Changed 4 years ago by
Replying to veroandreo:
I applied the patch and tested.
- For my data (615 points, 3600 areas split in 325 categories), I got Segmentation fault.
Yes, i just noticed this for the original test data (test_areas, test_points) as well...
- For the data generated with the commands you just posted, the result looks fine when running v.vect.stats with -p flag.
GRASS 7.3.svn (nc_spm_08_grass7):~ > v.vect.stats -p points=gc type=centroid areas=grid_one_cat points_col=col countcol=count statscol=sum method=sum Selecting points for each area... 100% area_cat|count|sum 1|9|18However, if you remove that flag, to actually update the attribute table (which for this case it is supposed to have only one category), I get:
GRASS 7.3.svn (nc_spm_08_grass7):~ > v.db.select grid_one_cat cat|row|col|rown|coln|count|sum 1|3|1|C|A|9|18 2|3|2|C|B|| 3|3|3|C|C|| 4|2|1|B|A|| 5|2|2|B|B|| 6|2|3|B|C|| 7|1|1|A|A|| 8|1|2|A|B|| 9|1|3|A|C||Is there something wrong with v.db.select or with v.edit? I followed all your steps. But even if I do
v.db.select grid_one_cat
before v.vect.stats (and after the v.edit step), I get all the old categories:GRASS 7.3.svn (nc_spm_08_grass7):~ > v.db.select grid_one_cat cat|row|col|rown|coln 1|3|1|C|A 2|3|2|C|B 3|3|3|C|C 4|2|1|B|A 5|2|2|B|B 6|2|3|B|C 7|1|1|A|A 8|1|2|A|B 9|1|3|A|CI'm lost...
This is normal: the table is not changed by the v.category commands, only the cat values. So all the old entries are still there. To make it really clean, one would have to run v.db.connect -d, db.droptable, v.db.addtable after the v.category command.
comment:6 Changed 4 years ago by
Replying to mlennert:
Replying to veroandreo:
I applied the patch and tested.
- For my data (615 points, 3600 areas split in 325 categories), I got Segmentation fault.
Yes, i just noticed this for the original test data (test_areas, test_points) as well...
I can get rid of the segfault, but results are still not good. I have the feeling the issue is more profound than what I initially thought, so it would be better if MarkusM had a look at it.
Changed 4 years ago by
Attachment: | v_vect_stats_unique_cats.2.diff added |
---|
comment:7 Changed 4 years ago by
Ah, when those itches itch...
Please try the second patch I just uploaded. It works for my two test cases.
I also modified the behavior concerning calculating averages. IIUC, the idea is that if original values are integers, the resulting average should also be integer. However, again IIUC, there was a floor() missing to actually make it integer after the 0.5. So I added this floor().
Please test if the results seem ok. We can then discuss the +0.5 issue...
comment:8 follow-up: 9 Changed 4 years ago by
When I apply this second patch (having the first already applied), I get:
[veroandreo@192 grass-7.3.svn]$ patch -p0 < v_vect_stats_unique_cats.2.diff patching file vector/v.vect.stats/main.c Hunk #2 FAILED at 100. Hunk #3 FAILED at 461. Hunk #4 succeeded at 660 (offset 2 lines). 2 out of 4 hunks FAILED — saving rejects to file vector/v.vect.stats/main.c.rej
If anyway I try to use v.vect.stats with the first test data (areas_test and points_test), I get Segmentation fault. But maybe that's because the patch is not fully/correctly applied? I'm using latest trunk revision (r69713M).
ps: thanks for clarification regarding v.category, v.db.update and so on
comment:9 Changed 4 years ago by
Replying to veroandreo:
When I apply this second patch (having the first already applied),
You likely need to revert the first patch if the second was generated against SVN.
svn revert file(s) # verify svn status
Then apply the second alias new patch (I didn't try myself but this would be the common procedure).
comment:10 follow-up: 11 Changed 4 years ago by
Ahhh! Thanks for the explanation :)
So, I applied the second patch "instead" of the first one, and it works as expected, at least for the counts of points in polygons. I tested both with the test data previously attached and with data of my own. Thanks so much! I will test more and report back :)
Regarding the statistics... I'm not sure how to, but even if the attribute data in the points is integer, wouldn't that be better to have the "real" average, standard deviation, etc., and not a truncated version of them??
comment:11 Changed 4 years ago by
Replying to veroandreo:
Ahhh! Thanks for the explanation :)
So, I applied the second patch "instead" of the first one, and it works as expected, at least for the counts of points in polygons. I tested both with the test data previously attached and with data of my own. Thanks so much! I will test more and report back :)
The patch to remove duplicate area categories seems correct to me, but does not account for the case when no area categories are found. This test is missing in the code anyway and should be done before removing duplicate area categories.
Regarding the statistics... I'm not sure how to, but even if the attribute data in the points is integer, wouldn't that be better to have the "real" average, standard deviation, etc., and not a truncated version of them??
Right. The stats that can be integer are "category number of lowest value", "category number of highest value", and "number of different values", all else should be floating point. Count is treated differently (not a method option but triggered with the count_column option). This "half" addition of 0.5 comes from r.neighbors in G6 which is ignored by r.neighbors in G7.
Changed 4 years ago by
Attachment: | v_vect_stats_unique_cats_no_half.diff added |
---|
patch against trunk to remove duplicate cats and to eliminate half
comment:12 follow-up: 13 Changed 4 years ago by
I have uploaded a patch (v_vect_stats_unique_cats_no_half.diff) that correctly removes duplicate cats (similar to v_vect_stats_unique_cats.2.diff) and additionally eliminates the usage of half (adding 0.5 to the result). This half makes only sense if the result is to be converted to integer and if the result is a positive number. The result is always floating point, therefore adding 0.5 causes wrong results.
comment:13 follow-up: 14 Changed 4 years ago by
Replying to mmetz:
I have uploaded a patch (v_vect_stats_unique_cats_no_half.diff) that correctly removes duplicate cats (similar to v_vect_stats_unique_cats.2.diff) and additionally eliminates the usage of half (adding 0.5 to the result). This half makes only sense if the result is to be converted to integer and if the result is a positive number. The result is always floating point, therefore adding 0.5 causes wrong results.
Thanks Markus! I applied the patch and tested with Moritz's test data and examples, and also with my own data. Counts are fine and also stats :)
comment:14 follow-up: 15 Changed 4 years ago by
Replying to veroandreo:
Replying to mmetz:
I have uploaded a patch (v_vect_stats_unique_cats_no_half.diff) that correctly removes duplicate cats (similar to v_vect_stats_unique_cats.2.diff) and additionally eliminates the usage of half (adding 0.5 to the result). This half makes only sense if the result is to be converted to integer and if the result is a positive number. The result is always floating point, therefore adding 0.5 causes wrong results.
Thanks Markus! I applied the patch and tested with Moritz's test data and examples, and also with my own data. Counts are fine and also stats :)
This is a bugfix and should be backported to the G7 release branches. Too late for 7.2?
comment:15 follow-up: 19 Changed 4 years ago by
Replying to mmetz:
Replying to veroandreo:
Replying to mmetz:
I have uploaded a patch (v_vect_stats_unique_cats_no_half.diff) that correctly removes duplicate cats (similar to v_vect_stats_unique_cats.2.diff) and additionally eliminates the usage of half (adding 0.5 to the result). This half makes only sense if the result is to be converted to integer and if the result is a positive number. The result is always floating point, therefore adding 0.5 causes wrong results.
Thanks Markus! I applied the patch and tested with Moritz's test data and examples, and also with my own data. Counts are fine and also stats :)
This is a bugfix and should be backported to the G7 release branches. Too late for 7.2?
I would say no. We're just about to release RC1, so bugfixing is more than welcome. In addition, the fix will not change anything else outside of v.vect.stats and seems important enough to include.
I guess we would need some basic automatic tests to make sure that it doesn't modify other expected behavior ? ;-)
comment:19 follow-up: 20 Changed 4 years ago by
Replying to mlennert:
Replying to mmetz:
Replying to veroandreo:
Replying to mmetz:
I have uploaded a patch (v_vect_stats_unique_cats_no_half.diff) that correctly removes duplicate cats (similar to v_vect_stats_unique_cats.2.diff) and additionally eliminates the usage of half (adding 0.5 to the result). This half makes only sense if the result is to be converted to integer and if the result is a positive number. The result is always floating point, therefore adding 0.5 causes wrong results.
Thanks Markus! I applied the patch and tested with Moritz's test data and examples, and also with my own data. Counts are fine and also stats :)
This is a bugfix and should be backported to the G7 release branches. Too late for 7.2?
I would say no. We're just about to release RC1, so bugfixing is more than welcome. In addition, the fix will not change anything else outside of v.vect.stats and seems important enough to include.
OK, applied to trunk, relbr72, and relbr70 in r69722-4.
I guess we would need some basic automatic tests to make sure that it doesn't modify other expected behavior ? ;-)
I would wait for any new errors in the test suite. If other expected behaviour is no longer met, that expectation was wrong.
comment:20 Changed 4 years ago by
Replying to mmetz:
Replying to mlennert:
I guess we would need some basic automatic tests to make sure that it doesn't modify other expected behavior ? ;-)
I would wait for any new errors in the test suite. If other expected behaviour is no longer met, that expectation was wrong.
Well, there are no tests for v.vect.stats. That's what I meant by "we would need some basic automatic tests"....
vector of areas to perform count