Opened 8 years ago
Closed 8 years ago
#3720 closed defect (fixed)
geobuf regress error on debbie and winnie
Reported by: | robe | Owned by: | Björn Harrtell |
---|---|---|---|
Priority: | medium | Milestone: | PostGIS 2.4.0 |
Component: | postgis | Version: | master |
Keywords: | Cc: |
Description
Just noticed this on debbie. I'm rerunning to confirm it's still an issue.
geobuf .. failed (diff expected obtained: /var/lib/jenkins/workspace/postgis/tmp/2_4_pg9.6w64/test_117_diff) ----------------------------------------------------------------------------- --- geobuf_expected 2017-02-26 18:37:40.110686758 +0000 +++ /var/lib/jenkins/workspace/postgis/tmp/2_4_pg9.6w64/test_117_out 2017-02-26 18:41:54.722689185 +0000 @@ -3,9 +3,18 @@ dF9mbG9hdBgAIjoKOAoICAIaBAICAgJqBgoEdGVzdGoCGAFqAiABagUKAzEuMWoJEZqZmZmZmfE/ cgoAAAEBAgIDAwQE T3|GAAiEAoOCgwIBBoIAAAAAgIAAAE= -T4|GAAiHAoaChgIBBICBAQaEAAAAAoKAAAJAgIAAgIAAAE= +T4|GAAirwEKrAEKqQEIBBICBAQaoAH///////////8B////////////Af///////////wH///////// +//8B////////////Af///////////wH///////////8B////////////Af///////////wH///// +//////8B////////////Af///////////wH///////////8B////////////Af///////////wH/ +//////////8B T5|GAAiEAoOCgwIARoIFFA8EycTFBM= -T6|GAAiGgoYChYIAxICAwQaDhQUFBQTKFBQExMUExMT -T7|GAAiJgokCiIIBRIGAgEDAgUDGhZQUCcKMh0oRhMJACcoCR4ePCgTCQAU +T6|GAAimwEKmAEKlQEIAxICAwQajAH///////////8B////////////Af///////////wH///////// +//8B////////////Af///////////wH///////////8B////////////Af///////////wH///// +//////8B////////////Af///////////wH///////////8B////////////AQ== +T7|GAAi7wEK7AEK6QEIBRIGAgEDAgUDGtwB////////////Af///////////wH///////////8B//// +////////Af///////////wH///////////8B////////////Af///////////wH///////////8B +////////////Af///////////wH///////////8B////////////Af///////////wH///////// +//8B////////////Af///////////wH///////////8B////////////Af///////////wH///// +//////8B////////////AQ== T8|GAAiGAoWChQIBiIGCAAaAggMIggIAhoECAwGCA== T9|EAMYACILCgkKBwgAGgMCBAY= -----------------------------------------------------------------------------
Winnie has got other issues but that might be because I'm in middle of upgrading her GDAL among other things.
Change History (31)
comment:1 by , 8 years ago
Owner: | changed from | to
---|
comment:2 by , 8 years ago
comment:3 by , 8 years ago
Summary: | geobuf regress error on debbie → geobuf regress error on debbie and winnie |
---|
just checked. Winnie shows the same error. Perhaps you forgot a regress file commit?
comment:4 by , 8 years ago
Looks like I'm getting unexpected output, cause unknown but will investigate further.
comment:5 by , 8 years ago
Wierd, the expected output "GAAiHAoaChgIBBICBAQaEAAAAAoKAAAJAgIAAgIAAAE=" seems reliably produced locally and also on Debbie except for 9.6. Using 9.6 locally it's not that simple.
comment:6 by , 8 years ago
Ah, Debbie gives up on fail so what I'm seeing at Debbie are old successful builds for < 9.6.
comment:8 by , 8 years ago
Now I really need to get to know Debbie better. Must be something different to my local setup and the Travis setup. Could revert ST_AsGeobuf but if it's ok I'll leave it in while I try to reproduce the problem. If I can't reproduce/fix in a few days I'll revert.
comment:9 by , 8 years ago
One guess is that this is a protobuf related dependency problem, so I'd like to know specifically which packages/sources for protobuf stuff are being used at Debbie.
comment:10 by , 8 years ago
Might be my code but still suspect protobuf dep version. A perhaps good lead is that it looks like the problem only occurs for geometries with several linestrings/rings like the T4 and T6 tests. The T5 test passes ok.
comment:11 by , 8 years ago
Would it be possible to spit out the version as part of configure like we have for others or is that really tricky.
-------------- Dependencies -------------- GEOS config: /var/lib/jenkins/workspace/geos/rel-3.7.0devw64/bin/geos-config GEOS version: 3.7.0dev GDAL config: /var/lib/jenkins/workspace/gdal/rel-2.1w64/bin/gdal-config GDAL version: 2.1.3 SFCGAL config: /usr/bin/sfcgal-config SFCGAL version: 1.3.0 PostgreSQL config: /var/lib/jenkins/workspace/pg/rel/pg9.6w64/bin/pg_config PostgreSQL version: PostgreSQL 9.6.2 PROJ4 version: 49 Libxml2 config: /usr/bin/xml2-config Libxml2 version: 2.9.4 JSON-C support: yes protobuf-c support: yes PCRE support: yes PostGIS debug level: 0 Perl: /usr/bin/perl
I forget which apt packages I installed on debbie - it's all a fog.
I know on winnie, since I had to compile that from scratch, I used: protobuf-2.6.1 and protobuf-c-1.1.1
Though winnie is on windows, she's exhibiting the same error, so presume it must be an issue with that combination as well.
comment:12 by , 8 years ago
OOps I misspoke about winnie, forgot I did twice, and used newer second time.
So for winnie I used protobuf 3.2.0 and protobuf-c 1.2.1
output from debbie:
protoc --version libprotoc 3.0.0
How do I get the other details or is that good enough?
comment:13 by , 8 years ago
Good enough for now. I will look into getting version numbers out in the dependency summary. It's a bit messy as the C lib has a transitive dependency on the C++ lib. I've got libprotobuf-c 1.2.1 and libprotobuf 3.0.0 on my system so it seems you got the same at Debbie and I should look probably elsewhere to explain the problem.
comment:14 by , 8 years ago
It would be good to know if the unexpected output from the test (which looks like garbage) is identical for successive builds and/or between debbie and winnie. If random I'm probably writing outside past allocated memory, if identical I'm even more confused.
comment:15 by , 8 years ago
Ok so I got a new build from Debbie just now, still erroring. The output is:
geobuf .. failed (diff expected obtained: /var/lib/jenkins/workspace/postgis/tmp/2_4_pg9.6w64/test_117_diff) ----------------------------------------------------------------------------- --- geobuf_expected 2017-02-26 18:37:40.110686758 +0000 +++ /var/lib/jenkins/workspace/postgis/tmp/2_4_pg9.6w64/test_117_out 2017-02-28 20:46:04.812781870 +0000 @@ -3,9 +3,18 @@ dF9mbG9hdBgAIjoKOAoICAIaBAICAgJqBgoEdGVzdGoCGAFqAiABagUKAzEuMWoJEZqZmZmZmfE/ cgoAAAEBAgIDAwQE T3|GAAiEAoOCgwIBBoIAAAAAgIAAAE= -T4|GAAiHAoaChgIBBICBAQaEAAAAAoKAAAJAgIAAgIAAAE= +T4|GAAirwEKrAEKqQEIBBICBAQaoAH///////////8B////////////Af///////////wH///////// +//8B////////////Af///////////wH///////////8B////////////Af///////////wH///// +//////8B////////////Af///////////wH///////////8B////////////Af///////////wH/ +//////////8B T5|GAAiEAoOCgwIARoIFFA8EycTFBM= -T6|GAAiGgoYChYIAxICAwQaDhQUFBQTKFBQExMUExMT -T7|GAAiJgokCiIIBRIGAgEDAgUDGhZQUCcKMh0oRhMJACcoCR4ePCgTCQAU +T6|GAAimwEKmAEKlQEIAxICAwQajAH///////////8B////////////Af///////////wH///////// +//8B////////////Af///////////wH///////////8B////////////Af///////////wH///// +//////8B////////////Af///////////wH///////////8B////////////AQ== +T7|GAAi7wEK7AEK6QEIBRIGAgEDAgUDGtwB////////////Af///////////wH///////////8B//// +////////Af///////////wH///////////8B////////////Af///////////wH///////////8B +////////////Af///////////wH///////////8B////////////Af///////////wH///////// +//8B////////////Af///////////wH///////////8B////////////Af///////////wH///// +//////8B////////////AQ== T8|GAAiGAoWChQIBiIGCAAaAggMIggIAhoECAwGCA== T9|EAMYACILCgkKBwgAGgMCBAY= -----------------------------------------------------------------------------
comment:16 by , 8 years ago
Diff result:
< +++ /var/lib/jenkins/workspace/postgis/tmp/2_4_pg9.6w64/test_117_out 2017-02-26 18:41:54.722689185 +0000 --- > +++ /var/lib/jenkins/workspace/postgis/tmp/2_4_pg9.6w64/test_117_out 2017-02-28 20:46:04.812781870 +0000
I.e. only time differs… Should mean this is no accidental memory garbage problem. Also means I have no clue what is causing this strange output. It does not even look base64 encoded.
comment:17 by , 8 years ago
I see 2_4_pg9.6w64. I thought this was Debbie builds but perhaps I'm looking at Winnie builds here? I would be less surprised if there is something strange going on in Windows-land only.
comment:18 by , 8 years ago
Installed a clean Debian Jessie (8.7.1), protobuf 3.0.2 and protobuf-c 1.2.1 then built postgis svn-trunk and ran tests. Cannot reproduce issue.
The only diff I can spot from the logs is that Debbie has GCC 5 and PG 9.6.2 but I use Jessie defaults (GCC 4.9.2 and PG 9.4.10). On my Debian Stretch I got GCC 6.3.0 and PG 9.6.2.
We also have passing build&test at Travis and GitLab pipeline both using slightly different systems and package versions.
I'm not sure how to proceed.
comment:19 by , 8 years ago
As there seems to be no short term solution I'm considering reverting ST_AsGeobuf.
@robe, as I haven't found a way to reproduce the issue is it possible to setup debbie/winnie to build a branch which hopefully can be fixed at some point?
comment:20 by , 8 years ago
Björn Harrtell,
If you can't replicate on your end, can you disable the failing test for now. I don't think it's worthwhile reverting ST_AsGeoBuf.
Unfortunately I'm tied up for a bit, probably next two weeks, so might not be able to setup a branch before then for testing. The only thing I can think of that both Debbie and Winnie have in common is they both use PostgreSQL compiled with cassert on. I don't think the zlib part is important since I think I have on on winnie.
So I use this command on debbie looks like this to build postgres
./configure --prefix=${WORKSPACE}/rel/pg${PG_VER}w${OS_BUILD} --with-pgport=${PGPORT} --enable-cassert --enable-debug --without-zlib
enable-debug and cassert picks up issues you might not normally see with memory not being properly cleaned.
strk - any chance you can try replicating. You have cassert enabled too right. So you might be seeing the same issue.
comment:21 by , 8 years ago
One more thing. I didn't check to see, do you have cunit tests for geobuf. cunit tests are helpful for removing PostgreSQL out of the mix as a culprit.
We have even had in past where postgres cassert logic had a bug in it that we reported upstream to fix. So I wouldn't rule out an issue with upstream.
comment:22 by , 8 years ago
Unfortunately I have not been able to make cunit tests for ST_AsGeobuf (or ST_AsMVT). To me, it would significantly increase the complexity of the implementation. I know Paul would love to have cunit tests and I hate to disappoint so this has not been an easy/lazy desicion.
Thanks for the info about your build on Debbie/Winnie, I'll investigate. Memory allocation is definitely an area where I might need to check and double check the logic.
comment:23 by , 8 years ago
Reproduced the error after building and running tests with PostgreSQL 9.6.2 —enable-cassert —enable-debug.
comment:24 by , 8 years ago
Very strange. I get this wierd output on T4 which encodes a POLYGON((0 0,0 5,5 5,5 0,0 0), (1 1,1 2,2 2,2 1,1 1)). If I remove the inner ring output seems ok. If I keep the ring but force the encoding to only consider the first ring output is wierd but different than the first case. Almost as if the LWGEOM have different semantics if there are multiple rings.
comment:25 by , 8 years ago
Sounds like a memory error to me. Not easy to spot because of pooled
memory allocation (which is why we love unit tests). Try valgrinding
PostgreSQL and see what you get. It happened more than once to me that
the problem was in a PG_FREE_IF_COPY(<wrong_argument_here>)
comment:27 by , 8 years ago
I'll have to learn valgrind sometime. And get better at writing unit testable C code. In the meantime I've pinpointed the problem. Because of the nature of Geobuf I need to analyze all input geometries and encode them as the final aggregation step based on the max precision found when analyzing. So I save LWGEOM instances from lwgeom_from_gserialized in each aggregation step and it seems it the memory allocated in this process is going out of scope before the final aggregation step. I don't understand why it only manifest itself on geometries with multiple rings, and how the output can be perfectly consistent.
The problem disappears if I make a lwgeom_clone_deep before saving it in the aggregation context. This is surprising to me as it uses the exact same allocation method as the lwgeom_from_gserialized function. I'm not really happy with this fix so I'll keep looking into this issue.
It's possible I'm doing something wrong regarding the MemoryContext handling that is important in aggregration functions but I can't I understand it fully.
comment:28 by , 8 years ago
lwgeom_from_gserialized keeps references to the memory allocated by PG_DETOAST_DATUM, which *might* or might *not* allocate new memory depending on the size of the geometry (thus multi-ring vs. single-ring might make a difference). If you want to be sure to get a copy (assuming you switched the context before calling the macro) you can use PG_DETOAST_DATUM_COPY instead.
That said, I dunno if the whole context switching is done correctly as I see the transition function switching to the aggregate context and never switching back. See for comparison pgis_geometry_accum_transfn
comment:29 by , 8 years ago
It seems I can replace lwgeom_clone_deep with PG_DETOAST_DATUM_COPY.
I'm mostly confused about the MemoryContext, initially I tried to do something similar I've seen with the switching back and forth but could not make it work. There is something in the fundamental concept of it I do not understand.
comment:31 by , 8 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
Version: | 2.3.x → trunk |
Closing this now as Debbie is happy.
Okay still a problem. Started erroring at r15317