Opened 5 years ago

Last modified 16 months ago

#4684 new defect

concurrent topology construction routine may result in invalid topology

Reported by: Lars Aksel Opsahl Owned by: strk
Priority: medium Milestone: PostGIS Fund Me
Component: topology Version: 3.0.x
Keywords: reliability Cc:

Description

When using topology.TopoGeo_addLinestring we some times end with invalid topology.

It's seem to happen more often when running i multithread environment than in single threaded environment. Running the same code on the same data set may produce different results each time. The result of this are lost surface or that two surfaces are merged together as one. The edges are in most cases represented in the the edge table the data about left_face and right_face wrong. I many case we right_face and left_face are equal and different from 0, that means an dangling egde, but it's not because we can construct a Polygon based on the edges inside the face where spatially located.

There is more detailed information here. https://github.com/larsop/resolve-overlap-and-gap/issues/7

There is also an related bug here https://trac.osgeo.org/postgis/ticket/4681#ticket

Attachments (2)

db_topo_rowlevel_lock.zip (5.9 KB ) - added by Lars Aksel Opsahl 4 years ago.
A file to show how row level lock can be used to used avoid topo errors
db_topo_bug_4684.zip (111.6 KB ) - added by Lars Aksel Opsahl 4 years ago.
There will not generate any errors.

Download all attachments as: .zip

Change History (38)

comment:1 by strk, 5 years ago

First of all: it is _expected_ that concurrent processes (threads would result in backends, so processes) may end up creating an invalid topology. This is because the topology primitive tables DO NOT have proper integrity constraints ensuring topology validity. They do have partial constraints but not full-coverage of constraints. This was a design choice, initially, to avoid an excessive cost on topology population when the caller knows what he's doing.

We *might* be able to change this now, counting on the ability of caller to disable (temporary?) those constraints. Anyway this is the current situation.

The fact that you end up with different results is ALSO due to this multi-process approach. Depending on what gets into the primitive tables first determines the behaviour of the next operation.

So my question is: can you get topology.TopoGeo_addLinestring to create an invalid topology when being run by a backend having exclusive write access to the database ? If you can NOT, then this would be an INVALID ticket.

comment:2 by Lars Aksel Opsahl, 5 years ago

So far I have only tested in a multi thread environment .

If I read you correct I have to be able produce an invalid Topology by running in a single thread ?

Can I run in partly in multiple threads by doing it this way ?

1) Break the input up into many grid cell
2) Insert lines lines that are inside each cell and with min a distance like tolerance to cell border
3) Insert the rest of the lines in a single thread 

Lars

comment:3 by strk, 5 years ago

If the broken input NEVER intersects the grid cell borders then your solution *should* work, based on the fact that NO RING (and thus no face) would ever cross a cell.

Just make sure the topology is valid BEFORE step 3.

comment:4 by Lars Aksel Opsahl, 5 years ago

I have been testing a quite bit on this branch https://github.com/larsop/resolve-overlap-and-gap/tree/test_not_cut_cellborder_crossing_edges_issues_7 where I follow the the steps above and I have not been able to make any with faces with null geometries running step 3 in single thread.

So it's seems to be correct what you are saying here.

Just to confirm this I this I changed step 3 to run parallel https://github.com/larsop/resolve-overlap-and-gap/commit/57ef8b9db710da6c8c023d60b8b617f202502e4b and then I got 67 faces with null geo of 308413 faces.

I also tried to creating blocking area based on mbr in the face table https://github.com/larsop/resolve-overlap-and-gap/commit/b564bdb6d3e77cf064e11cb4418fc41ab5fcddf3 but that did not help very much.

Last edited 5 years ago by Lars Aksel Opsahl (previous) (diff)

comment:5 by strk, 5 years ago

Component: postgistopology
Owner: changed from pramsey to strk
Summary: Topology construction routine results invalid topologyconcurrent topology construction routine may result in invalid topology

There might be a way, using PostgreSQL locks, to reduce the chances (or completely avoid) that such invalidities are introduced in the topology. Serializing access to the topology primitives could be overkill becuase it would prevent the 3 steps approach above. What we want to avoid is that more than ONE transaction at the time works on the same edges. Row level locks might help with that: https://www.postgresql.org/docs/10/explicit-locking.html#LOCKING-ROWS

comment:6 by Lars Aksel Opsahl, 5 years ago

This sounds like a good idea.

So if I have a set of lines to add, how do I find the edge rows and maybe face rows to lock ?

(I have tried to block other threads based on input geometry and faces, but I never got it to work 100%, but if remove this blocking totally in step 3 https://github.com/larsop/resolve-overlap-and-gap/commit/874f843c58921c5e2fca18240aef60f15bfc6b58#diff-0ec63b6826e9888305fd67d50bf428e1 Postgres core dumps quite fast)

Last edited 5 years ago by Lars Aksel Opsahl (previous) (diff)

comment:7 by strk, 5 years ago

So if I have a set of lines to add, how do I find the edge row and maybe face rows to lock ?

Not easy to tell. When adding ONE linestring, records that will possibly be updated are:

[topo.edge_data]

  • edges that will intersect the line (easy to get)
  • edges of any face that will be affected by the insertion (hard to find)

[topo.face]

  • faces that will be split by the insertion (hard to find)

[topo.node]

  • isolated nodes that are within tolerance distance from the line

A pessimistic approach might lock:

  • EVERY FACE whos MBR intersects the input line
  • EVERY EDGE having any of those faces on its right or left side
  • EVERY ISOLATED NODE within tolerance distance from the input line

comment:8 by Lars Aksel Opsahl, 5 years ago

Thanks, I will try to test it this week.

comment:9 by Lars Aksel Opsahl, 5 years ago

I have been testing the pessimistic approach using this guide line from Sandro

EVERY FACE whos MBR intersects the input line, EVERY EDGE having any of those faces on its right or left side, EVERY ISOLATED NODE within tolerance distance from the input line ,

On this branch https://github.com/larsop/resolve-overlap-and-gap/tree/trac_osgeo_org_postgis_ticket_4684 and what ever I tried I ended up with st_getFaceGeometry that was null.

Here I https://github.com/larsop/resolve-overlap-and-gap/commit/7d71dbe9a71f099d2c4a8ec116ba49f7e749dcfb tested with row-level and no change in blocking area and we get null geometry and it seems to increase with the number of parallel threads.

Here I https://github.com/larsop/resolve-overlap-and-gap/commit/8ef48c02587c5aff0f60e4a05339ca82619f9446 extend the blocking area and I also did row level lock and that seems to generate less errors, but I still get cases where st_getFaceGeometry is null, but fewer.

It seems like more deadlocks are increasing the cases off null geometries from st_getFaceGeometry.

I will do some more checking around deadlocks and test rollback when that happens

comment:10 by Lars Aksel Opsahl, 5 years ago

Here https://github.com/larsop/resolve-overlap-and-gap/commit/dc81279b969b7da75848b1e53f79729beb746469 I did a test where I catch deadlock and does rollback, but I still got cases where st_getFaceGeometry was null.

comment:11 by Lars Aksel Opsahl, 5 years ago

When https://github.com/larsop/resolve-overlap-and-gap/commit/d6eccb1c5bd7a1345a81eea65707bb179eab6355 testing with bigger blocking area and also use this area blocking more rows in edge and face tables, I had case where I got no deadlocks , but still got cases where st_getFaceGeometry was null.

comment:12 by strk, 4 years ago

It looks like my pessimistic approach was not pessimistic enough, as it forgot to deal with the hard step of finding:

  • edges of any face that will be affected by the insertion (hard to find)

Let's consider this simple starting condition:

  • Topology has a single face, bound by a single closed edge
  • A second edge goes around the existing face but is NOT closed
             ____
            /    \
           /      \
          /  o     \ E2
         /  / \     \
        /  /   \ E1  \
       /  /     \     \
      /  /  F1   \     \
     /  /         \     \
    /   `---------'      \  
   /                 F0   \
  /                        \
  `----o          o--------'

Now, two separate transactions want to add each a new edge. Let's call them E3 and E4:

             ____
            /    \
           /      \
          /  o     \ E2
         /  / \     \
        /  /   \ E1  \
       /  /     \     \
      /  /  F1   \     \
     /  /         \     \
    /   `---------'      \ 
   /                 F0   \
  /                        \
  `----o---(E3)---o--------'
       |          |
       `---(E4)---'

Both transactions will want to update E2 labeling and edge linking but they will be conflicting. Shall E2 be connected to E3 or E4 ? Whichever transaction is committed last will decide, leaving effectively the topology in an invalid state.

Scenarios like the above should be used for testing row-level locking. Eventually we might want to implement row-level locking in PostGIS Topology core.

How to solve the above ? We need to ALSO lock E2, but neither E3 nor E4 overlap any faces bound by E2, so the recipe given before won't catch this case.

E3 intersects E2 though, so that's another thing to check:

  1. Lock all edges intersecting the incoming input line
  2. Lock all faces found on both sides of the above edges
  3. Lock all edges having any of the above faces on their side

We need manageable data to make this ticket workable so please if you have a way use scenarios like this one.

Note I filed https://github.com/larsop/resolve-overlap-and-gap/issues/16 to discuss the code you're working on to try these locking strategies.

by Lars Aksel Opsahl, 4 years ago

Attachment: db_topo_rowlevel_lock.zip added

A file to show how row level lock can be used to used avoid topo errors

comment:13 by Lars Aksel Opsahl, 4 years ago

A test case showing how to use row level lock to avoid Topology errors .

Info from the db_topo_rowlevel_lock_readme.txt in file db_topo_rowlevel_lock.zip.

# Create a empty database

psql postgres -c'create database t1'

# Add the code needed for testing

t1 -f db_topo_rowlevel_lock_init.sql 

# Test with no rowlevel locking, should return something like this

psql t1 -f db_topo_rowlevel_lock_test_fail.sql 
 ?column?  |      error       | id1 | id2 
------------+------------------+-----+-----
 validation | face within face |   1 |   2
 validation | face within face |   2 |   1
 validation | face within face |   1 |   3
 validation | face within face |   2 |   3
(4 rows)

# Test with rowlevel locking , no topop errors should be found

psql t1 -f db_topo_rowlevel_lock_test_ok.sql 

Last edited 4 years ago by Lars Aksel Opsahl (previous) (diff)

comment:14 by Lars Aksel Opsahl, 4 years ago

https://trac.osgeo.org/postgis/attachment/ticket/4684/db_topo_bug_4684.zip was not showing bug in Postgis Topology bug it was bug was related to missing primary key , I have now been running 60 interations with no errors. The new correct SQL is uploaded.

So the only problem we here are when multiple threads working with related spatial areas, which are expected.

Last edited 4 years ago by Lars Aksel Opsahl (previous) (diff)

by Lars Aksel Opsahl, 4 years ago

Attachment: db_topo_bug_4684.zip added

There will not generate any errors.

comment:15 by Algunenano, 4 years ago

Milestone: PostGIS 3.0.2

comment:16 by robe, 4 years ago

Milestone: PostGIS 3.0.2PostGIS 3.0.3

comment:17 by robe, 4 years ago

Milestone: PostGIS 3.0.3PostGIS 3.0.4

comment:18 by strk, 4 years ago

While thinking about how to include row-level locking in core I found another case of parallel insertion of objects creating invalid topology which cannot be resolved by row-level locking primitive tables. It is the case of parallel insertion of a face and an isolated node or line insode the face created by the other process. Simplest scenario:

  • Process1 inserts a closed line in the empty topology
  • Process2 inserts a point inside the closed line (non visible by its transaction)

Using SQL:

P0: select createtopology('isoinface'); P1: begin; select topogeo_addpoint('isoinface', 'POINT(5 5)'); P2: begin: select topogeo_addlinestring('isoinface', 'LINESTRING(0 0,0 10,10 10,10 0,0 0)'); P1: commit; P2: commit;

At the end of the above process, we'll have th isoinface topology containing 1 face, 1 edge and 2 nodes. The ISOLATED node (the one inserted by P1) will have a wrong containing_face value:

=# select n.node_id, st_astext(n.geom) g, n.containing_face, 
   f.face_id computed_containing_face 
   FROM isoinface.node n
   left outer join isoinface.face f 
   ON (f.face_id != 0 and ST_Contains(ST_GetFaceGeometry('isoinface', f.face_id), n.geom));
 node_id |     g      | containing_face | computed_containing_face 
---------+------------+-----------------+--------------------------
       1 | POINT(5 5) |               0 |                        1
       2 | POINT(0 0) |                 |                         
(2 rows)

This kind of invalidity is not currently detected by ValidateTopology function (see #3233)

Version 0, edited 4 years ago by strk (next)

comment:19 by strk, 4 years ago

Same kind of invalidity (wrong containment info) happens for isolated edges:

Using SQL:

  • P0: select createtopology('isoinface');
  • P1: begin; select topogeo_addpoint('isoinface', 'LINESTRING(4 5, 6 5)');
  • P2: begin: select topogeo_addlinestring('isoinface', 'LINESTRING(0 0,0 10,10 10,10 0,0 0)');
  • P1: commit;
  • P2: commit;

At the end of the above process, we'll have th isoinface topology containing 1 face, 2 edge and 3 nodes. The ISOLATED edge (the one inserted by P1) will have a wrong left_face and right_face values:

=# select e.edge_id, st_astext(e.geom) g, e.left_face, e.right_face, 
   f.face_id as containing_face
   from isoinface.edge e left outer join isoinface.face f
   on (f.face_id != 0 and ST_Contains(ST_GetFaceGeometry('isoinface', f.face_id), e.geom));
 edge_id |                  g                  | left_face | right_face | containing_face 
---------+-------------------------------------+-----------+------------+-----------------
       2 | LINESTRING(0 0,0 10,10 10,10 0,0 0) |         0 |          2 |                
       3 | LINESTRING(4 4,6 6)                 |         0 |          0 |               2
(2 rows)

This kind of invalidity is also not currently detected by ValidateTopology function (see #4830)

comment:20 by strk, 4 years ago

For the two cases above there's no row to be locked because the problem is with rows yet to be visible by the other process. The only way to ensure objects in an area would be visible by other transaction would be to acquire a lock on an *area*, reguardless of objects already existing in the area. We could think of a special "spatial_locks" table to be created in each topology where each process would insert locks for "areas", obtainable only for areas not intersecting any other already locked area.

comment:21 by strk, 4 years ago

If we go the "locking area" approach, the problem of figuring how the extent of the area to be locked remains, as such area would be affected by existing primitives in the topology (ie: adding a small edge closing the last gap in a large ring)

comment:22 by Lars Aksel Opsahl, 4 years ago

I had big problems to get spatial locks that worked with out errors and OK performance so that's way I ended up using grid on grid where I added lines crossing cell borders in a controlled way by merging grid cells and at end I merged the last cells in a single thread. But this solution where add cell crossing lines as final step only works when pumping data into a empty layer. When doing locks I tried use a meta table where each thread updated the spatial to be effected,

Yes so I agree with you that we need a kind of spatial lock for updating.

comment:23 by strk, 4 years ago

In order to have a clearer picture of what's going wrong, I'v efixed #3233 (ValidateTopology check on node's containing_face) - This only affects topologies with isolated nodes, so not necessarely something you can test, but you could maybe run some performance comparison as I bet ValidateTopology will now be much slower.

comment:24 by strk, 4 years ago

I've now also committed a fix for #4830 (ValidateTopology check on edge left/right faces)

comment:25 by strk, 3 years ago

I'm thinking for a start we could make it so that every routine possibly editing the topology would lock the topology.topology record for UPDATE. It would be the safest bet for a start.

comment:26 by strk, 3 years ago

Merge request https://gitlab.com/postgis/postgis/-/merge_requests/49 stubs a test for one case of concurrent editing topology corruption

comment:27 by strk, 3 years ago

One thing to consider would be to expose an explicit topology locking routine, so that users who know they are going to perform multiple edits in a single transaction could obtain the lock once, rather than relying on internal locking, which would need to be checked once for each and every single edit (consider adding a single line may trigger multiple inserts and updates)

comment:28 by strk, 3 years ago

I've pushed a script to see the effects of concurrent topology population. Interesting, in some cases you also get deadlocks, like in this output:

$ ./check_multiprocess_topology_population.sh 
ERROR:  SQL/MM Spatial exception - geometry crosses edge 1
ERROR:  Edge changed disposition around start node 1
ERROR:  Side-location conflict: new edge starts in face 1 and ends in face 2
ERROR:  SQL/MM Spatial exception - geometry crosses edge 5
ERROR:  deadlock detected
DETAIL:  Process 174600 waits for ShareLock on transaction 3178504; blocked by process 174602.
Process 174602 waits for ShareLock on transaction 3178502; blocked by process 174600.
HINT:  See server log for query details.
CONTEXT:  while updating tuple (1,3) in relation "edge_data"
SQL statement "UPDATE "mt_topo".edge_data SET next_right_edge= -16, abs_next_right_edge= 16 WHERE start_node = 5 AND next_right_edge= -5 AND  abs_next_right_edge= 5 AND edge_id != 16"
Topology mt_topo (id 697, SRID 0, precision 0)
11 nodes, 17 edges, 9 faces, 0 topogeoms in 0 layers

comment:29 by strk, 3 years ago

Merge request https://gitlab.com/postgis/postgis/-/merge_requests/51 takes the approach of advisory locks

comment:30 by strk, 3 years ago

It looks like another effective approach to prevent invalidities would be to ensure every transaction possibly modifying a topology runs with SERIALIZABLE TRANSACTION ISOLATION LEVEL. In that case it looks like all possible invalidities will be prevented by a serialization_failure exception.

comment:31 by strk, 3 years ago

I just learnt that SERIALIZABLE tracking is "pessimistic", that is it may give false conflicts (error out more than needed). This may result in actually slower performance than forced serialization via locks. A test running 16 concurrent transactions each adding a polygon to a topology completes in 1 second when serialized while can take 10 times as much when using SERIALIZABLE TRANSACTION ISOLATION LEVEL and re-trying after sleeping for a random amount of time up to 1 second. Some transactions are re-attempted up to 10 times before succeeding:

[strk@c19:/usr/local/src/postgis/postgis(topology-advisory-lock)] time topology/test/check_multiprocess_topology_population.sh 2>&1 | grep WARNING
WARNING:  15917 succeeded on attempt 1
WARNING:  15950 succeeded on attempt 1
WARNING:  16045 succeeded on attempt 1
WARNING:  16046 succeeded on attempt 2
WARNING:  16025 succeeded on attempt 3
WARNING:  16040 succeeded on attempt 3
WARNING:  15940 succeeded on attempt 4
WARNING:  15994 succeeded on attempt 4
WARNING:  16000 succeeded on attempt 4
WARNING:  16005 succeeded on attempt 6
WARNING:  16011 succeeded on attempt 5
WARNING:  16044 succeeded on attempt 6
WARNING:  15924 succeeded on attempt 6
WARNING:  15999 succeeded on attempt 8
WARNING:  15996 succeeded on attempt 9
WARNING:  16028 succeeded on attempt 9

real    0m5.392s
user    0m0.729s
sys     0m0.168s

comment:32 by robe, 3 years ago

strk you have plans to do anything with this in the next couple of days or should I push to next milestone?

comment:33 by strk, 3 years ago

Milestone: PostGIS 3.0.4PostGIS 3.3.0

Pushed forward. At this stage I think the best thing to do would be just documenting the risks of concurrently invoking topology editing functions.

comment:34 by strk, 3 years ago

Keywords: reliability added

comment:35 by robe, 2 years ago

Milestone: PostGIS 3.3.0PostGIS 3.4.0

comment:36 by strk, 16 months ago

Milestone: PostGIS 3.4.0PostGIS Fund Me
Note: See TracTickets for help on using tickets.