Opened 11 months ago

Closed 10 months ago

Last modified 10 months ago

#5667 closed enhancement (fixed)

Performance Postgis Topology when adding million lines by using TopoGeo_AddLineString

Reported by: Lars Aksel Opsahl Owned by: strk
Priority: medium Milestone: PostGIS 3.5.0
Component: topology Version: 3.4.x
Keywords: performance Cc: Lars Aksel Opsahl

Description

Is is difficult to add new method for TopoGeo_AddLineString(

TopoGeo_AddLineString(varchar atopology, geometry aline, float8 tolerance)

with new method like this

TopoGeo_AddLineString(varchar atopology, geometry[] lines, float8 tolerance)

so we can add multiple lines in a single call ?

Can a method like this make it easier to increase performance and reduce the number off calls like the one below

total_min       |        avg_ms         |   calls   |                                                                                                                                                                                                                                                        query  
8.63888982881786 |  0.009448244050939514 |  54860288 | SELECT 1 FROM ONLY "gronn_test_03_test_topo"."face" x WHERE "face_id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x
9.254862599314976 |  0.010973764762885281 |  50601755 | SELECT 1 FROM ONLY "gronn_test_03_test_topo"."node" x WHERE "node_id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x
16.16802178720142 |   0.01995163239587184 |  48621651 | SELECT id,srid,precision,null::geometry FROM topology.topology WHERE name = $1::varchar

Change History (26)

comment:1 by strk, 11 months ago

How many calls to TopoGeo_AddLineString resulted in over 48 million selects to the topology.topology table ? That query comes from the C topology library that takes care of loading the topology. I guess some cache could be put in place to avoid those repeated calls.

comment:2 by strk, 11 months ago

Keywords: performance added

I agree about having more batch-loading functions, in general. Such function could be taking a generic geometry and was indeed forseen already: https://git.osgeo.org/gitea/postgis/postgis/src/branch/master/topology/sql/populate.sql.in#L725

It was planned at least since 2.0.0: https://git.osgeo.org/gitea/postgis/postgis/src/tag/2.0.0/topology/sql/populate.sql.in.c#L1057

In the planned form it had no return, while the implemented functions each return a set of identifiers of primitives corresponding to the incoming lines. Returning primitives helps in building TopoGeometry objects.

Your idea of taking an array of lines might help with matching back such a sets of primitives to the input geometry they would belong to, so maybe the return code of your suggested signature could be a TopoElementArray[]

comment:3 by Lars Aksel Opsahl, 11 months ago

Keywords: performance removed

That was not many I think we have a ratio off about 70 call for each call line add latest time i checked. I the C code it seems like that info could be added as parameter so it could be one to one .

in reply to:  2 comment:4 by Lars Aksel Opsahl, 11 months ago

Replying to strk:

Your idea of taking an array of lines might help with matching back such a sets of primitives to the input geometry they would belong to, so maybe the return code of your suggested signature could be a TopoElementArray[]

Yes that would be perfect.

in reply to:  3 comment:5 by Lars Aksel Opsahl, 11 months ago

Replying to Lars Aksel Opsahl:

That was not many I think we have a ratio off about 70 call for each call line add latest time i checked. I the C code it seems like that info could be added as parameter so it could be one to one .

I may have been wrong there with ratio, maybe more 1 to 20

total_min       |        avg_ms         |   calls   | query  
78.72177638407608 |   0.02124369923476904 | 222339176 | SELECT id,srid,precision,$2::geometry FROM topology.topology WHERE name = $1::varchar
39.22709295360309 |  0.012033813981130253 | 195584341 | SELECT $2 FROM ONLY "gronn_test_03_test_topo"."face" x WHERE "face_id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x
41.32200506109425 |   0.01378028794816383 | 179917888 | SELECT $2 FROM ONLY "gronn_test_03_test_topo"."node" x WHERE "node_id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x
4910.569903556195 |    26.160878027420043 |  11262397 | SELECT ARRAY(SELECT topology.TopoGeo_addLinestring(_topology_name, new_line, _snap_tolerance))


comment:6 by strk, 11 months ago

20 queries of topology.topology every single TopoGeo_addLineString is unexpected, I cannot reproduce it with POSTGIS="3.5.0dev 3.4.0rc1-926-g799a63a81"

Also the query in your first report started with:

SELECT id,srid,precision,null::geometry

while the second started with:

SELECT id,srid,precision,$2::geometry

The first one I recognize as coming from the cb_loadTopologyByName C function which is the one I see called once per TopoGeo_addLineString call. In the first report we had 1.12 calls of that one for every call of TopoGeo_addLineString.

The second one I cannot find in the source tree of current PostGIS master branch (git grep 'id,srid,precision') so chances are it is coming from client code and would still be there if PostGIS implemented an internal cache to bring down those PostGIS queries down to 1 per statement.

What is the full version of postgis you are using ?

comment:7 by Lars Aksel Opsahl, 11 months ago

 select version();
                                                               version                                                               
-------------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 12.6 (Ubuntu 12.6-0ubuntu0.20.04.1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, 64-bit
(1 row)

vroom5.int.nibio.no postgres@rog_01=# SELECT PostGIS_Full_Version();
                                                                                                                                                                   postgis_full_version                                                                                                                                                                    
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 POSTGIS="3.4.0 0874ea3" [EXTENSION] PGSQL="120" GEOS="3.12.1-CAPI-1.18.1" (compiled against GEOS 3.10.2) SFCGAL="1.3.7" PROJ="8.2.0 NETWORK_ENABLED=OFF URL_ENDPOINT=https://cdn.proj.org USER_WRITABLE_DIRECTORY=/tmp/proj DATABASE_PATH=/usr/share/proj/proj.db" LIBXML="2.9.10" LIBJSON="0.13.1" LIBPROTOBUF="1.3.3" WAGYU="0.5.0 (Internal)" TOPOLOGY
(1 row)

comment:8 by strk, 11 months ago

I cannot find the second query in the source tree of PostGIS-3.4.0 0874ea3 either:

[strk@c23:/usr/local/src/postgis/postgis-3.4((HEAD detached at 3.4.0))] git rev-parse HEAD; git grep 'srid,precision'
0874ea342af5392e3cd9f4e6157ef08648c9d2d8
topology/postgis_topology.c:    "SELECT id,srid,precision,null::geometry "
topology/test/regress/copytopology.sql:SELECT srid,precision FROM topology.topology WHERE name = 'CITY_data_UP_down';
topology/test/regress/createtopology.sql:SELECT name,srid,precision,hasz from topology.topology
topology/test/regress/topologysummary.sql:INSERT INTO topology.topology (id,name,srid,precision,hasz)

Are you sure that query is not coming from code external to PostGIS ? I'm trying to reproduce the problem of multiple queries to topology.topology statements being triggered by a single call to TopoGeo_addLineString but I've not been able so far.

in reply to:  8 ; comment:9 by Lars Aksel Opsahl, 11 months ago

Replying to strk:

I cannot find the second query in the source tree of PostGIS-3.4.0 0874ea3

You mean this ?

SELECT $2 FROM ONLY "gronn_test_03_test_topo"."face" x WHERE "face_id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x

Yes this is a call triggered by the system before doing a update and not directly issued by any user call if I understand correctly.

Last edited 11 months ago by Lars Aksel Opsahl (previous) (diff)

in reply to:  9 ; comment:10 by strk, 11 months ago

I mean this:

SELECT id,srid,precision,$2::geometry FROM topology.topology WHERE name = $1::varchar

in reply to:  10 comment:11 by Lars Aksel Opsahl, 11 months ago

Replying to strk:

I mean this:

SELECT id,srid,precision,$2::geometry FROM topology.topology WHERE name = $1::varchar

I thought this come from the below code but I see that's is different

"SELECT id,srid,precision,null::geometry "
    "FROM topology.topology WHERE name = $1::varchar";

And I can not find any code like this i may own code.

comment:12 by strk, 11 months ago

For the record: I've filed #5668 to implement a version of TopoGeo_addLineString taking a topology.topology record.

I'd still look for the query having a parameter for the geometry field so a downstream ticket would be useful for me to look at, or at least reproduce.

comment:13 by strk, 11 months ago

Still for the record: I've filed #5669 to implement a generic batch-loading function

comment:14 by strk, 11 months ago

I believe there is a 1:1 ratio between calls to TopoGeo_addLinestring and queries to topology.topology table. That same query also serves the purpose of fetching the OID of the geometry type, which is why the null::geometry part is in there.

Implementation of a cache would result in a single query per session, or per transaction, while taking a topology.topology record would still miss the geometry oid extraction that would then need to be still performed on a 1:1 ratio.

For this reason I'm thinking the best performance improvement here would be to implement a cache or the batch-loading function which would have similar effects. The cache would not require any change in the client code, which makes the performance improvement transparent to the users.

comment:15 by strk, 11 months ago

I've configured my PostgreSQL-14 cluster to preload the pg_stat_statements module and created the pg_stat_statements extension in a test database. Then I've created a topology with a single segment edge, reset the stats with SELECT pg_stat_statements_reset() and then used TopoGeo_addLineString to add a single-segment line crossing the existing line.

These are the statements (including nested) resulting form that simple operation:

 n  |                                                                                                              query                                                                                                               
----+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 11 | SELECT $2 FROM ONLY "t"."node" x WHERE "node_id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x
  2 | SELECT edge_id,start_node,end_node,left_face,right_face,geom FROM "t".edge_data ORDER BY geom <-> $1 ASC LIMIT $2
 12 | SELECT $2 FROM ONLY "t"."face" x WHERE "face_id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x
  1 | SELECT edge_id,start_node,end_node,left_face,right_face,next_left_edge,next_right_edge,geom FROM "t".edge_data WHERE edge_id IN ($1)
  3 | INSERT INTO "t".node (node_id,containing_face,geom) VALUES (DEFAULT,$1,$2::geometry) RETURNING node_id
  1 | UPDATE "t".edge_data SET end_node= $1,next_left_edge= $2, abs_next_left_edge= $3,geom=$4::geometry WHERE edge_id = $5
  3 | SELECT node_id,containing_face,geom FROM "t".node WHERE geom && $1::geometry
  3 | SELECT edge_id,geom FROM "t".edge_data WHERE ST_DWithin($1::geometry, geom, $2)
  3 | SELECT EXISTS ( SELECT $1 FROM "t".node WHERE ST_Equals(geom, $2::geometry))
  3 | SELECT nextval($1)
  3 | SELECT edge_id,start_node,end_node,left_face,right_face,next_left_edge,next_right_edge,geom FROM "t".edge WHERE geom && $1::geometry
  1 | SELECT r.element_id, r.topogeo_id, r.layer_id, r.element_type FROM "t".relation r , topology.layer l WHERE l.topology_id = $1 AND l.level = $2 AND l.layer_id = r.layer_id AND r.element_id IN ( $3, $4) AND r.element_type = $5
  3 | INSERT INTO "t".edge_data (edge_id,start_node,end_node,left_face,right_face,next_left_edge, abs_next_left_edge,next_right_edge, abs_next_right_edge,geom) VALUES ($1,$2,$3,$4,$5,$6,$7,$8,$9,$10::geometry)
  2 | SELECT EXISTS ( SELECT $1 FROM "t".edge_data WHERE ST_Within($2::geometry, geom))
  5 | SELECT edge_id,start_node,end_node,left_face,right_face,next_left_edge,next_right_edge,geom FROM "t".edge_data WHERE start_node IN ($1) OR end_node IN ($2)
  4 | SELECT node_id,geom FROM "t".node WHERE ST_DWithin(geom, $1::geometry, $2)
  1 | UPDATE "t".edge_data SET next_right_edge= $1, abs_next_right_edge= $2 WHERE start_node = $3 AND next_right_edge= $4 AND  abs_next_right_edge= $5 AND edge_id != $6
  4 | SELECT edge_id,geom FROM "t".edge WHERE geom && $1::geometry
  1 | select topogeo_addlinestring($1, $2)
  8 | SELECT $2 FROM ONLY "t"."edge_data" x WHERE "edge_id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x
  2 | UPDATE "t".node SET containing_face =$1::int WHERE node_id = $2
  1 | UPDATE "t".edge_data SET next_left_edge= $1, abs_next_left_edge= $2 WHERE end_node= $3 AND next_left_edge= $4 AND  abs_next_left_edge= $5 AND edge_id != $6
  2 | SELECT node_id,containing_face,geom FROM "t".node WHERE node_id IN ($1,$2)
  1 | SELECT id,srid,precision,null::geometry FROM topology.topology WHERE name = $1::varchar
  1 | UPDATE "t".edge_data SET next_right_edge= $1, abs_next_right_edge= $2 WHERE edge_id = $3
  1 | UPDATE "t".edge_data SET next_left_edge= $1, abs_next_left_edge= $2 WHERE edge_id = $3
(26 rows)

The query I've used to extract the information:

select calls n, query from pg_stat_statements where dbid = ( select oid from pg_database where datname = current_database() ) and query not like '%pg_stat_statements%';

comment:16 by strk, 11 months ago

There is a topology/test/perf/TopoGeo_addLinestring.sql script that was written specifically to analyze performance of that function.

That script calls TopoGeo_addLinestring 30752 times and there are exactly 30752 queries to the topology.topology table (so ratio of 1/1).

I should also mention that the time cost of the topology.topology queries sums up to 142ms over the 44212ms of the statement resulting in those 30752 linestring additions, so we are talking about the 0.3% of the total operation. There are bigger fishes to tackle.

Biggest fish confirms being the query that looks for newly formed ringsas shown by a query to pg_stat_statements ordered by total execution time:

   n    | tot_ms | avg_ms |                                                            q
--------+--------+--------+-------------------------------------------------------------------------------------------------------------------------
      1 |  44212 |  44212 | SELECT count(*) FROM ( SELECT topology.TopoGeo_addLinestring($1, g) FROM topoperf.case_full_coverage_no_holes ) foo
   2883 |   3480 |      1 | WITH RECURSIVE edgering AS ( SELECT $1 as signed_edge_id, edge_id, next_left_edge, next_right_edge FROM "topoperf".edge
 141824 |   1818 |      0 | SELECT $2 FROM ONLY "topoperf"."face" x WHERE "face_id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x
  46624 |   1783 |      0 | SELECT edge_id,start_node,end_node,left_face,right_face,next_left_edge,next_right_edge,geom FROM "topoperf".edge WHERE
    900 |   1688 |      2 | WITH newedges(edge_id,left_face) AS ( VALUES ($1,$2),($3,$4),($5,$6),($7,$8),($9,$10),($11,$12),($13,$14),($15,$16),($1
  46624 |   1557 |      0 | SELECT edge_id,geom FROM "topoperf".edge WHERE geom && $1::geometry
  46624 |   1427 |      0 | SELECT node_id,containing_face,geom FROM "topoperf".node WHERE geom && $1::geometry
    900 |   1147 |      1 | WITH newedges(edge_id,right_face) AS ( VALUES ($1,$2),($3,$4),($5,$6),($7,$8),($9,$10),($11,$12),($13,$14),($15,$16),($
  15870 |   1144 |      0 | UPDATE "topoperf".edge_data SET next_left_edge= $1, abs_next_left_edge= $2 WHERE edge_id = $3
  61504 |   1081 |      0 | SELECT node_id,geom FROM "topoperf".node WHERE ST_DWithin(geom, $1::geometry, $2)
  15872 |   1036 |      0 | INSERT INTO "topoperf".edge_data (edge_id,start_node,end_node,left_face,right_face,next_left_edge, abs_next_left_edge,n
    961 |    585 |      1 | SELECT edge_id,left_face,right_face,geom FROM "topoperf".edge_data WHERE ( left_face = ANY($1) OR right_face = ANY ($1)
  14912 |    542 |      0 | SELECT edge_id,start_node,end_node,left_face,right_face,geom FROM "topoperf".edge_data ORDER BY geom <-> $1 ASC LIMIT $
  14912 |    475 |      0 | INSERT INTO "topoperf".node (node_id,containing_face,geom) VALUES (DEFAULT,$1,$2::geometry) RETURNING node_id
  14912 |    386 |      0 | SELECT EXISTS ( SELECT $1 FROM "topoperf".edge_data WHERE ST_Within($2::geometry, geom))
 126912 |    338 |      0 | SELECT $2 FROM ONLY "topoperf"."node" x WHERE "node_id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x
  14912 |    322 |      0 | UPDATE "topoperf".node SET containing_face =$1::int WHERE node_id = $2
  14912 |    311 |      0 | SELECT EXISTS ( SELECT $1 FROM "topoperf".node WHERE ST_Equals(geom, $2::geometry))
  46654 |    248 |      0 | SELECT edge_id,start_node,end_node,left_face,right_face,next_left_edge,next_right_edge,geom FROM "topoperf".edge_data W
  14912 |    220 |      0 | SELECT edge_id,geom FROM "topoperf".edge_data WHERE ST_DWithin($1::geometry, geom, $2)
      1 |    170 |    170 | CREATE TABLE topoperf.case_full_coverage_no_holes AS SELECT ST_Subdivide(ST_Segmentize(ST_Boundary(geom), 0.5), 5) g FR
  30752 |    142 |      0 | SELECT id,srid,precision,$2::geometry FROM topology.topology WHERE name = $1::varchar

comment:17 by strk, 11 months ago

Speaking of the big fish (finding newly create edge rings) I would mention that some internal function to exist to allow postponing this step to after all linework is noded and is found to be an effective way to speed-up batch topology creation. See https://git.osgeo.org/gitea/postgis/postgis/pulls/28#issuecomment-12359

comment:18 by strk, 11 months ago

Keywords: performance added
Summary: Performance Postgis Topolgy when adding millions by using TopoGeo_AddLineStringPerformance Postgis Topology when adding million lines by using TopoGeo_AddLineString

comment:19 by strk, 11 months ago

The most recent work on postponing face detection to after all lines are loaded is in ST_CreateTopoGeo, via #5670 - that work requires computing *all* faces at the end, while for the case at hand (adding a bunch of lines into an existing topology) we would probably want to only compute new faces within a given bounding box, which is ticketed as #917

comment:20 by strk, 11 months ago

The for key share of x command is related to foreign key constraints, see https://dba.stackexchange.com/questions/271167/what-does-key-share-of-in-a-postgresql-statement-mean

comment:21 by strk, 11 months ago

Initial draft of a TopoGeo_AddGeometry(topo, geom) function returning void is here: https://git.osgeo.org/gitea/postgis/postgis/pulls/178

comment:22 by strk, 11 months ago

For the sake of this ticket a version of the function taking ONLY lines (multilinestring, for example) would be preferred, to further get performance improvement opportunities.

What I'm not fully happy with is the API as the TopoGeo_addXXX family of functions all return the identifiers of the primitives added by the function while we mentioned we might save on that cost when adding million of lines…

comment:23 by strk, 10 months ago

I've settled for void TopoGeo_LoadGeometry for now so we have a naming convention to distinguish between functions that return primitive identifiers and functions that do not. See https://git.osgeo.org/gitea/postgis/postgis/pulls/190

comment:24 by Lars Aksel Opsahl, 10 months ago

One issue with void , is that we in some cases use the edge id list to compute area created and then remove edges that form face below given limit.

This may be done other ways also since we know the spatial lines that are added.

Last edited 10 months ago by Lars Aksel Opsahl (previous) (diff)

comment:25 by Sandro Santilli <strk@…>, 10 months ago

Resolution: fixed
Status: newclosed

In 1284f16/git:

Add TopoGeo_LoadGeometry function

Includes regress test and documentation

Closes #5667

comment:26 by strk, 10 months ago

Milestone: PostGIS 3.5.0
Note: See TracTickets for help on using tickets.