Opened 4 years ago

Closed 2 years ago

#4720 closed defect (fixed)

Memory leak when using Geography

Reported by: kthujvu Owned by: Algunenano
Priority: medium Milestone: PostGIS PostgreSQL
Component: postgis Version: 3.0.x
Keywords: geography memory Cc:

Description

A query with a single-line subquery involving Geography for each row ends up OOMing.

I loaded https://www.naturalearthdata.com/http//www.naturalearthdata.com/download/110m/cultural/ne_110m_admin_0_countries.zip using Geography using shp2pgsql -c -D -i -I -G ne_110m_admin_0_countries.shp > ne110.sql, then psql ... -f ne110.sql. There is an spatial index ne_110m_admin_0_countries_geog_idx.

In the queries below I create a grid of points using WITH, the problem also occurs if I have that data in a static table.

This query is eating all memory on my system:

WITH latitudes AS (
	SELECT generate_series AS latitude
	FROM generate_series(-90, 90, 0.1)
), longitudes AS (
	SELECT generate_series AS longitude
	FROM generate_series(-180, 180, 0.1)
), points AS (
	SELECT ST_SetSRID(ST_Point(longitude, latitude), 4326)::geography AS geog
	FROM latitudes
	CROSS JOIN longitudes
)
SELECT
	geog,
	(
		SELECT name
		FROM ne_110m_admin_0_countries AS ne
		ORDER BY p.geog <-> ne.geog
		LIMIT 1
	)
FROM points AS p
;

After 10 minutes: >2G and growing about 8M per second

This does not happen if I use Geometry for the points and cast the Geography data in the NE table to Geometry:

WITH latitudes AS (
	SELECT generate_series AS latitude
	FROM generate_series(-90, 90, 0.1)
), longitudes AS (
	SELECT generate_series AS longitude
	FROM generate_series(-180, 180, 0.1)
), points AS (
	SELECT ST_SetSRID(ST_Point(longitude, latitude), 4326) AS geom
	FROM latitudes
	CROSS JOIN longitudes
)
SELECT
	geom,
	(
		SELECT name
		FROM ne_110m_admin_0_countries AS ne
		ORDER BY p.geom <-> ne.geog::geometry
		LIMIT 1
	)
FROM points AS p
;

After 10 minutes: stable 107M

This does happen when I use the existing Geography data and Geometry for the point grid:

WITH latitudes AS (
	SELECT generate_series AS latitude
	FROM generate_series(-90, 90, 0.1)
), longitudes AS (
	SELECT generate_series AS longitude
	FROM generate_series(-180, 180, 0.1)
), points AS (
	SELECT ST_SetSRID(ST_Point(longitude, latitude), 4326) AS geom
	FROM latitudes
	CROSS JOIN longitudes
)
SELECT
	geom,
	(
		SELECT name
		FROM ne_110m_admin_0_countries AS ne
		ORDER BY p.geom <-> ne.geog
		LIMIT 1
	)
FROM points AS p
;

After ~5 minutes: >500MB and growing

It does not happen when I cast the existing Geography to Geometry and use Geography for the point data:

WITH latitudes AS (
	SELECT generate_series AS latitude
	FROM generate_series(-90, 90, 0.1)
), longitudes AS (
	SELECT generate_series AS longitude
	FROM generate_series(-180, 180, 0.1)
), points AS (
	SELECT ST_SetSRID(ST_Point(longitude, latitude), 4326)::geography AS geog
	FROM latitudes
	CROSS JOIN longitudes
)
SELECT
	geog,
	(
		SELECT name
		FROM ne_110m_admin_0_countries AS ne
		ORDER BY p.geog <-> ne.geog::geometry
		LIMIT 1
	)
FROM points AS p
;

After 10 minutes: stable 103M

It does not happen if I use the volatile data for both sides of the spatial query:

WITH latitudes AS (
	SELECT generate_series AS latitude
	FROM generate_series(-90, 90, 0.1)
), longitudes AS (
	SELECT generate_series AS longitude
	FROM generate_series(-180, 180, 0.1)
), points AS (
	SELECT ST_SetSRID(ST_Point(longitude, latitude), 4326)::geography AS geog
	FROM latitudes
	CROSS JOIN longitudes
)
SELECT
	geog,
	(
		SELECT geog
		FROM points AS p2
		ORDER BY p.geog <-> p2.geog
		LIMIT 1
	)
FROM points AS p
;

After 10 minutes: stable 110M


PostgreSQL 12.2 (Debian 12.2-2.pgdg100+1) on x86_64-pc-linux-gnu, compiled by gc c (Debian 8.3.0-6) 8.3.0, 64-bit

POSTGIS="3.0.1 ec2a9aa" [EXTENSION] PGSQL="120" GEOS="3.7.1-CAPI-1.11.1 27a5e771 " PROJ="Rel. 5.2.0, September 15th, 2018" GDAL="GDAL 2.4.0, released 2018/12/14" LIBXML="2.9.4" LIBJSON="0.12.1" LIBPROTOBUF="1.3.1" WAGYU="0.4.3 (Internal)" TOPO LOGY RASTER

Change History (20)

comment:1 by pramsey, 4 years ago

I'm afraid I'm not seeing this effect… at least not at anything remotely the level you describe. My memory use under both PostGIS 3.0 and 3.1 (master) is climbing only veeeery slowly (14M usage after 4 mintes). It *might* be a leak, but it's not the gusher you're seeing. Only difference in my test is using the 50m dataset instead of 110m. I will attempt with your exact data later.

comment:2 by kthujvu, 4 years ago

I also cannot reproduce it with the 50m dataset.

Here is the full "test suite" I used, uses docker (sorry):

docker run --name "postgis" -p 25432:5432 -d -t kartoza/postgis:12.1
docker exec -it postgis bash
# then in the container itself:
apt update
apt install postgis  # for shp2pgsql, update is needed first for the package to be discoverable
apt install unzip
wget https://www.naturalearthdata.com/http//www.naturalearthdata.com/download/110m/cultural/ne_110m_admin_0_countries.zip
wget https://www.naturalearthdata.com/http//www.naturalearthdata.com/download/50m/cultural/ne_50m_admin_0_countries.zip
unzip ne_50m_admin_0_countries.zip
shp2pgsql -c -D -i -I -G ne_50m_admin_0_countries.shp | PGPASSWORD=docker psql -h localhost -U docker gis
unzip ne_110m_admin_0_countries.zip
shp2pgsql -c -D -i -I -G ne_110m_admin_0_countries.shp | PGPASSWORD=docker psql -h localhost -U docker gis

Then in

PGPASSWORD=docker psql -h localhost -U docker gis

I tested the following query:

WITH latitudes AS (
	SELECT generate_series AS latitude
	FROM generate_series(-90, 90, 0.1)
), longitudes AS (
	SELECT generate_series AS longitude
	FROM generate_series(-180, 180, 0.1)
), points AS (
	SELECT ST_SetSRID(ST_Point(longitude, latitude), 4326)::geography AS geog
	FROM latitudes
	CROSS JOIN longitudes
)
SELECT
	geog,
	(
		SELECT name
		FROM ne_50m_admin_0_countries AS ne
		ORDER BY p.geog <-> ne.geog
		LIMIT 1
	)
FROM points AS p
;

That query on the ne_50m_admin_0_countries table: 250MB RAM (RES) usage after 30 minutes

Same query with ne_110m_admin_0_countries: > 2G RAM (RES) and growing ~8MB/s after 15 minutes

comment:3 by Algunenano, 4 years ago

Milestone: PostGIS 3.0.3

comment:4 by Algunenano, 4 years ago

I can reproduce it with the 110m query when loading the data from the shp as a geography.

The setup operation (to create the points relation) takes ~5 MB and mantains the memory stable, so the issue comes from the last JOIN.

Since the only way I know to leak memory from Postgis is in indexes (since those don't work with the same Memory context limits) I tried the same query disabling indexscan (set enable_indexscan = false;) and with that the issue does not appear.

I've had a look at the offending function (gserialized_gist_geog_distance) but I haven't seen anywhere were a leak might happen, so I don't really know if it's the index leaking memory or the function finishing faster and caching some results. It needs a more in depth investigation.

comment:5 by Algunenano, 4 years ago

I also see this behaviour if I load the column as a geometry and have an index over that geometry, even If I remove geography out of the query completely:

WITH latitudes AS (
SELECT generate_series AS latitude
FROM generate_series(-90, 90, 0.1)
), longitudes AS (
SELECT generate_series AS longitude
FROM generate_series(-180, 180, 0.1)
), points AS (
SELECT ST_SetSRID(ST_Point(longitude, latitude), 4326)::geometry AS geog
FROM latitudes
CROSS JOIN longitudes
)
SELECT
geog,
(
SELECT name
FROM ne_110m_admin_0_countries AS ne
ORDER BY p.geog <-> ne.the_geom
LIMIT 1
)
FROM points AS p
;

Plan:

 Nested Loop  (cost=0.01..3203541.08 rows=1000000 width=64)
   ->  Function Scan on generate_series  (cost=0.00..10.00 rows=1000 width=32)
   ->  Function Scan on generate_series generate_series_1  (cost=0.00..10.00 rows=1000 width=32)
   SubPlan 1
     ->  Limit  (cost=0.27..3.05 rows=1 width=17)
           ->  Index Scan using ne_110m_admin_0_countries_the_geom_idx on ne_110m_admin_0_countries ne  (cost=0.27..492.07 ro
ws=177 width=17)
                 Order By: (the_geom <-> st_setsrid(st_point((generate_series_1.generate_series)::double precision, (generate
_series.generate_series)::double precision), 4326))
(7 rows)

OTOH, if I use ST_Distance instead of <->, which doesn't use indexes, then the memory is really stable. Same if I use <-> but remove indexscan as a possible subplan.

So it looks like an issue with the index subplans, where either Postgis or Postgres, or both :D, leaking memory or caching way more things.

comment:6 by Algunenano, 4 years ago

As an extra test, I've tested several other index operators (replacing ←> in the query) and only <-> and <#> seem to exhibit this behaviour, but I don't know if it might have something to do with the ordering or maybe the recheck operations which I don't know how it works exactly.

comment:7 by pramsey, 4 years ago

Just reading through I'm not seeing any places where the ←> operator or the gserialized_gist_distance_2d() function fails to free allocated memory. It seems quite careful about that, in fact. Index support functions are one of the few places you can leak memory easily, since they don't clean up their memory contexts as often as function calls.

comment:8 by Algunenano, 4 years ago

One thing I've been thinking about and want to review is about the recheck flag. I think that when recheck is marked we end up calling the non-gist function (ST_Distance in this case), so it might be that ST_Distance needs to be "leak-free".

I'm going to have a look at how this recheck thingy works to see if that offers any explanations.

comment:9 by Algunenano, 4 years ago

Ok, so gist calls ←> (geometry, geometry) on recheck, and that is defined as:

CREATE OPERATOR <-> (
	LEFTARG = geometry, RIGHTARG = geometry, PROCEDURE = geometry_distance_centroid,
	COMMUTATOR = '<->'
);

Following the thread:

CREATE OR REPLACE FUNCTION geometry_distance_centroid(geom1 geometry, geom2 geometry)
	RETURNS float8
	AS 'MODULE_PATHNAME', 'ST_Distance'
	LANGUAGE 'c' IMMUTABLE STRICT PARALLEL SAFE
	_COST_MEDIUM;

So ST_Distance is being called in the index.

To simplify things and make sure only index related functions are being used from Postgis, I decided to create a new table to use that instead of the CTE:

CREATE TABLE example_points AS WITH latitudes AS (                                           
SELECT generate_series AS latitude
FROM generate_series(-90, 90, 0.1)
), longitudes AS (
SELECT generate_series AS longitude
FROM generate_series(-180, 180, 0.1)
), points AS (
SELECT ST_SetSRID(ST_Point(longitude, latitude), 4326)::geometry AS geog
FROM latitudes
CROSS JOIN longitudes
)
Select * from points;

Then the query is simplified to:

SELECT                                                        
geog,                            
(                                
SELECT name      
FROM ne_110m_admin_0_countries AS ne
ORDER BY p.geog <-> ne.the_geom
LIMIT 1                        
)                                                                 
FROM example_points AS p       
;

With the following plan:

                                                                  QUERY PLAN                                                                  
----------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on example_points p  (cost=0.00..18181019.39 rows=6485383 width=64)
   SubPlan 1
     ->  Limit  (cost=0.14..2.79 rows=1 width=17)
           ->  Index Scan using ne_110m_admin_0_countries_the_geom_idx on ne_110m_admin_0_countries ne  (cost=0.14..468.49 rows=177 width=17)
                 Order By: (the_geom <-> p.geog)
(5 rows)

If I run this on current REL_12_STABLE postgres and master postgis the 2 inputs into Postgis that I see are:

               - 97.94% ExecSubPlan                                                                                                                                                                                                                         ▒
                  - 97.22% ExecLimit                                                                                                                                                                                                                        ▒
                     - 96.22% ExecScan                                                                                                                                                                                                                      ▒
                        - 76.51% IndexNextWithReorder                                                                                                                                                                                                       ▒
                           - 50.83% index_getnext_slot                                                                                                                                                                                                      ▒
                              - 49.76% gistgettuple                                                                                                                                                                                                         ▒
                                 - 46.69% gistScanPage                                                                                                                                                                                                      ▒
                                    - 26.41% FunctionCall5Coll                                                                                                                                                                                              ▒
                                       - 24.93% gserialized_gist_distance_2d 

And:

         - 98.37% ExecScan                                                                                                                                                                                                                                  ▒
            - 98.13% ExecInterpExpr                                                                                                                                                                                                                         ▒
               - 97.94% ExecSubPlan                                                                                                                                                                                                                         ▒
                  - 97.22% ExecLimit                                                                                                                                                                                                                        ▒
                     - 96.22% ExecScan                                                                                                                                                                                                                      ▒
                        - 76.51% IndexNextWithReorder                                                                                                                                                                                                       ▒
                           - 22.71% ExecInterpExpr                                                                                                                                                                                                          ▒
                              - 22.51% ST_Distance 

So if anything is leaking on our side it has to be related to the implementation of either gserialized_gist_distance_2d or ST_Distance.

After around 60 seconds running the query, the memory used by the backend in charge of the query was 3.7 GB and growing.

To compare with something, although I don't know how useful / comparable that is, I changed the implementation of gserialized_gist_distance_2d to return 0 and recheck = true without doing any actual work, and ST_Distance to return 0 without any work either. After 60 seconds running the query, the memory reserved by the process was growing but much smaller: 120 MB.

The next thing I've tested was to never recheck (always return false) and rerun the query. When I do this, after 60 seconds the memory is on 50 MB, also growing. But for some reason ST_Distance is still being called (way less times as it only represents 6.84% of the time.

So I'm 100% sure this has something to do with recheck now, but I don't know if this is on our side (ST_Distance) or in Postgres side.

comment:10 by Algunenano, 4 years ago

I've modified ST_Distance to create its own memory context (per call) and that removes the memory leak, so the problem is definitely there.

Since the function itself has 2 clear parts (geometry deserialization and lwgeom_mindistance2d call) I decided to narrow down the context change to the function call (after generating the geometries from the serialized buffers), and with that the leak was also gone, so lwgeom_mindistance2d might be allocating memory and not freeing it, which is odd since it's tested in liblwgeom under valgrind.

comment:11 by pramsey, 4 years ago

Owner: changed from pramsey to Algunenano

comment:12 by robe, 3 years ago

Milestone: PostGIS 3.0.3PostGIS 3.0.4

comment:13 by Paul Ramsey <pramsey@…>, 3 years ago

In c2193ed/git:

Add in test that exercises lw_dist2d_distribute_fast, references #4720

comment:14 by pramsey, 3 years ago

Hm, I don't see any valgrind tests of liblwgeom in the ci scripts anymore. Not that I'm seeing any problems running manually.

==98887== LEAK SUMMARY:
==98887==    definitely lost: 0 bytes in 0 blocks
==98887==    indirectly lost: 0 bytes in 0 blocks
==98887==      possibly lost: 0 bytes in 0 blocks

comment:15 by robe, 3 years ago

Any hopes of having this dealt with in next couple of days or should we push to 3.0.5?

comment:16 by Algunenano, 3 years ago

Milestone: PostGIS 3.0.4PostGIS Fund Me

It needs further investigation at both PG and Postgis level

comment:17 by kalenik, 2 years ago

I did investigation of this problem and discovered that memory leaks on PG side. Thread: https://www.postgresql.org/message-id/flat/CAHqSB9gECMENBQmpbv5rvmT3HTaORmMK3Ukg73DsX5H7EJV7jw%40mail.gmail.com

comment:18 by kalenik, 2 years ago

Fix is pushed in Postgres and backported to all supported versions https://github.com/postgres/postgres/commit/3f74daa8dfc634132ff7270212c57da5e316b001

comment:19 by kthujvu, 2 years ago

A big thanks to all of you for this thorough investigation and fix. You are awesome!

comment:20 by robe, 2 years ago

Milestone: PostGIS Fund MePostGIS PostgreSQL
Resolution: fixed
Status: newclosed

@kalenik thanks for fixing.

Note: See TracTickets for help on using tickets.