Opened 10 years ago

Closed 10 years ago

#851 closed defect (fixed)

[raster] Numerous ST_Value query cause the memory to grow indefinitely

Reported by: pracine Owned by: jorgearevalo
Priority: blocker Milestone: PostGIS 2.0.0
Component: raster Version: master
Keywords: Cc:

Description

A query like this:

SELECT ST_Value(rast, 1, x, y) FROM generate_series(1, 10) x, generate_series(1, 10) y, srtm_22_03_tiled_10x10

causes the memory to grow indefinitely finally returning an error. There seems to be a memory leek somewhere. I am wondering if we should not use PG_FREE_IF_COPY(pgraster,0) in rt_pg.c

Also: Smaller the tiles, faster the memory grows. But I think this is only related to the page size. Equivalent area splitted in small tiles just loads faster.

Change History (37)

comment:1 Changed 10 years ago by pracine

The problem seems to come from "get_rt_context" which switch the context of the most PostGIS raster functions preventing the PostgreSQL function manager to free the memory allocated to the context.

Shall we not delete this context using rt_context_destroy(ctx) in all the functions except those returning PG_RETURN_POINTER(pgraster)?

comment:2 Changed 10 years ago by strk

I've the impression that the ctx variable within get_rt_context was meant to be a static. Or the if ( ctx ) return ctx; would make no sense.

Haven't tried it myself, but would be worth a quick test.

comment:3 Changed 10 years ago by pracine

Making it static in the get_rt_context cause the server to crash. Should it be static in every function or rt_pg?

comment:4 Changed 10 years ago by strk

Uhm.. the TODO in the function might be the cause of the crash (destroyed too early by postgresql ?) Where does the crash happen ? Have a backtrace or even better a valgrind report?

comment:5 Changed 10 years ago by pracine

I know that if I don't make it static and I use rt_context_destroy(rt_context ctx) at the end of the proper rt_pg functions, everything goes fine and momory do not grow anymore. My guess is that this recipe should not be applied to function returning PG_RETURN_POINTER(pgraster)

But making it static seems to be nicer, so I will trace a little bit.

comment:6 Changed 10 years ago by strk

Fixing rt_context_new itself to use malloc rather than the provided allocator might be a quicker fix than playing with the memory context switch.

comment:7 Changed 10 years ago by pracine

Why?

Should ctx still be static anyway?

Should it be static in every rt_pg function?

comment:8 Changed 10 years ago by strk

Did you get any interesing traces about your crash yet ?

I was just guessing that the memory context in which ctx is allocated is being cleared while still used (in the static case). Using malloc for allocating it would confirm that's the case.

I don't have much time atm to take a closer look at the issue.

comment:9 Changed 10 years ago by jorgearevalo

I don't really find the meaning of the three fist lines of "get_rt_context"

rt_context ctx = 0;

MemoryContext? old_context;

if ( ctx ) return ctx;

As strk said, seems that ctx was a static variable sometime in the past, but now the third line has no sense.

Pierre, did you find any problem with get_rt_context? Do you have a log? I'm trying to reproduce it before applying the suggested changes (make ctx static, replace palloc by malloc).

comment:10 Changed 10 years ago by jorgearevalo

Tested with this file: http://gis-lab.info/data/srtm-tif/srtm_35_04.zip, block size 50x50px. No memory problems. Ubuntu 9.10, PostgreSQL 8.4.5, PostGIS 2.0.0SVN.

comment:11 Changed 10 years ago by pracine

No memory growing? Nothing? What if you test on a coverage composed of smaller tiles (10x10)? The memory should grow faster and higher.

This memory growth might be acceptable on one srtm raster but it overflows on a coverage composed of many of those.

This symptom is generalised to most functions. It is worst with ST_Value because when iterating over a raster you call it millions of time but the same problem would show up with most function if there are millions of tiles.

I am pretty sure this is the same problem Etienne encountered here:

http://postgis.refractions.net/pipermail/postgis-devel/2011-January/011505.html

The test I did was to simplify RASTER_getPixelValue to the point where it just returns a constant, no ctx, nada and in this case there is no general memory growth. Every function should release any memory allocated and the memory use should be almost constant without growing constantly.

comment:12 Changed 10 years ago by pracine

Again if I put rt_context_destroy(rt_context ctx) at the end of the function, it eliminate the memory growth and everything goes smooth. I'm just afraid this is not the best fix.

comment:13 Changed 10 years ago by strk

@pracine: you should get pretty much the same behavior with less effort by dropping all the MemoryContext? switch in rt_get_context, as in that case the context itself should be allocated in the function's memory context. You may try that to centralize control over that in a single place (the context getter function) and easily toggle between the static and non-static version.

comment:14 Changed 10 years ago by pracine

Priority: criticalblocker

comment:15 Changed 10 years ago by pracine

Owner: changed from pracine to jorgearevalo

comment:16 Changed 10 years ago by jorgearevalo

Status: newassigned

For the record: After 8h executing the query in a 64bits machine, with a 10x10px tiled raster, it took 240MB. 180MB from heap. And growing. As Pierre said, typical memory leak.

Same symptom in MapAlgebra?, like Pierre said too. And there're 2 common potential memory-leak points:

  • rt_context_new: context allocated is not freed. It should be always freed.
  • rt_raster_deserialize: allocated memory for raster object. It should be always freed. Even when the function returns that raster, because the raster is serialized again before returning it.

Freeing that memory should solve the memory leak. Testing it.

comment:17 Changed 10 years ago by strk

Note that the raster object memory should be already freed by postgresql itself, assuming the memory context is correct. Doesn't hurt to be explicit about that though.

comment:18 Changed 10 years ago by jorgearevalo

Yes, the pgraster object is freed by postgresql, but the raster object deserialized with a core function should be explicitly freed, like the context created. Correct me if I'm wrong.

Added some memory deallocations in r7008, but a small memory grow still exist. Debugging.

comment:19 Changed 10 years ago by strk

The core deserializer uses allocators set in the rt_context. In the rt_pg case, this is palloc, which is supposed to release the memory when the associated pool goes out of scope.

Now, the scope might be wider than you'd like, in which case it's fine to release manually, but it shouldn't live cross a single query.

comment:20 Changed 10 years ago by jorgearevalo

I've done these tests:

  • Execute the query without changes.
  • Simplify RASTER_getPixelValue to return a constant float value. It does nothing more.
  • Avoid the RASTER_getPixelValue call and execute this query:

SELECT x, y FROM generate_series(1, 10) x, generate_series(1, 10) y, srtm_35_04_10x10

And I'm getting the same memory growth in all cases. My table srtm_35_04_10x10 has 360000 entries. In my 64 bits machine, with 4GB RAM, all queries end, but take about 60-70% of available memory. In a 32 bits machine with 1GB RAM, the process is killed (even when I'm not calling RASTER_getPixelValue at all)

So, apparently the problem is not caused by RASTER_getPixelValue. It's related with the 36000000 rows postgres has to manage. Has sense?

comment:21 Changed 10 years ago by strk

Makes sense to me. If confirmed I'd recommend going back to avoid explicit destruction of the rt_context. That way it will still be possible to turn the context into a static by acting in a single place.

comment:22 Changed 10 years ago by jorgearevalo

Yep, you're right. The memory for rt_context, rt_raster and pgraster should be automatically released, because is allocated with palloc, at the end. There's no need of explicit destruction.

Anyway, shouldn't postgres start swapping memory instead of crashing? Maybe I'm missing something.

comment:23 Changed 10 years ago by pracine

Did you do the test of adding rt_context_destroy(rt_context ctx) at the end of the RASTER_getPixelValue function?

comment:24 Changed 10 years ago by pracine

This is strange as I did the test of simplifying RASTER_getPixelValue to return a constant float value and I there was no more memory growth... Did you remove everything related to context in this simplified version?

What if you do:

CREATE TABLE toto AS SELECT x, y FROM generate_series(1, 10) x, generate_series(1, 10) y, srtm_35_04_10x10;

instead?

comment:25 Changed 10 years ago by pracine

I just tested again:

SELECT x, y FROM generate_series(1, 10) x, generate_series(1, 10) y, srtm_35_04_10x10

and the memory doesn't grow...

SELECT version() gives:

"PostgreSQL 8.4.1 on i686-pc-mingw32, compiled by GCC gcc.exe (GCC) 3.4.5 (mingw-vista special r3), 32-bit"

SELECT postgis_version() gives:

"2.0 USE_GEOS=1 USE_PROJ=1 USE_STATS=1"

SELECT postgis_lib_version() gives:

"2.0.0SVN"

SELECT postgis_raster_lib_version() gives:

"2.0.0SVN"

It seems very strange to me that:

SELECT x, y FROM generate_series(1, 10) x, generate_series(1, 10) y, srtm_35_04_10x10;

makes the backend memory to grow. I can understand that the client grow because it is receiving many rows but the backend should behave smoothly (isn't that the strenght of PostgreSQL?).

comment:26 Changed 10 years ago by pracine

Apparently the memory do not grow anymore with the original query... So the bug seem fixed. I do more test...

comment:27 Changed 10 years ago by jorgearevalo

Ok, just reviewed the stupid script I did to test the memory growth. Yes, I was testing the psql process, not the postgres server process. So, with the current trunk version (with rt_context_destroy), there's no memory growth for me.

Anyway, as strk said, there should be no need of explicitly destroying context or raster. I'll do more tests without rt_context_destroy tomorrow. Results here.

comment:28 Changed 10 years ago by pracine

Here are my results:

CREATE OR REPLACE FUNCTION ST_Values(rast raster, band int, OUT val double precision)
RETURNS SETOF double precision
    AS $$
    SELECT ST_Value($1, $2, x, y) val FROM generate_series(1, ST_Width($1)) x , generate_series(1, ST_Height($1)) y;
    $$
    LANGUAGE SQL IMMUTABLE;

--Test 1
--With rt_context_destroy(): no memory growth
--Without rt_context_destroy(): very small memory growth
--2600 milliseconds for 10000
SELECT ST_Values(rast, 1) val, count(*) count 
FROM srtm_22_03_tiled_10x10
--WHERE rid <= 10000
GROUP BY val
ORDER BY count DESC

--Test 2
--With rt_context_destroy(): very small memory growth
--Without rt_context_destroy(): good memory growth. Crashes eventually.
--90000 milliseconds for 10000
SELECT val, count(*) count
FROM (SELECT ST_Value(rast, 1, x, y) val
      FROM generate_series(1, 10) x, generate_series(1, 10) y, srtm_22_03_tiled_10x10
      --WHERE rid <= 10000
     ) foo
GROUP BY val
ORDER BY count DESC;

--Test 3
--With rt_context_destroy(): no memory growth
--Without rt_context_destroy(): good memory growth. Crashes eventually.
--8500 milliseconds for 10000
SELECT ST_Value(rast, 1, x, y) 
FROM generate_series(1, 10) x, generate_series(1, 10) y, srtm_22_03_tiled_10x10 
--WHERE rid <= 10000

--Test 4
--With rt_context_destroy(): no memory growth
--Without rt_context_destroy(): no memory growth
--10000 milliseconds for 10000
SELECT x, y 
FROM generate_series(1, 10) x, generate_series(1, 10) y, srtm_22_03_tiled_10x10
--WHERE rid <= 10000

Conclusion: rt_context_destroy() is still very useful...

Remark: test 2 used to be twice faster than test 1. I will eventually search further in the code history when it became slower. I suspect a change in serialization.

The goal of all that is to make sure PL/pgSQL functions can iterate on every pixels of a coverage fast enough and without crashing.

comment:29 Changed 10 years ago by jorgearevalo

I solved this problem. I've reviewed the entire memory management system of PostGIS Raster, and changed several things. Now rt_core/rt_pg work in the same way liblwgeom/postgis do. Same philosopy. Same memory contexts. All core and regression tests pass but 2. I'm working to solve them. As soon as I fix them, I'll commit the code.

comment:30 Changed 10 years ago by pracine

This seems like is a huge change. Please do not limit yourself to regression tests. Test many things manually and make sure we do not lose on performance. We are near from feature/code freeze for 2.0.

comment:31 Changed 10 years ago by jorgearevalo

Of course. I'm copying the things that work on PostGIS. But the changes can be summarized in:

  • using global vars to store memory allocators, instead of a struct (rt_context), that must be allocated in every function call.
  • working on the SQL calling function memory context. "MemoryContextSwitchTo?" is used only when you need a persistent memory context to store info (SRF).

So, it's not so huge change, really.

comment:32 Changed 10 years ago by strk

Please keep the rt_context struct in place, it can still be like a global by returning it a static pointer from the getter.

comment:33 Changed 10 years ago by jorgearevalo

Ok, no problem with that. But what are the benefits of having a struct to keep pointers to functions against using variables, like done in PostGIS?

Now, all raster core functions receive a pointer to rt_context object (static or dinamically allocated in rt_pg functions, it doesn't matter). In liblwgeom all functions use lwalloc, that simply calls the function pointed by lwalloc_var, declared in liblwgeom.h and set in the caller functions (the functions in postgis directory). Do you think the raster approach is better?

comment:34 Changed 10 years ago by strk

Using a structure is more easily extensible and might allow specific functions to allocate their own. I didn't really look at the code recently but I'd advice to keep extensibility and ease of maintainance in mind.

comment:35 Changed 10 years ago by jorgearevalo

Understood. The main problem I saw is that every time a raster core function is called (for example, rt_band_get_pixel) a new chunk of memory for rt_context is allocated.

This memory is allocated in fcinfo->flinfo->fn_mcxt memory context, a postgres memory context that has query lifespan, and could be even longer lived than that. So, this memory allocated for rt_context must be always released, to avoid memory leaks in cases of losing control because any postgres function throws a elog(ERROR).

So, following Tom Lane's advices, I think most of the raster core functions should work in function calling memory context, not in fcinfo->flinfo->fn_mcxt. The only exception so far is rt_dump_as_wktpolygons function, that needs to store some data to live between function calls. For that reason, it uses multi_call_memory_ctx, the recommended context for this kind of functions (SRF).

Apart from this problem, using a struct to keep pointers to all memory functions together (like rt_context) instead of using single vars (like PostGIS does) is not a big deal. But I think a static rt_context var would be less harmful, in terms of memory consumption.

comment:36 Changed 10 years ago by strk

I agree a static context would be cheaper memory wise. Until there's a need to tweak it on a per-function basis it should be just fine to keep it static.

comment:37 Changed 10 years ago by jorgearevalo

Resolution: fixed
Status: assignedclosed

With the modifications made in r7106 there's no memory growth for me. I've passed all core and regress tests, and the queries shown in this ticket. I'm preparing a README on raster memory management, to commit it today. All comments and suggestions are welcome.

Anyway, I close this ticket.

Note: See TracTickets for help on using tickets.