Opened 15 years ago

Closed 14 years ago

#3187 closed defect (fixed)

Oracle FastCGI memory leak with 5.6.0-betas

Reported by: jmckenna Owned by: aboudreault
Priority: normal Milestone: 5.6 release
Component: Input - Native Oracle Spatial Support Version: unspecified
Severity: normal Keywords:
Cc: jimk, msmitherdc, aaime, dmorissette, pramsey, aboudreault, sdlime

Description

Initially discovered in the benchmarking exercise:

From Frank:

Mike and I spent some time looking into the leak tonight to no real compelling effect I'm afraid. Mike confirmed the problem is just the oracle case, and only occurs in 5.6, not 5.4.

I ran mapserv for one oracle map request under valgrind and there was no obvious sign of big leakage beyond the one time connection setup costs.

I guessed there might be problems with the connection pooling, and did some fastcgi debug logging, but after Mike pointed out an obvious point I concluded that the connection pooling did seem to be working ok and there is no particular reason to believe there is a connection level leak.

The changes between 5.4 and 5.6 are huge.

From Steve Lime:

Jim found potential leaks in the template processing but that shouldn't come into play in this case and if a map is "free'd" then I believe those all go away. If not then you'd see problems with the other drivers as well. He's still looking...

Attachments (5)

oracle-fcgi.log (5.6 KB ) - added by aboudreault 14 years ago.
maporaclespatial.c.patch (765 bytes ) - added by jimk 14 years ago.
nquery-querymap.log (14.0 KB ) - added by jimk 14 years ago.
maporaclespatial.c-3187.patch (20.4 KB ) - added by jimk 14 years ago.
SEWERS.zip (4.8 KB ) - added by jimk 14 years ago.
Map file producing the Oracle Error on nquery w/querymap on

Download all attachments as: .zip

Change History (47)

comment:1 by jmckenna, 15 years ago

Cc: pramsey added

comment:2 by jmckenna, 15 years ago

Cc: msmitherdc added; msmith removed

comment:3 by sdlime, 15 years ago

Cc: aboudreault added

I wonder if this could be similar to the leak the was found and fixed with shapefiles (#3188)... It occurred because the shapefile driver allocated memory for things beyond layer->items and layer->numitems, specifically the layer->iteminfo (which is a void * and an array of ints for shapefiles). I'd look at the Oracle version of msLayerGetItems() to see what is allocated at that step.

Steve

comment:4 by aboudreault, 15 years ago

Owner: changed from sdlime to aboudreault
Status: newassigned

comment:5 by sdlime, 15 years ago

Cc: jimk added

comment:6 by aboudreault, 15 years ago

I tried to reproduce the leak here with the benchmarking data (road-merged-oraclespatial layer) but I see no obvious leak. I tried with shp2img (which contains a simple char* memory leak that I fixed) and also a WMS GetMap request, but I see no leak related to mapserver. Does anyone have a request that I can see the leak?

comment:7 by aboudreault, 15 years ago

I'm going to setup a machine with mapserver in fastcgi.

comment:8 by aboudreault, 14 years ago

Cc: sdlime added

I can confirm that the problem is due to the connection caching. The change has been made for the One-pass query implementation (r9358). Steve, what would be the solution to fix this? I would not want to break your rfc 52 implementation.

comment:9 by jimk, 14 years ago

This will need further research... the reason for that change was because otherwise a shared resource was getting free'ed to soon which would cause a second (or later) layer from the same database connection to fail.

comment:10 by aboudreault, 14 years ago

Do we still consider that bug as a blocker or we push it to 5.6.1 ? jimk, sdlime do you have an idea of what could be done for the release ? Maybe a temporary fix ?

comment:11 by sdlime, 14 years ago

I have to defer to Jim on this since he's done all the work with the Oracle code. Is a connection cache different than the pooling supported by FastCGI? The one-pass query processing relies on db connections and result set handles being persistent over a longer period of time, perhaps across a number of presentations (e.g. template and query map). The only real safe time to close things in that context is in the msFreeMap() stuff via msLayerClose().

I think this ones a blocker since oracle is really only usable under FastCGI, correct?

Steve

in reply to:  11 ; comment:12 by jimk, 14 years ago

We (St. Paul) use OracleSpatial all the time w/o FastCGI. (And hence I don't have a FastCGI instance to test with.)

I could look at the code again, but it is my understanding that everything should be cleaned up when the connection pool is released (and by connection pool, I mean using the connection pool that exist with one cgi call between layers to the same data source.) Does FastCGI act differently?

Also, maybe what is really going on is the cache is getting created multiple times and freed once...

in reply to:  12 ; comment:13 by jimk, 14 years ago

Replying to jimk:

I am seeing with valgrind on one of our layers that

OCIEnvCreate and OCILogon both called from msOCISetHandlers are leaking (w/o fastcgi) so I think I can troubleshoot this.

in reply to:  13 comment:14 by jimk, 14 years ago

It appears that as layers are opened and closed that the environment handles, statement handles, logon/logoff, etc. are all getting called correctly (at least for mode=map w/o fastcgi). This makes me wonder how r9358 could have caused this.

It also appears that there is a 907,xxx byte leak in the OCI library that is created as soon as any OCI calls are made and they do not get cleaned up until after the process terminates. They also don't appear to grow during the process lifetime with the number of layers drawn. This leak occurs in even the a simple OCI test app that doesn't even connect to a DB, so I don't think it is a worry.

Is msLayerClose getting called enough times in FastCGI to actually close the layer? (msLayerClose needs to be called at least once for every time msLayerOpen is called to release the Oracle resources.)

Any ideas on how to troubleshoot further?

comment:15 by aboudreault, 14 years ago

Jimk, the msLayerOpen and msLayerClose are called on each request. I'm going to attach a log exemple produced by a mapserver in fastcgi.. if that can help. Let me know if I can do anything else.

by aboudreault, 14 years ago

Attachment: oracle-fcgi.log added

comment:16 by dmorissette, 14 years ago

Does anyone have a bright idea to solve this one?

comment:17 by jimk, 14 years ago

I find it interesting the layer is opened once and closed twice, but this shouldn't cause a problem.

Also I see msConnPoolRegister and msConnPoolRelease are both called once.

Can you tell if msOCICloseHandlers is ever getting called by the connection pool mechanism? I don't see a ready made debug statement there so you will have to add a debug line or break on it in a debugger.

comment:18 by aboudreault, 14 years ago

After the test, msOCICloseHandlers is getting called as expected at the end, if we close apache.

comment:19 by sdlime, 14 years ago

Jimk: The layer is closed in a regular draw operation and again in the map free'ing stage. That would have been the case in prior releases too.

Can anyone sum up the problem at this point now that some debugging as been done? Do we know what data are being leaked? I wish I had a way to test but I have no access to an Oracle server.

Steve

in reply to:  18 comment:20 by jimk, 14 years ago

Replying to aboudreault:

After the test, msOCICloseHandlers is getting called as expected at the end, if we close apache.

Does this mean that the connection pool is not cleaned up until apache is shutdown (apachectl stop)?

in reply to:  19 comment:21 by jimk, 14 years ago

Replying to sdlime:

Jimk: The layer is closed in a regular draw operation and again in the map free'ing stage. That would have been the case in prior releases too.

Ok, it isn't a problem, it just seemed strange.

Can anyone sum up the problem at this point now that some debugging as been done? Do we know what data are being leaked? I wish I had a way to test but I have no access to an Oracle server.

My best guess is the Oracle connection handle (which includes a rowset cache) is not being closed when the last reference to the layer is released via the connection pool. I would also guess that when the layer is re-opened (on a later request) that a new connection (and rowset cache) is created.

comment:22 by aboudreault, 14 years ago

Jimk, yes the connection pool is only cleaned when apache is shutdown.

What do you mean by "the Oracle connection handle is not being closed when the last reference to the layer is released" ? The Oracle connection handle shouldn't be closed at all until the apache is shutdown, else the CLOSE_CONNECTION=DEFER option wouldn't be useful. And checking the log file attached.. I confirm that only one oracle connection handle is created. I don't know why the oracle connection cache is always growing and never freed ... but the only fix I have been able to do is to place a OCICacheFree() call in the layerClose() function. I don't know if the problem may be related to object pined that aren't unpined...

I have a virtual machine set with mapserver in fastcgi with a working mapfile and an oracle layer ... and I can create account to anyone if you need so.

in reply to:  22 comment:23 by jimk, 14 years ago

I'll try putting that back, but I know I moved it from msLayerClose to msOCICloseHandlers to fix a seg-fault. It is possible that change was just covering for something else that is now fixed.

comment:24 by jimk, 14 years ago

Looks like it works. See attached.

by jimk, 14 years ago

Attachment: maporaclespatial.c.patch added

comment:25 by aboudreault, 14 years ago

Ok, so you can confirm that your following comment in the one-pass query can be ignored and the we can revert that change in trunk? Hopefully you can redo the same you did when you got the seg-fault: http://trac.osgeo.org/mapserver/ticket/3069#comment:54

in reply to:  25 comment:26 by jimk, 14 years ago

Replying to aboudreault:

I can't reproduce the seg-fault now. I wonder if this change was a side effect of the one-pass code closing layers too soon (which was an early bug w.r. to query maps)

I have tested with the patch above which is a little different than just rolling back r9358. With the above patch to trunk I left the OCICacheFree in msOCICloseHandlers and in both functions OCICacheFree is protected by an if statement to make sure its arguments are valid.

comment:27 by aboudreault, 14 years ago

And if remove the if protection in the layerClose() fonction, Is the seg-fault happen? Just want to be sure that if we revert the changeset, there will be nothing broken related with the one-pass query implementation. If this last test fails, the seg-fault you got wasn't related to that call and I will revert it (keeping the if protection).

in reply to:  27 ; comment:28 by jimk, 14 years ago

Replying to aboudreault:

Ok... I missed something from before. Querymaps need to be enabled to trigger the bug.

With querymaps enabled, OCICacheFree in msLayerClose causes an error with or without the if protection. So... the patch from today does cause a problem.

msDrawMap(): Image handling error. Failed to draw layer named 'SANITARY'. msOracleSpatialLayerResultGetShape(): OracleSpatial error. msOracleSpatialLayerResultGetShape could not fetch specified record. msOracleSpatialLayerResultGetShape: OracleSpatial error. ORA-21710: argument is expecting a valid memory address of an object 

by jimk, 14 years ago

Attachment: nquery-querymap.log added

in reply to:  28 comment:29 by jimk, 14 years ago

Replying to jimk:

It looks like OCICacheFree is too broad to call in msLayerClose because according to the Oracle Docs it releases more or less all objects held by the connection (which is shared via the connection pool between layers).

I'm looking for a more specific way to release resources at the layer level. Suggestions are welcome.

comment:30 by jimk, 14 years ago

It looks like the stuff in dthand is getting allocated for each layer and doesn't get cleaned up until the connection is closed.

I merged dthand and hand (msOracleSpatialDataHandler and msOracleSpatialHandler) because they both appear global to the connection. This way the describe and pin of SDO_GEOMETRY is saved in the connection handle and returned from the connection pool instead of being re-generated for each layer.

I never really understood why those two were separate in the first place, but they were like that when I found it so I had left it alone.

Please give the maporaclespatial.c-3187.patch a try.

by jimk, 14 years ago

comment:31 by aboudreault, 14 years ago

Unfortunately, the last patch has not changed anything in the leak bug.

comment:32 by aboudreault, 14 years ago

Jimk, I've tried to reproduce your initial bug with the OCICacheFree call in the layerClose() function: In my map, I have 2 oracle layers using connection=defer and I'm doing a wms GetFeatureInfo. Is it supposed to crash with the code ORA-21710? If the case I'm trying is wrong, may you provide me more info about how to reproduce your bug.

comment:33 by jimk, 14 years ago

QUERY_STRING="map=SEWERS/SANITARY.map&mapext=562972.8473137183 156217.9245356886 564396.2470919212 156634.23923404876&mapsize=718 210&mode=nquery&searchmap=true"

by jimk, 14 years ago

Attachment: SEWERS.zip added

Map file producing the Oracle Error on nquery w/querymap on

comment:34 by aboudreault, 14 years ago

still trying to reproduce that with trunk, with a similar request using gmap and 2 oracles layers.. unsucessfully for the moment.

comment:35 by aboudreault, 14 years ago

jimk, are you sure you reproduced that bug with the trunk and the mapfile/templates you attached? I was looking your templates and I noticed that you template do not have the "template magic string" required since the implementation of rfc 56.

I tried to reproduce the bug but I'm not able to. Checking closely your nquery-querymap.log file, I also noticed that you have a call to msOracleSpatialLayerClose at line 92, then immediately a call to msOracleSpatialLayerResultGetShape. How come ? In my test, all layers were closed at the end, after the FreeMap, so all GetShape() calls are made before any layer close.

What is your oracle server/client version? I am using 11.1 on both.

I would need a test case with data if I want to go further into that bug.

in reply to:  35 comment:36 by jimk, 14 years ago

Replying to aboudreault:

Yeah. I commented the magic string part out of the template code because it was a migration headache (If I put the magic string in the template code the old mapserver versions break because

<!-- MapServer Template -->
<? xml version="1.0" ?>

doesn't cut it for the start of an XML doc. I'll fix all the templates when 5.6 is stable. Also, the bug occurs before MapServer notices that the templates don't have the magic string (if the the magic string check is left in tacked.

I have just verified the bug with r9549. Keep in mind it only occurs if I put the OCICacheFree call back into msLayerClose and query maps are on. It does not occur for r9549 proper or with r9549 with my second patch applied.

So summary: The first patch (which rolls back r9358 but with some added sanity checks) when applied to r9549 breaks things for me.

The msOracleSpatialLayerClose call at line 92 is related to query maps being on. If I turn Query Maps off then I do not see a msOracleSpatialLayerClose until after all of the msOracleSpatialLayerResultGetShape calls. Are you sure your test tried to draw a query map?

Oracle Server is 11.1.0.7, Client is 10.2.0.1. Both are on SLES 10sp1 x86_64.

I am unfortunately not in a position where I can give out the data behind this without someone signing something (these are not my rules).

I can try compiling against the 11.1 client and see if that changes things.

Otherwise, can you help me get FastCGI going so I can recreate this bug?

comment:37 by aboudreault, 14 years ago

jimk, thanks! I finaly reproduced the bug adding the querymap status on in the mapfile. I'll try to see if that couldn't be fixed.

Also If you want to see yourself the memory leak problem, I can easily create an account for you on my virtual machine. It is already set and functionnal. (I can of course help you too to setup something similar)

comment:38 by jimk, 14 years ago

Good news.

If you don't mind setting up an account, I would like to see the bug in action.

comment:39 by aboudreault, 14 years ago

Please, send me a mail, I'm going to send you the account info. aboudreault@…

comment:40 by aboudreault, 14 years ago

After a better understanding of what the new single query pass implementation does and how Oracle handle his cache, a solution that we found would be to add a "ref counter" into the oracle connection handler (which is shared between layers). Daniel confirmed me that the oracle single queries implementation had no choice to be able to deal with the fact that a layer can be closed at anytime and should not affect the connection handler and its cache.

The ref counter would be increased in the msLayerOpen() function and decreased in msLayerClose(). Then, when we close a layer, we can verify if there is other opened layers that could still use the oci cache objects and free the cache if not. That way, we ensure that the cache will not be freed during map queries and also that the cache is cleared properly after each request (to avoid memory leak in fastcgi).

This fix is probably temporary since jimk would like to rewrite some parts of the driver and cleaning up in the future.

Jimk should check another small problem (segfault) soon and we'll commit the changes.

comment:41 by jimk, 14 years ago

My 2 cents is to commit the patch as is. The problem I found will eventually need to be resolved, however, the ref-counting patch is better than trunk for at least two known use cases and worse for no known case.

The problem I found is fairly obscure... it requires oracle spatial layers + FastCGI + a query mode + query maps on. The bug is a seg-fault that occurs on the second request to the FastCGI process.

comment:42 by dmorissette, 14 years ago

Resolution: fixed
Status: assignedclosed

Alan committed a fix for the OCI cache leak using a ref counter in r9556, so this ticket can be closed.

Jim, I have created ticket #3210 for your seg fault since it does not appear to be directly related to the memory leak that was the object of the current ticket.

Note: See TracTickets for help on using tickets.