Opened 14 years ago

Closed 14 years ago

#175 closed defect (fixed)

Index of 20,000 records maxes out java heap space

Reported by: ddnebert
Owned by: geonetwork-devel@…
Priority: major
Milestone: v2.4.2
Component: General
Version: v2.4.2
Keywords: java heap, index, harvest
Cc: Simon.Pigot@…

Description

We have tried all manner of memory settings on the Jetty servlet container to enable the indexing of 20,000 records in ISO format, each only ~40 KB in size. The current setting is: java -Xms48m -Xmx1024m -Xss36m, with the guidance being that one should not exceed the overall 2 GB process limit on this 32-bit Linux environment.

We are using the Harvest Management function with the Local File System harvester; the metadata are valid ISO 19139; we are running 2.4.2 under 32-bit Linux with Jetty and MySQL.

We don't want to split the files into smaller groups, as this defeats the purpose of identifying a collection, working with it as a whole, and setting harvest rules on it. There appears to be a defect in the code: records being harvested are not committed incrementally, but only late in the process, maxing out the Java heap space. The process should be more serial and progressive, allowing records to become visible soon after they are processed. As it stands, the system is not viable for use on these larger collections until the defect is fixed.

The 113MB file of metadata records is available at http://mapcontext.com/gcmdiso.zip for testing purposes.

Change History (1)

comment:1 by simonp, 14 years ago

Resolution: fixed
Status: new → closed

The way the local filesystem harvester currently works is that all XML files in the directory tree are read and stored as JDOM Element objects before any of them are written to the database - with 20,000 XML files in your directory tree, this will likely cause you to run out of memory or thrash before anything ever gets written (it did for me anyway).
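Roughly, the current behaviour looks like the sketch below; the helpers listXmlFiles and writeToDatabase are hypothetical stand-ins for the harvester's actual traversal and persistence code, not the real method names.

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;
    import org.jdom.Element;
    import org.jdom.input.SAXBuilder;

    // Sketch of the memory-heavy pattern: every file in the tree is parsed into
    // a JDOM Element before a single record is written to the database.
    static void harvestEager(File baseDir) throws Exception {
        SAXBuilder builder = new SAXBuilder();
        List<Element> records = new ArrayList<Element>();
        for (File f : listXmlFiles(baseDir)) {                  // hypothetical traversal helper
            records.add(builder.build(f).getRootElement());     // all ~20,000 trees held in RAM
        }
        for (Element metadata : records) {
            writeToDatabase(metadata);                           // hypothetical persistence helper
        }
    }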

A simple modification to the harvester is to capture only the filename when the directory tree is traversed and defer reading the file into a JDOM Element object until we need to write it to the database - much less heavy on RAM, and it's working now on your 20,000-record test case. I'll commit the change later tonight if all is ok.
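A rough sketch of the deferred-read approach, using the same imports and hypothetical helpers as the sketch above:

    // Sketch of the modified behaviour: only the filenames are collected during
    // traversal, and each file is parsed just before it is written, so at most
    // one record's JDOM tree needs to be in memory at a time.
    static void harvestDeferred(File baseDir) throws Exception {
        SAXBuilder builder = new SAXBuilder();
        for (File f : listXmlFiles(baseDir)) {                      // cheap: paths only
            Element metadata = builder.build(f).getRootElement();   // parse on demand
            writeToDatabase(metadata);                               // tree can be GC'd after this
        }
    }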

Note this is not an indexing issue - just a memory use oversight for large harvests - and it only applies to the local filesystem harvester.

Update: fixed in svn rev 5565 (trunk), 5566 (branches 2.4.x)
