Opened 12 years ago

Last modified 11 years ago

#995 reopened defect

Exception thrown in child thread causes GN to stall?

Reported by: simonp
Owned by: geonetwork-devel@…
Priority: critical
Milestone: v2.10.0 RC0
Component: Catalog server
Version: v2.8.0
Keywords:
Cc:

Description

It seems that if an exception is thrown in a child thread (e.g. the harvester, or when rebuilding thesauri on startup), GN may stall, i.e. become unresponsive because services hang on dispatch. More research is required. Has anyone else noticed this?

Change History (4)

comment:1 by simonp, 12 years ago

Resolution: invalid
Status: new → closed

Turns out that this is not due to GN and also not due to the exception in the child thread.

In more detail, the symptoms are that GN would either stall in one service or become entirely unresponsive: services would dispatch but then hang. Running kill -3 on the Java process ID prints stack traces of all threads, and these seemed to be deadlocked while 'parking'. A consistent offender was the 'batch import' service (util.import): its stack traces showed threads related to dbms and futures trying to 'park'.
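For anyone trying to confirm similar symptoms without reading a full kill -3 dump, the JVM can also be asked directly whether it sees a lock cycle. The following is a minimal sketch using the standard java.lang.management API; the class name DeadlockProbe and its method are hypothetical and are not part of GN. Note that it only reports true lock cycles: threads that are merely parked waiting on a Future that never completes will not be listed.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not GeoNetwork code.
class DeadlockProbe {

    /** Returns a description of deadlocked threads, or "" if the JVM finds none. */
    static String describeDeadlocks() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Covers both object monitors and java.util.concurrent locks;
        // returns null when no deadlock is detected.
        long[] ids = mx.findDeadlockedThreads();
        if (ids == null) {
            return "";
        }
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : mx.getThreadInfo(ids, true, true)) {
            if (info != null) { // a thread may have died in the meantime
                sb.append(info);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String report = describeDeadlocks();
        System.out.println(report.isEmpty() ? "no deadlock detected" : report);
    }
}
```

If this reports nothing while threads are still stuck in 'parking', the hang is more likely a task that never completes than a classic lock-ordering deadlock.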

It turns out this was happening on just one Linux amd64 machine I had access to, which was running Ubuntu 10.10 (kernel 2.6.32-41-server) with GN using an Oracle db. It doesn't occur on other machines I've tested running later versions of Ubuntu and the Linux kernel (e.g. 12.04 and 3.0.29) with JDK 1.6 (the exact version doesn't seem to matter; I tried all the most recent ones from 1.6.0_30 onwards).

I was able to work around this by altering util.import to use a single-threaded approach, so I guess the problem lies not in GN but in the 1.6 JVM or, most likely, in the Linux kernel being used, as it doesn't happen on later versions of Linux with the 1.6 JVM. I'm closing this as invalid and describing the symptoms in case anyone comes across anything similar.
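As an illustration of the kind of workaround described above, here is a minimal sketch of running import tasks through a single worker thread. It assumes the import work is submitted to an ExecutorService; this is not the actual change made to util.import, and the class SequentialImport, the method importAll and the "imported:" result strings are all hypothetical placeholders.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch, not the real util.import service.
class SequentialImport {

    /** Runs one (placeholder) import task per file on a single worker thread. */
    static List<String> importAll(List<String> files) {
        // One worker only: tasks run strictly one after another,
        // so they never compete with each other for shared resources.
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            List<Future<String>> pending = new ArrayList<Future<String>>();
            for (final String file : files) {
                pending.add(pool.submit(new Callable<String>() {
                    public String call() {
                        return "imported:" + file; // stand-in for the real import work
                    }
                }));
            }
            List<String> results = new ArrayList<String>();
            for (Future<String> f : pending) {
                results.add(f.get()); // blocks until that task has finished
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("import interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("import task failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

The trade-off is throughput: a batch import takes longer, but the worker can no longer contend with sibling import threads.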

comment:2 by landry, 11 years ago

I think I've reproduced similar symptoms, but in a different use case: a user hitting 'save and validate'.

In certain circumstances, two threads end up fighting over access to the data.

  • this one is waiting to do the validation:

"http-9080-9" daemon prio=10 tid=0x000000004204c000 nid=0x5da3 waiting for monitor entry [0x00007fd5349c1000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.fao.geonet.kernel.DataManager.getXSDXmlReport(DataManager.java:926)
        - waiting to lock <0x00000000aa0cfcb0> (a org.fao.geonet.kernel.DataManager)
        at org.fao.geonet.kernel.DataManager.doValidate(DataManager.java:1874)
        at org.fao.geonet.services.metadata.AjaxEditUtils.validateMetadataEmbedded(AjaxEditUtils.java:576)
        at org.fao.geonet.services.metadata.Validate.exec(Validate.java:76)

  • this one took a lock in updateMetadata/updateFixedInfo, and seems to never return:

"http-9080-3" daemon prio=10 tid=0x0000000041f42800 nid=0x5d1c runnable [0x00007fd5359b5000]
   java.lang.Thread.State: RUNNABLE
        at java.lang.Throwable.fillInStackTrace(Native Method)
......
        at net.sf.saxon.instruct.ApplyTemplates.applyTemplates(ApplyTemplates.java:333)
        at net.sf.saxon.Controller.transformDocument(Controller.java:1807)
        at net.sf.saxon.Controller.transform(Controller.java:1621)
        at jeeves.utils.Xml.transform(Xml.java:477)
        at jeeves.utils.Xml.transform(Xml.java:362)
        at org.fao.geonet.kernel.DataManager.updateFixedInfo(DataManager.java:2590)
        at org.fao.geonet.kernel.DataManager.updateMetadata(DataManager.java:1716)

At that point, GeoNetwork takes 100% CPU, and a Tomcat restart is mandatory. I'll dig a bit more to figure out why updateFixedInfo never completes; this might be a problem in my schemas.

comment:3 by landry, 11 years ago

I'm not proficient in Java programming, but the full trace of the thread doing updateFixedInfo looks bizarre:

"http-9080-3" daemon prio=10 tid=0x0000000041f42800 nid=0x5d1c runnable [0x00007fd5359b5000]
   java.lang.Thread.State: RUNNABLE
        at java.lang.Throwable.fillInStackTrace(Native Method)
        - locked <0x00000000a0703200> (a java.util.ConcurrentModificationException)
        at java.lang.Throwable.<init>(Throwable.java:181)
        at java.lang.Exception.<init>(Exception.java:29)
        at java.lang.RuntimeException.<init>(RuntimeException.java:32)
        at java.util.ConcurrentModificationException.<init>(ConcurrentModificationException.java:57)
        at java.util.TreeMap$PrivateEntryIterator.nextEntry(TreeMap.java:1100)
        at java.util.TreeMap$ValueIterator.next(TreeMap.java:1145)
        at eu.medsea.mimeutil.MimeDetectorRegistry.getMimeTypes(MimeUtil2.java:1033)
        at eu.medsea.mimeutil.MimeUtil2.getMimeTypes(MimeUtil2.java:428)
        at eu.medsea.mimeutil.MimeUtil2.getMimeTypes(MimeUtil2.java:395)
        at eu.medsea.mimeutil.MimeUtil.getMimeTypes(MimeUtil.java:281)
        at org.fao.geonet.util.MimeTypeFinder.detectMimeTypeFile(MimeTypeFinder.java:78)
        at sun.reflect.GeneratedMethodAccessor278.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
.....
lots of saxon internals
.....

        at net.sf.saxon.Controller.transformDocument(Controller.java:1807)
        at net.sf.saxon.Controller.transform(Controller.java:1621)
        at jeeves.utils.Xml.transform(Xml.java:477)
        at jeeves.utils.Xml.transform(Xml.java:362)
        at org.fao.geonet.kernel.DataManager.updateFixedInfo(DataManager.java:2590)
        at org.fao.geonet.kernel.DataManager.updateMetadata(DataManager.java:1716)
        - locked <0x00000000aa0cfcb0> (a org.fao.geonet.kernel.DataManager)
        at org.fao.geonet.services.metadata.EditUtils.updateContent(EditUtils.java:171)
        at org.fao.geonet.services.metadata.AjaxEditUtils.updateContent(AjaxEditUtils.java:34)
        at org.fao.geonet.services.metadata.Update.exec(Update.java:114)

The thread calls into Saxon while holding the DataManager lock, then comes back into GeoNetwork in org.fao.geonet.util.MimeTypeFinder.detectMimeTypeFile. Is that safe, thread-wise?

It also triggers a ConcurrentModificationException, but I can't figure out whether it is deadlocking with itself or with the other thread.
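The ConcurrentModificationException in the trace above is thrown by a TreeMap iterator inside MimeDetectorRegistry.getMimeTypes. A TreeMap is not thread-safe and its iterators are fail-fast, so one possible explanation (an assumption, not confirmed here) is that the registry is shared between request threads and is modified while another thread iterates over it. The sketch below reproduces the exception deterministically in a single thread and shows one way to guard a shared map by iterating over a snapshot. RegistrySketch and its methods are hypothetical and are not the mime-util API.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a shared detector registry; not mime-util code.
class RegistrySketch {
    private final Map<String, String> detectors = new TreeMap<String, String>();

    /** All writes go through the same monitor as the snapshot below. */
    synchronized void register(String name, String detector) {
        detectors.put(name, detector);
    }

    /** Callers iterate over a private copy, never over the live map. */
    synchronized List<String> snapshot() {
        return new ArrayList<String>(detectors.values());
    }

    /** Shows the failure mode: a structural change during iteration. */
    static boolean unguardedIterationFails() {
        Map<String, String> live = new TreeMap<String, String>();
        live.put("a", "1");
        live.put("b", "2");
        try {
            for (String value : live.values()) {
                live.put("c" + value, value); // modifies the map mid-iteration
            }
            return false;
        } catch (ConcurrentModificationException expected) {
            return true; // the fail-fast iterator noticed the change
        }
    }
}
```

The fail-fast check is best-effort only: under genuinely concurrent access an unsynchronized TreeMap can also be left in a corrupted state, so guarding both reads and writes matters more than catching the exception.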

comment:4 by landry, 11 years ago

Component: General → Catalog server
Milestone: v2.8.0 RC0 → v2.9.0
Resolution: invalid
Status: closed → reopened
Version: v2.8.0RC0 → v2.8.0

Reopening, since I had that issue again yesterday on 2.8.x. Same conditions: the xml.metadata.validate service never returns and eats 100% CPU.
