Bug 802003

Summary: embedded JVM discovery code tries to discover JVM process and connect to it via JMX remoting, rather than just using its parent Tomcat or JBoss AS Resource's EMSConnection, which is what it's supposed to do
Product: [Other] RHQ Project Reporter: Elias Ross <genman>
Component: Plugin ContainerAssignee: Charles Crouch <ccrouch>
Status: CLOSED CURRENTRELEASE QA Contact: Mike Foley <mfoley>
Severity: high Docs Contact:
Priority: high    
Version: 4.3, 4.4CC: ccrouch, dlackey, genman, hbrock, hrupp, jshaughn, loleary, mazz
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
See Also: https://bugzilla.redhat.com/show_bug.cgi?id=819521
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 823704 825300 (view as bug list) Environment:
Last Closed: 2013-09-12 18:02:13 EDT Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Bug Depends On:    
Bug Blocks: 823704, 825300    

Description Elias Ross 2012-03-10 01:52:42 EST
Description of problem:

This message is output repeatedly in the logfile:

2012-03-10 06:42:41,984 ERROR [ResourceDiscoveryComponent.invoker.daemon-308] (rhq.core.pc.util.DiscoveryComponentProxyFactory)- Thread [ResourceDiscoveryComponent.invoker.daemon-308] was interrupted.
2012-03-10 06:42:41,984 WARN  [ResourceDiscoveryComponent.invoker.daemon-308] (rhq.core.pluginapi.inventory.ResourceContext)- Cannot get native process for resource [JVM] - discovery failed
java.lang.RuntimeException: Call to [org.rhq.plugins.jmx.EmbeddedJMXServerDiscoveryComponent.discoverResources()] with args [[org.rhq.core.pluginapi.inventory.ResourceDiscoveryContext@797d8adf]] was interrupted.
        at org.rhq.core.pc.util.DiscoveryComponentProxyFactory$ResourceDiscoveryComponentInvocationHandler.invokeInNewThread(DiscoveryComponentProxyFactory.java:224)
        at org.rhq.core.pc.util.DiscoveryComponentProxyFactory$ResourceDiscoveryComponentInvocationHandler.invoke(DiscoveryComponentProxyFactory.java:207)
        at $Proxy43.discoverResources(Unknown Source)
        at org.rhq.core.pluginapi.inventory.ResourceContext.getNativeProcess(ResourceContext.java:230)
        at org.rhq.plugins.jmx.EmbeddedJMXServerDiscoveryComponent.discoverResources(EmbeddedJMXServerDiscoveryComponent.java:70)
        at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.rhq.core.pc.util.DiscoveryComponentProxyFactory$ComponentInvocationThread.call(DiscoveryComponentProxyFactory.java:292)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1238)
        at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:227)
        at java.util.concurrent.FutureTask.get(FutureTask.java:91)
        at org.rhq.core.pc.util.DiscoveryComponentProxyFactory$ResourceDiscoveryComponentInvocationHandler.invokeInNewThread(DiscoveryComponentProxyFactory.java:220)
        ... 13 more

And 'Tomcat Server JVM' appears unavailable.

Version-Release number of selected component (if applicable):

rhq-jmx-plugin-4.3.0-SNAPSHOT.jar


How reproducible:

Seems to happen with Tomcat discovery.
Comment 1 Elias Ross 2012-03-10 02:18:53 EST
Analysis.

The call to 'getNativeProcess' in the plugin actually works asynchronously, but never begins since the interrupted state is set already.

It could be there was a timeout that was never cleared for this thread.

public class EmbeddedJMXServerDiscoveryComponent implements ResourceDiscoveryComponent<JMXComponent<?>> {

            ProcessInfo nativeProcess = context.getParentResourceContext().getNativeProcess();

...

When starting the agent:

2012-03-10 07:12:35,014 INFO  [ResourceContainer.invoker.daemon-1] (org.mc4j.ems.connection.ConnectionFactory)- Connection library dependancy [catalina-cluster.jar] not found  in directory: /usr/local/tomcat
2012-03-10 07:12:35,032 INFO  [ResourceContainer.invoker.daemon-1] (org.mc4j.ems.connection.ConnectionFactory)- Connection library dependancy [catalina-optional.jar] not found  in directory: /usr/local/tomcat
2012-03-10 07:12:35,032 INFO  [ResourceContainer.invoker.daemon-1] (org.mc4j.ems.connection.ConnectionFactory)- Discovered libraries in 56 ms
2012-03-10 07:12:35,032 INFO  [ResourceContainer.invoker.daemon-1] (jboss.on.plugins.tomcat.TomcatServerComponent)- Loading connection [service:jmx:rmi:///jndi/rmi://localhost:8889/jmxrmi] with install path [/usr/local/tomcat] and temp directory [/usr/local/rhq-agent/data/tmp]

2012-03-10 07:12:35,217 INFO  [ResourceContainer.invoker.daemon-1] (ems.impl.jmx.connection.DConnection)- Querying MBeanServer for all MBeans...
2012-03-10 07:12:35,222 INFO  [ResourceContainer.invoker.daemon-1] (ems.impl.jmx.connection.DConnection)- Found 133 MBeans - starting load...

and it seems that it 'sticks' at this message for a bit and then:

2012-03-10 07:13:35,244 ERROR [ResourceContainer.invoker.daemon-1] (rhq.core.pc.util.DiscoveryComponentProxyFactory)- Thread [ResourceContainer.invoker.daemon-1] was interrupted.2012-03-10 07:13:35,244 ERROR [ResourceDiscoveryComponent.invoker.daemon-1] (rhq.core.pc.util.DiscoveryComponentProxyFactory)- Thread [ResourceDiscoveryCompon
ent.invoker.daemon-1] was interrupted.2012-03-10 07:13:35,245 ERROR [ResourceDiscoveryComponent.invoker.daemon-2] (rhq.core.pc.util.DiscoveryComponentProxyFactory)- Thread [ResourceDiscoveryCompon
ent.invoker.daemon-2] was interrupted.2012-03-10 07:13:35,252 ERROR [ResourceDiscoveryComponent.invoker.daemon-3] (rhq.core.pc.util.DiscoveryComponentProxyFactory)- Thread [ResourceDiscoveryCompon

... 

So there's at least two problems:

1) Interrupted state needs to be cleared upon timeout
2) There's a timeout happening that probably shouldn't

And why are all these threads getting interrupted when it's on the MBean discovery blocking?

I don't understand the threading design so there's a lot of questions I have...
Comment 2 Elias Ross 2012-03-10 02:20:35 EST
Here's what I see immediately after (notice the time)

2012-03-10 07:13:35,244 ERROR [ResourceDiscoveryComponent.invoker.daemon-1] (rhq.core.pc.util.DiscoveryComponentProxyFactory)- Thread [ResourceDiscoveryComponent.invoker.daemon-1] was interrupted.
2012-03-10 07:13:35,245 ERROR [ResourceDiscoveryComponent.invoker.daemon-2] (rhq.core.pc.util.DiscoveryComponentProxyFactory)- Thread [ResourceDiscoveryComponent.invoker.daemon-2] was interrupted.
2012-03-10 07:13:35,252 ERROR [ResourceDiscoveryComponent.invoker.daemon-3] (rhq.core.pc.util.DiscoveryComponentProxyFactory)- Thread [ResourceDiscoveryComponent.invoker.daemon-3] was interrupted.
2012-03-10 07:13:35,252 ERROR [ResourceDiscoveryComponent.invoker.daemon-4] (rhq.core.pc.util.DiscoveryComponentProxyFactory)- Thread [ResourceDiscoveryComponent.invoker.daemon-4] was interrupted.
2012-03-10 07:13:35,252 ERROR [ResourceDiscoveryComponent.invoker.daemon-5] (rhq.core.pc.util.DiscoveryComponentProxyFactory)- Thread [ResourceDiscoveryComponent.invoker.daemon-5] was interrupted.
2012-03-10 07:13:35,252 ERROR [ResourceDiscoveryComponent.invoker.daemon-6] (rhq.core.pc.util.DiscoveryComponentProxyFactory)- Thread [ResourceDiscoveryComponent.invoker.daemon-6] was interrupted.
2012-03-10 07:13:35,252 ERROR [ResourceDiscoveryComponent.invoker.daemon-7] (rhq.core.pc.util.DiscoveryComponentProxyFactory)- Thread [ResourceDiscoveryComponent.invoker.daemon-7] was interrupted.
2012-03-10 07:13:35,252 ERROR [ResourceDiscoveryComponent.invoker.daemon-8] (rhq.core.pc.util.DiscoveryComponentProxyFactory)- Thread [ResourceDiscoveryComponent.invoker.daemon-8] was interrupted.
2012-03-10 07:13:35,252 ERROR [ResourceDiscoveryComponent.invoker.daemon-9] (rhq.core.pc.util.DiscoveryComponentProxyFactory)- Thread [ResourceDiscoveryComponent.invoker.daemon-9] was interrupted.
2012-03-10 07:13:35,253 ERROR [ResourceDiscoveryComponent.invoker.daemon-10] (rhq.core.pc.util.DiscoveryComponentProxyFactory)- Thread [ResourceDiscoveryComponent.invoker.daemon-10] was interrupted.
2012-03-10 07:13:35,253 ERROR [ResourceDiscoveryComponent.invoker.daemon-11] (rhq.core.pc.util.DiscoveryComponentProxyFactory)- Thread [ResourceDiscoveryComponent.invoker.daemon-11] was interrupted.
2012-03-10 07:13:35,253 ERROR [ResourceDiscoveryComponent.invoker.daemon-12] (rhq.core.pc.util.DiscoveryComponentProxyFactory)- Thread [ResourceDiscoveryComponent.invoker.daemon-12] was interrupted.
2012-03-10 07:13:35,253 ERROR [ResourceDiscoveryComponent.invoker.daemon-13] (rhq.core.pc.util.DiscoveryComponentProxyFactory)- Thread [ResourceDiscoveryComponent.invoker.daemon-13] was interrupted.
2012-03-10 07:13:35,253 ERROR [ResourceDiscoveryComponent.invoker.daemon-14] (rhq.core.pc.util.DiscoveryComponentProxyFactory)- Thread [ResourceDiscoveryComponent.invoker.daemon-14] was interrupted.
2012-03-10 07:13:35,253 ERROR [ResourceDiscoveryComponent.invoker.daemon-15] (rhq.core.pc.util.DiscoveryComponentProxyFactory)- Thread [ResourceDiscoveryComponent.invoker.daemon-15] was interrupted.
2012-03-10 07:13:35,253 ERROR [ResourceDiscoveryComponent.invoker.daemon-16] (rhq.core.pc.util.DiscoveryComponentProxyFactory)- Thread [ResourceDiscoveryComponent.invoker.daemon-16] was interrupted.
Comment 3 Mike Foley 2012-03-12 11:34:14 EDT
priority = high, target 3.1 per bz traige (crouch, foley, loleary)
Comment 4 Elias Ross 2012-03-12 21:30:05 EDT
With more debug enabled I see this:

2012-03-13 01:25:18,024 DEBUG [ResourceDiscoveryComponent.invoker.daemon-742] (org.rhq.core.system.ProcessInfo)- Retrieval of process info for pid 650 took 0 ms.
2012-03-13 01:25:18,024 DEBUG [ResourceDiscoveryComponent.invoker.daemon-742] (org.rhq.core.system.ProcessInfo)- Unexpected error occurred while looking up info for [<unknown>] process with pid [651] - call to getProcExe failed. Did the process die? Cause: org.hyperic.sigar.SigarFileNotFoundException: No such file or directory
2012-03-13 01:25:18,024 DEBUG [ResourceDiscoveryComponent.invoker.daemon-742] (org.rhq.core.system.ProcessInfo)- Retrieval of process info for pid 651 took 0 ms.
2012-03-13 01:25:18,024 DEBUG [ResourceDiscoveryComponent.invoker.daemon-742] (org.rhq.core.system.ProcessInfo)- Unexpected error occurred while looking up info for [<unknown>] process with pid [652] - call to getProcExe failed. Did the process die? Cause: org.hyperic.sigar.SigarFileNotFoundException: No such file or directory
...

2012-03-13 01:25:18,038 DEBUG [ResourceDiscoveryComponent.invoker.daemon-742] (org.rhq.core.system.ProcessInfo)- Retrieval of process info for pid 3013 took 1 ms.
2012-03-13 01:25:18,038 DEBUG [ResourceDiscoveryComponent.invoker.daemon-742] (org.rhq.core.system.ProcessInfo)- Retrieval of process info for pid 3825 took 0 ms.
2012-03-13 01:25:18,038 DEBUG [ResourceDiscoveryComponent.invoker.daemon-742] (org.rhq.core.system.ProcessInfo)- Retrieval of process info for pid 3828 took 0 ms.
2012-03-13 01:25:18,038 DEBUG [ResourceDiscoveryComponent.invoker.daemon-742] (org.rhq.core.system.ProcessInfo)- Retrieval of process info for pid 3837 took 0 ms.
2012-03-13 01:25:18,039 DEBUG [ResourceDiscoveryComponent.invoker.daemon-742] (org.rhq.core.system.ProcessInfo)- Retrieval of process info for pid 3850 took 1 ms.
2012-03-13 01:25:18,039 DEBUG [ResourceDiscoveryComponent.invoker.daemon-742] (org.rhq.core.system.ProcessInfo)- Retrieval of process info for pid 3859 took 0 ms.
2012-03-13 01:25:18,039 DEBUG [ResourceDiscoveryComponent.invoker.daemon-742] (org.rhq.core.system.ProcessInfo)- Retrieval of process info for pid 3885 took 0 ms.
2012-03-13 01:25:18,039 DEBUG [ResourceDiscoveryComponent.invoker.daemon-742] (org.rhq.core.system.ProcessInfo)- Retrieval of process info for pid 3948 took 0 ms.
2012-03-13 01:25:18,040 DEBUG [ResourceDiscoveryComponent.invoker.daemon-742] (org.rhq.core.system.ProcessInfo)- Retrieval of process info for pid 3956 took 1 ms.
2012-03-13 01:25:18,040 DEBUG [ResourceDiscoveryComponent.invoker.daemon-742] (org.rhq.core.system.ProcessInfo)- Retrieval of process info for pid 3984 took 0 ms.
rocess with pid [4077] - call to getProcExe failed. Did the process die? Cause: org.hyperic.sigar.SigarFileNotFoundException: No such file or directory

... etc etc

I think this could be related or the cause.

Also Note: This is a 5.3/5.4 plugin running on a 5.2 agent.
Comment 5 John Mazzitelli 2012-04-09 17:13:23 EDT
I have not been able to replicate this. Looking at the logs, it appears the retrieval of the MBean metadata took a long time and that caused the timeout. The call into the discovery component is done asynchronously in a thread pool to place a timeout on the call - in this case, it appears the timeout is 60 seconds.
Comment 6 Ian Springer 2012-05-10 16:38:56 EDT
Wow, I haven't looked closely at EmbeddedJMXServerDiscoveryComponent in a long time. I think it is doing a lot of stuff it should not be, including the getNativeProcess() call in order to look for the jmx remoting sysprop. EmbeddedJMXServerDiscoveryComponent's parent component is already a JMXComponent. As far as I'm concerned, it should -always- just delegate the parent component's EMSConnection to do its JMX stuff. It should not be doing any JMX server discovery of its own. This appears to have regressed way back in 2008.
Comment 7 Ian Springer 2012-05-10 21:54:16 EDT
This is a manifestation of two problems:

1) ResourceContext.getNativeProcess() is incredibly inefficient - it often triggers unnecessary discovery scans, which can take some time (I'm not surprised you hit the 60s discovery facet timeout. We already have a BZ for this issue (bug 819521), which I'm hoping we can address in RHQ 4.5.

2) As I mentioned in my previous comment, EmbeddedJMXServerDiscoveryComponent should not be doing any JVM command line parsing or JMX remoting connections period - instead it should just delegate to its parent component's EMSConnection. I am going to use this BZ to track this issue (I've updated the BZ title to reflect that).
Comment 8 Charles Crouch 2012-05-21 18:18:00 EDT
Commits went into master to address this BZ
http://git.fedorahosted.org/git/?p=rhq/rhq.git;a=commit;h=b3019a68a4fa0a872ea6516c01ca4e17a8e69c91

I'm not sure if this is complete or whether more work is still needed here.
Comment 10 Heiko W. Rupp 2012-05-22 03:03:27 EDT
Sounds like a candidate to be explicitly mentioned in the release notes.
Comment 11 Ian Springer 2012-05-25 10:36:40 EDT
It turns out more work was needed. There was a logic bug that pretty much broke discovery of embedded JVM's, which unfortunately made it into the 4.4 release.

http://git.fedorahosted.org/git?p=rhq/rhq.git;a=commitdiff;h=569559b fixes this bug in master.
Comment 12 Ian Springer 2012-05-25 11:20:57 EDT
It turns out the faulty fix did *not* go into 4.4  :)  I am reopening this, as I may want to do a bit more work on it, to make it as flexible as possible.
Comment 13 Deon Ballard 2012-08-27 15:38:42 EDT
Reassigning to Charles to handle, since this is a customer case and a high priority.
Comment 14 Jay Shaughnessy 2013-09-12 16:39:10 EDT
Setting out of ON_DEV state due to inactivity, and asking Elias if this can be closed...
Comment 15 Elias Ross 2013-09-12 17:50:43 EDT
I haven't seen this issue in newer versions but I'm not clear what was actually done here. My original complaint was about discovery timing out because of the call to 'getNativeProcess'. Also, it seemed that InterruptedException would continue to repeat (was not cleared.)

Feel free to close, however.
Comment 16 Jay Shaughnessy 2013-09-12 18:02:13 EDT
Thanks Elias, between what I see here, and what I see in Bug 823704, I think both can be closed.