Description of problem: This message is output repeatedly in the logfile: 2012-03-10 06:42:41,984 ERROR [ResourceDiscoveryComponent.invoker.daemon-308] (rhq.core.pc.util.DiscoveryComponentProxyFactory)- Thread [ResourceDiscoveryComponent.invoker.daemon-308] was interrupted. 2012-03-10 06:42:41,984 WARN [ResourceDiscoveryComponent.invoker.daemon-308] (rhq.core.pluginapi.inventory.ResourceContext)- Cannot get native process for resource [JVM] - discovery failed java.lang.RuntimeException: Call to [org.rhq.plugins.jmx.EmbeddedJMXServerDiscoveryComponent.discoverResources()] with args [[org.rhq.core.pluginapi.inventory.ResourceDiscoveryContext@797d8adf]] was interrupted. at org.rhq.core.pc.util.DiscoveryComponentProxyFactory$ResourceDiscoveryComponentInvocationHandler.invokeInNewThread(DiscoveryComponentProxyFactory.java:224) at org.rhq.core.pc.util.DiscoveryComponentProxyFactory$ResourceDiscoveryComponentInvocationHandler.invoke(DiscoveryComponentProxyFactory.java:207) at $Proxy43.discoverResources(Unknown Source) at org.rhq.core.pluginapi.inventory.ResourceContext.getNativeProcess(ResourceContext.java:230) at org.rhq.plugins.jmx.EmbeddedJMXServerDiscoveryComponent.discoverResources(EmbeddedJMXServerDiscoveryComponent.java:70) at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.rhq.core.pc.util.DiscoveryComponentProxyFactory$ComponentInvocationThread.call(DiscoveryComponentProxyFactory.java:292) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:619) Caused by: java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1238) at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:227) at java.util.concurrent.FutureTask.get(FutureTask.java:91) at org.rhq.core.pc.util.DiscoveryComponentProxyFactory$ResourceDiscoveryComponentInvocationHandler.invokeInNewThread(DiscoveryComponentProxyFactory.java:220) ... 13 more And 'Tomcat Server JVM' appears unavailable. Version-Release number of selected component (if applicable): rhq-jmx-plugin-4.3.0-SNAPSHOT.jar How reproducible: Seems to happen with Tomcat discovery.
Analysis. The call to 'getNativeProcess' in the plugin actually works asynchronously, but never begins since the interrupted state is set already. It could be there was a timeout that was never cleared for this thread. public class EmbeddedJMXServerDiscoveryComponent implements ResourceDiscoveryComponent<JMXComponent<?>> { ProcessInfo nativeProcess = context.getParentResourceContext().getNativeProcess(); ... When starting the agent: 2012-03-10 07:12:35,014 INFO [ResourceContainer.invoker.daemon-1] (org.mc4j.ems.connection.ConnectionFactory)- Connection library dependancy [catalina-cluster.jar] not found in directory: /usr/local/tomcat 2012-03-10 07:12:35,032 INFO [ResourceContainer.invoker.daemon-1] (org.mc4j.ems.connection.ConnectionFactory)- Connection library dependancy [catalina-optional.jar] not found in directory: /usr/local/tomcat 2012-03-10 07:12:35,032 INFO [ResourceContainer.invoker.daemon-1] (org.mc4j.ems.connection.ConnectionFactory)- Discovered libraries in 56 ms 2012-03-10 07:12:35,032 INFO [ResourceContainer.invoker.daemon-1] (jboss.on.plugins.tomcat.TomcatServerComponent)- Loading connection [service:jmx:rmi:///jndi/rmi://localhost:8889/jmxrmi] with install path [/usr/local/tomcat] and temp directory [/usr/local/rhq-agent/data/tmp] 2012-03-10 07:12:35,217 INFO [ResourceContainer.invoker.daemon-1] (ems.impl.jmx.connection.DConnection)- Querying MBeanServer for all MBeans... 2012-03-10 07:12:35,222 INFO [ResourceContainer.invoker.daemon-1] (ems.impl.jmx.connection.DConnection)- Found 133 MBeans - starting load... and it seems that it 'sticks' at this message for a bit and then: 2012-03-10 07:13:35,244 ERROR [ResourceContainer.invoker.daemon-1] (rhq.core.pc.util.DiscoveryComponentProxyFactory)- Thread [ResourceContainer.invoker.daemon-1] was interrupted.2012-03-10 07:13:35,244 ERROR [ResourceDiscoveryComponent.invoker.daemon-1] (rhq.core.pc.util.DiscoveryComponentProxyFactory)- Thread [ResourceDiscoveryCompon ent.invoker.daemon-1] was interrupted.2012-03-10 07:13:35,245 ERROR [ResourceDiscoveryComponent.invoker.daemon-2] (rhq.core.pc.util.DiscoveryComponentProxyFactory)- Thread [ResourceDiscoveryCompon ent.invoker.daemon-2] was interrupted.2012-03-10 07:13:35,252 ERROR [ResourceDiscoveryComponent.invoker.daemon-3] (rhq.core.pc.util.DiscoveryComponentProxyFactory)- Thread [ResourceDiscoveryCompon ... So there's at least two problems: 1) Interrupted state needs to be cleared upon timeout 2) There's a timeout happening that probably shouldn't And why are all these threads getting interrupted when it's on the MBean discovery blocking? I don't understand the threading design so there's a lot of questions I have...
Here's what I see immediately after (notice the time) 2012-03-10 07:13:35,244 ERROR [ResourceDiscoveryComponent.invoker.daemon-1] (rhq.core.pc.util.DiscoveryComponentProxyFactory)- Thread [ResourceDiscoveryComponent.invoker.daemon-1] was interrupted. 2012-03-10 07:13:35,245 ERROR [ResourceDiscoveryComponent.invoker.daemon-2] (rhq.core.pc.util.DiscoveryComponentProxyFactory)- Thread [ResourceDiscoveryComponent.invoker.daemon-2] was interrupted. 2012-03-10 07:13:35,252 ERROR [ResourceDiscoveryComponent.invoker.daemon-3] (rhq.core.pc.util.DiscoveryComponentProxyFactory)- Thread [ResourceDiscoveryComponent.invoker.daemon-3] was interrupted. 2012-03-10 07:13:35,252 ERROR [ResourceDiscoveryComponent.invoker.daemon-4] (rhq.core.pc.util.DiscoveryComponentProxyFactory)- Thread [ResourceDiscoveryComponent.invoker.daemon-4] was interrupted. 2012-03-10 07:13:35,252 ERROR [ResourceDiscoveryComponent.invoker.daemon-5] (rhq.core.pc.util.DiscoveryComponentProxyFactory)- Thread [ResourceDiscoveryComponent.invoker.daemon-5] was interrupted. 2012-03-10 07:13:35,252 ERROR [ResourceDiscoveryComponent.invoker.daemon-6] (rhq.core.pc.util.DiscoveryComponentProxyFactory)- Thread [ResourceDiscoveryComponent.invoker.daemon-6] was interrupted. 2012-03-10 07:13:35,252 ERROR [ResourceDiscoveryComponent.invoker.daemon-7] (rhq.core.pc.util.DiscoveryComponentProxyFactory)- Thread [ResourceDiscoveryComponent.invoker.daemon-7] was interrupted. 2012-03-10 07:13:35,252 ERROR [ResourceDiscoveryComponent.invoker.daemon-8] (rhq.core.pc.util.DiscoveryComponentProxyFactory)- Thread [ResourceDiscoveryComponent.invoker.daemon-8] was interrupted. 2012-03-10 07:13:35,252 ERROR [ResourceDiscoveryComponent.invoker.daemon-9] (rhq.core.pc.util.DiscoveryComponentProxyFactory)- Thread [ResourceDiscoveryComponent.invoker.daemon-9] was interrupted. 2012-03-10 07:13:35,253 ERROR [ResourceDiscoveryComponent.invoker.daemon-10] (rhq.core.pc.util.DiscoveryComponentProxyFactory)- Thread [ResourceDiscoveryComponent.invoker.daemon-10] was interrupted. 2012-03-10 07:13:35,253 ERROR [ResourceDiscoveryComponent.invoker.daemon-11] (rhq.core.pc.util.DiscoveryComponentProxyFactory)- Thread [ResourceDiscoveryComponent.invoker.daemon-11] was interrupted. 2012-03-10 07:13:35,253 ERROR [ResourceDiscoveryComponent.invoker.daemon-12] (rhq.core.pc.util.DiscoveryComponentProxyFactory)- Thread [ResourceDiscoveryComponent.invoker.daemon-12] was interrupted. 2012-03-10 07:13:35,253 ERROR [ResourceDiscoveryComponent.invoker.daemon-13] (rhq.core.pc.util.DiscoveryComponentProxyFactory)- Thread [ResourceDiscoveryComponent.invoker.daemon-13] was interrupted. 2012-03-10 07:13:35,253 ERROR [ResourceDiscoveryComponent.invoker.daemon-14] (rhq.core.pc.util.DiscoveryComponentProxyFactory)- Thread [ResourceDiscoveryComponent.invoker.daemon-14] was interrupted. 2012-03-10 07:13:35,253 ERROR [ResourceDiscoveryComponent.invoker.daemon-15] (rhq.core.pc.util.DiscoveryComponentProxyFactory)- Thread [ResourceDiscoveryComponent.invoker.daemon-15] was interrupted. 2012-03-10 07:13:35,253 ERROR [ResourceDiscoveryComponent.invoker.daemon-16] (rhq.core.pc.util.DiscoveryComponentProxyFactory)- Thread [ResourceDiscoveryComponent.invoker.daemon-16] was interrupted.
priority = high, target 3.1 per bz traige (crouch, foley, loleary)
With more debug enabled I see this: 2012-03-13 01:25:18,024 DEBUG [ResourceDiscoveryComponent.invoker.daemon-742] (org.rhq.core.system.ProcessInfo)- Retrieval of process info for pid 650 took 0 ms. 2012-03-13 01:25:18,024 DEBUG [ResourceDiscoveryComponent.invoker.daemon-742] (org.rhq.core.system.ProcessInfo)- Unexpected error occurred while looking up info for [<unknown>] process with pid [651] - call to getProcExe failed. Did the process die? Cause: org.hyperic.sigar.SigarFileNotFoundException: No such file or directory 2012-03-13 01:25:18,024 DEBUG [ResourceDiscoveryComponent.invoker.daemon-742] (org.rhq.core.system.ProcessInfo)- Retrieval of process info for pid 651 took 0 ms. 2012-03-13 01:25:18,024 DEBUG [ResourceDiscoveryComponent.invoker.daemon-742] (org.rhq.core.system.ProcessInfo)- Unexpected error occurred while looking up info for [<unknown>] process with pid [652] - call to getProcExe failed. Did the process die? Cause: org.hyperic.sigar.SigarFileNotFoundException: No such file or directory ... 2012-03-13 01:25:18,038 DEBUG [ResourceDiscoveryComponent.invoker.daemon-742] (org.rhq.core.system.ProcessInfo)- Retrieval of process info for pid 3013 took 1 ms. 2012-03-13 01:25:18,038 DEBUG [ResourceDiscoveryComponent.invoker.daemon-742] (org.rhq.core.system.ProcessInfo)- Retrieval of process info for pid 3825 took 0 ms. 2012-03-13 01:25:18,038 DEBUG [ResourceDiscoveryComponent.invoker.daemon-742] (org.rhq.core.system.ProcessInfo)- Retrieval of process info for pid 3828 took 0 ms. 2012-03-13 01:25:18,038 DEBUG [ResourceDiscoveryComponent.invoker.daemon-742] (org.rhq.core.system.ProcessInfo)- Retrieval of process info for pid 3837 took 0 ms. 2012-03-13 01:25:18,039 DEBUG [ResourceDiscoveryComponent.invoker.daemon-742] (org.rhq.core.system.ProcessInfo)- Retrieval of process info for pid 3850 took 1 ms. 2012-03-13 01:25:18,039 DEBUG [ResourceDiscoveryComponent.invoker.daemon-742] (org.rhq.core.system.ProcessInfo)- Retrieval of process info for pid 3859 took 0 ms. 2012-03-13 01:25:18,039 DEBUG [ResourceDiscoveryComponent.invoker.daemon-742] (org.rhq.core.system.ProcessInfo)- Retrieval of process info for pid 3885 took 0 ms. 2012-03-13 01:25:18,039 DEBUG [ResourceDiscoveryComponent.invoker.daemon-742] (org.rhq.core.system.ProcessInfo)- Retrieval of process info for pid 3948 took 0 ms. 2012-03-13 01:25:18,040 DEBUG [ResourceDiscoveryComponent.invoker.daemon-742] (org.rhq.core.system.ProcessInfo)- Retrieval of process info for pid 3956 took 1 ms. 2012-03-13 01:25:18,040 DEBUG [ResourceDiscoveryComponent.invoker.daemon-742] (org.rhq.core.system.ProcessInfo)- Retrieval of process info for pid 3984 took 0 ms. rocess with pid [4077] - call to getProcExe failed. Did the process die? Cause: org.hyperic.sigar.SigarFileNotFoundException: No such file or directory ... etc etc I think this could be related or the cause. Also Note: This is a 5.3/5.4 plugin running on a 5.2 agent.
I have not been able to replicate this. Looking at the logs, it appears the retrieval of the MBean metadata took a long time and that caused the timeout. The call into the discovery component is done asynchronously in a thread pool to place a timeout on the call - in this case, it appears the timeout is 60 seconds.
Wow, I haven't looked closely at EmbeddedJMXServerDiscoveryComponent in a long time. I think it is doing a lot of stuff it should not be, including the getNativeProcess() call in order to look for the jmx remoting sysprop. EmbeddedJMXServerDiscoveryComponent's parent component is already a JMXComponent. As far as I'm concerned, it should -always- just delegate the parent component's EMSConnection to do its JMX stuff. It should not be doing any JMX server discovery of its own. This appears to have regressed way back in 2008.
This is a manifestation of two problems: 1) ResourceContext.getNativeProcess() is incredibly inefficient - it often triggers unnecessary discovery scans, which can take some time (I'm not surprised you hit the 60s discovery facet timeout. We already have a BZ for this issue (bug 819521), which I'm hoping we can address in RHQ 4.5. 2) As I mentioned in my previous comment, EmbeddedJMXServerDiscoveryComponent should not be doing any JVM command line parsing or JMX remoting connections period - instead it should just delegate to its parent component's EMSConnection. I am going to use this BZ to track this issue (I've updated the BZ title to reflect that).
Commits went into master to address this BZ http://git.fedorahosted.org/git/?p=rhq/rhq.git;a=commit;h=b3019a68a4fa0a872ea6516c01ca4e17a8e69c91 I'm not sure if this is complete or whether more work is still needed here.
Sounds like a candidate to be explicitly mentioned in the release notes.
It turns out more work was needed. There was a logic bug that pretty much broke discovery of embedded JVM's, which unfortunately made it into the 4.4 release. http://git.fedorahosted.org/git?p=rhq/rhq.git;a=commitdiff;h=569559b fixes this bug in master.
It turns out the faulty fix did *not* go into 4.4 :) I am reopening this, as I may want to do a bit more work on it, to make it as flexible as possible.
Reassigning to Charles to handle, since this is a customer case and a high priority.
Setting out of ON_DEV state due to inactivity, and asking Elias if this can be closed...
I haven't seen this issue in newer versions but I'm not clear what was actually done here. My original complaint was about discovery timing out because of the call to 'getNativeProcess'. Also, it seemed that InterruptedException would continue to repeat (was not cleared.) Feel free to close, however.
Thanks Elias, between what I see here, and what I see in Bug 823704, I think both can be closed.