Bug 1003797

Summary: Failed to start component for RHQ Server Subsystems
Product: [Other] RHQ Project
Reporter: Ilya Maleev <imaleev>
Component: Plugins
Assignee: Nobody <nobody>
Status: ON_QA
Severity: low
Priority: unspecified
Version: 4.9
CC: genman, hrupp, mazz
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Type: Bug
Bug Blocks: 1005708    
Attachments:
Description Flags
agent.log none

Description Ilya Maleev 2013-09-03 09:06:43 UTC
Created attachment 793073
agent.log

Resources under RHQ Server Subsystems intermittently become unavailable:
Alert, Communications, Group Definition, Measurement, Remote API

The following exception is logged repeatedly in agent.log:

2013-08-29 09:49:27,311 ERROR [InventoryManager.discovery-1] (rhq.core.pc.inventory.InventoryManager)- Failed to start component for Resource[id=11660, uuid=4c3dedf1-8311-48dc-8484-3903b90dcef1, type={RHQServer}RHQ Server Remote API Subsystem, key=rhq.remoting:type=RemoteApiMetrics, name=Remote API Subsystem, parent=RHQ Server Subsystems] from synchronized merge.
org.rhq.core.clientapi.agent.PluginContainerException: Failed to start component for resource Resource[id=11660, uuid=4c3dedf1-8311-48dc-8484-3903b90dcef1, type={RHQServer}RHQ Server Remote API Subsystem, key=rhq.remoting:type=RemoteApiMetrics, name=Remote API Subsystem, parent=RHQ Server Subsystems].
    at org.rhq.core.pc.inventory.InventoryManager.activateResource(InventoryManager.java:1831)
    at org.rhq.core.pc.inventory.InventoryManager.refreshResourceComponentState(InventoryManager.java:3226)
    at org.rhq.core.pc.inventory.InventoryManager.processSyncInfo(InventoryManager.java:2815)
    at org.rhq.core.pc.inventory.InventoryManager.processSyncInfo(InventoryManager.java:2821)
    at org.rhq.core.pc.inventory.InventoryManager.processSyncInfo(InventoryManager.java:2821)
    at org.rhq.core.pc.inventory.InventoryManager.processSyncInfo(InventoryManager.java:2821)
    at org.rhq.core.pc.inventory.InventoryManager.synchInventory(InventoryManager.java:1145)
    at org.rhq.core.pc.inventory.InventoryManager.synchInventory(InventoryManager.java:1115)
    at org.rhq.core.pc.inventory.InventoryManager.handleReport(InventoryManager.java:1097)
    at org.rhq.core.pc.inventory.RuntimeDiscoveryExecutor.call(RuntimeDiscoveryExecutor.java:129)
    at org.rhq.core.pc.inventory.RuntimeDiscoveryExecutor.call(RuntimeDiscoveryExecutor.java:64)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)
Caused by: org.rhq.core.pc.inventory.TimeoutException: [Warning] Call to [org.rhq.plugins.server.RemoteAPIResourceComponent.start()] with args [[org.rhq.core.pluginapi.inventory.ResourceContext@71903dd9]] timed out after 60000 milliseconds - invocation thread will be interrupted.
    at org.rhq.core.clientapi.agent.PluginContainerException.wrapIfNecessary(PluginContainerException.java:69)
    at org.rhq.core.clientapi.agent.PluginContainerException.<init>(PluginContainerException.java:96)
    ... 18 more
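
The "Caused by" entry in the trace reports that the start() call was abandoned after 60000 milliseconds and that the invocation thread would be interrupted. The general technique of bounding a call with a timeout can be sketched as follows. This is only an illustration built on java.util.concurrent, not the plugin container's actual implementation; the class name TimedInvoker and the method runWithTimeout are hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper; not part of RHQ.
final class TimedInvoker {
    /**
     * Runs the call on a worker thread and waits at most timeoutMillis.
     * Returns true if the call completed in time, false if it timed out
     * (in which case the worker thread is interrupted).
     */
    static boolean runWithTimeout(Callable<?> call, long timeoutMillis) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            Future<?> future = worker.submit(call);
            try {
                future.get(timeoutMillis, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                future.cancel(true); // interrupts the invocation thread
                return false;
            } catch (ExecutionException e) {
                // The call itself failed; surface its cause unchecked.
                throw new IllegalStateException(e.getCause());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                throw new IllegalStateException(e);
            }
        } finally {
            worker.shutdownNow(); // never leave the worker thread behind
        }
    }
}
```

A call that blocks on a dead remote connection would return false here once the timeout elapses, which matches the symptom in the log: the component never finishes starting, so the resource stays unavailable.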

Comment 1 Heiko W. Rupp 2013-09-03 10:15:56 UTC
Probably a dup / related to Bug 997601

Comment 2 John Mazzitelli 2013-10-07 18:31:16 UTC
I tried to fix this by rebuilding the connection on error. However, I'm now hitting the problem reported in a BZ against JBoss EAP, bug #900595.

Comment 3 John Mazzitelli 2013-10-07 18:36:53 UTC
In MBeanResourceComponent in the JMX plugin:

    private boolean isMBeanAvailable() {
        EmsBean emsBean = getEmsBean();
        boolean isAvailable = emsBean.isRegistered();

that isRegistered() call is what throws the exception.

We need to get the connection to re-establish itself. However, there appear to be bugs in JBossAS/EAP in this area that make that non-trivial.

I tried this in JBossAS7JMXComponent, but I then hit bug #900595:

     @Override
     public AvailabilityType getAvailability() {
 
-        // TODO (jshaughn): Figure out why this hangs, it seems so innocuous :)
-        //EmsConnection conn = getEmsConnection();
-        //ConnectionProvider connectionProvider = (null != conn) ? conn.getConnectionProvider() : null;
-        //return (null != connectionProvider && connectionProvider.isConnected()) ? AvailabilityType.UP
-        //    : AvailabilityType.DOWN;
-
-        return AvailabilityType.UP;
+        try {
+            EmsConnection conn = getEmsConnection();
+            if (conn == null) {
+                return AvailabilityType.DOWN;
+            }
+            conn.queryBeans("java.lang:*"); // just make a request over the connection to make sure its valid
+            return AvailabilityType.UP;
+        } catch (Throwable t) {
+            try {
+                this.connection.close(); // try to clean up
+            } catch (Throwable ignore) {
+            }
+            this.connection = null;
+            return AvailabilityType.DOWN;
+        }
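
For readers without the RHQ source at hand, the idea behind the patch above (probe the connection with a cheap request, and on any failure discard it so the next check reconnects) can be sketched in isolation. This is a minimal illustration, not the committed code: Availability, Connection, ConnectionFactory and ProbingComponent are hypothetical stand-ins for AvailabilityType, EmsConnection and JBossAS7JMXComponent, and the sketch opens the connection lazily inside the check, which the real component may not do.

```java
// Hypothetical stand-ins; these are NOT the RHQ or EMS types.
enum Availability { UP, DOWN }

interface Connection {
    void probe() throws Exception; // any cheap round trip, e.g. a bean query
    void close() throws Exception;
}

interface ConnectionFactory {
    Connection open() throws Exception;
}

final class ProbingComponent {
    private final ConnectionFactory factory;
    private Connection connection; // null means "reconnect on the next check"

    ProbingComponent(ConnectionFactory factory) {
        this.factory = factory;
    }

    Availability getAvailability() {
        try {
            if (connection == null) {
                connection = factory.open();
            }
            connection.probe(); // fails if the cached connection is stale
            return Availability.UP;
        } catch (Exception e) {
            discardConnection();
            return Availability.DOWN;
        }
    }

    // Drop the cached connection so the next check starts from scratch.
    private void discardConnection() {
        Connection stale = connection;
        connection = null;
        if (stale != null) {
            try {
                stale.close(); // best-effort cleanup
            } catch (Exception ignore) {
                // nothing useful can be done with a close failure
            }
        }
    }
}
```

With this shape, the check that first sees a stale connection reports DOWN, and the following check gets a fresh connection and can report UP again.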

Comment 4 John Mazzitelli 2013-10-07 18:38:26 UTC
*** Bug 997601 has been marked as a duplicate of this bug. ***

Comment 5 John Mazzitelli 2013-10-07 19:18:43 UTC
Interesting. I let the server and agent just run for a while and eventually the resources went green! I don't know if there is some sort of timeout in the JBossAS client code that eventually cleans up bad connections or what, but it DID self-correct, just not in a timely fashion.

I'm gonna run another test and watch this carefully to see when exactly it self-corrects.

Comment 6 John Mazzitelli 2013-10-07 20:10:59 UTC
OK, this fix looks like it's working. The resources take a while to go green again because they are services, and availability checks for services run only every 10 minutes.

So it appears we do have this workaround for the JBossAS client bugs - I'll commit this code to master soon.
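
To put a rough number on "a while": the arithmetic below is a sketch under two assumptions that are not confirmed anywhere in this bug. First, checks fire at fixed multiples of the interval. Second, the first check after the connection breaks only detects the problem and reports DOWN, and the check after that reconnects and reports UP. RecoveryDelay and minutesUntilGreen are hypothetical names.

```java
// Hypothetical helper; the scheduling model is an assumption, see above.
final class RecoveryDelay {
    /**
     * Minutes between the connection breaking and the first check that can
     * report UP again, for checks running at 0, interval, 2*interval, ...
     * Both arguments are assumed to be non-negative, interval > 0.
     */
    static long minutesUntilGreen(long failureMinute, long intervalMinutes) {
        // First check at or after the failure: it sees the bad connection.
        long failedCheck =
            ((failureMinute + intervalMinutes - 1) / intervalMinutes) * intervalMinutes;
        // The following check is the earliest one that can succeed.
        long recoveringCheck = failedCheck + intervalMinutes;
        return recoveringCheck - failureMinute;
    }
}
```

Under these assumptions, a connection that breaks at minute 3 with a 10-minute interval is detected at minute 10 and recovers at minute 20, so the resource is affected for 17 minutes. In general the delay falls between one and two intervals.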

Comment 7 John Mazzitelli 2013-10-07 21:20:58 UTC
git commit to master: 1558f53