Description of problem:

See this thread: https://community.jboss.org/thread/221879?tstart=0

In short, the RHQ server attempts to *synchronously* reach the agent using JBoss Remoting even though the agent is down, and in fact the IP address may be unreachable. It may take minutes before the server gives up and allows the resource to be uninventoried. In the meantime, several database updates and row locks are held, which could cause issues.

Version-Release number of selected component (if applicable): 4.5.1

Steps to Reproduce:
1. Start an agent on a remote IP address
2. Add it
3. Then shut down that IP (make it unreachable)
4. Attempt to uninventory that resource

Additional info:

A couple of fixes, from easy to complex:
1) Reduce the communication timeout from minutes to, say, 10 seconds. Yes, there could be inconsistencies, but probably just a few more than before.
2) Examine the agent state. If it isn't connected, simply uninventory the resource without sending a request.
3) First update the database table. Then use an MDB to handle the call to the agent using whatever timeout you prefer. Basically, this is the 'eventually consistent' fix.
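To make fix 1 concrete, here is a minimal sketch of bounding a blocking call with a timeout using plain java.util.concurrent. It is not RHQ code: the class BoundedAgentCall, the method callWithTimeout, and the Callable standing in for the remote agent call are all hypothetical names chosen for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper (not part of RHQ): bounds a blocking agent call with a timeout.
class BoundedAgentCall {
    // Daemon threads so that a hung call cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "agent-call");
        t.setDaemon(true);
        return t;
    });

    /** Returns true if the call completed in time, false if it timed out or failed. */
    static boolean callWithTimeout(Callable<Void> agentCall, long timeoutMillis) {
        Future<Void> future = POOL.submit(agentCall);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker that is still waiting
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            future.cancel(true);
            return false;
        } catch (ExecutionException e) {
            return false; // the call itself threw; treat as "agent not informed"
        }
    }
}
```

With a 10 second limit, the server-side transaction would wait at most that long for the agent instead of minutes. The trade-off is the one noted above: a slow but healthy agent may miss the notification.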
Elias, (I commented on the thread) In 4.8/master I saw an immediate return and no waiting. Do you have a chance to try that?
The test case would be to have the agent belong to an address no longer reachable, basically an IP that hangs when you ssh to the machine. I can retest with 4.8.
I've seen this issue with agents that are up but not connecting (like they are no longer registered to the server with the right key).

2013-07-22 22:15:17,719 ERROR [org.rhq.enterprise.communications.command.client.ClientCommandSenderTask] {ClientCommandSenderTask.send-failed} Failed to send command [Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.send-throttle=true}]; params=[{invocation=NameBasedInvocation[uninventoryResource], targetInterfaceName=org.rhq.core.clientapi.agent.discovery.DiscoveryAgentService}]]. Cause: java.util.concurrent.TimeoutException:null. Cause: java.util.concurrent.TimeoutException
2013-07-22 22:15:17,720 WARN  [org.rhq.enterprise.server.resource.ResourceManagerBean] Unable to inform agent of inventory removal for resource [218272]
java.lang.reflect.UndeclaredThrowableException
        at $Proxy22530.uninventoryResource(Unknown Source)
        at org.rhq.enterprise.server.resource.ResourceManagerBean.uninventoryResource(ResourceManagerBean.java:368)
        at org.rhq.enterprise.server.resource.ResourceManagerBean.uninventoryResources(ResourceManagerBean.java:259)
        ...
Caused by: java.util.concurrent.TimeoutException
        at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:228)
        at java.util.concurrent.FutureTask.get(FutureTask.java:91)
        at org.rhq.enterprise.communications.command.client.ClientCommandSenderTask.run(ClientCommandSenderTask.java:143)
        at org.rhq.enterprise.communications.command.client.ClientCommandSender.sendSynch(ClientCommandSender.java:647)
        at org.rhq.enterprise.communications.command.client.ClientRemotePojoFactory$RemotePojoProxyHandler.invoke(ClientRemotePojoFactory.java:407)
        ... 127 more
diff --git a/modules/enterprise/server/jar/src/main/java/org/rhq/enterprise/server/resource/ResourceManagerBean.java b/modules/enterprise/server/jar/src/main/java/org/rhq/enterpr
index 3b65ded..9e133b4 100644
--- a/modules/enterprise/server/jar/src/main/java/org/rhq/enterprise/server/resource/ResourceManagerBean.java
+++ b/modules/enterprise/server/jar/src/main/java/org/rhq/enterprise/server/resource/ResourceManagerBean.java
@@ -437,7 +437,10 @@ public void uninventoryAllResourcesByAgent(Subject user, Agent doomedAgent) {
         if (agentClient.getAgent() == null || agentClient.getAgent().getName() == null
             || !agentClient.getAgent().getName().startsWith(ResourceHandlerBean.DUMMY_AGENT_NAME_PREFIX)) { // don't do that on "REST-agents"
             try {
-                agentClient.getDiscoveryAgentService().uninventoryResource(resourceId);
+                boolean agentPing = agentClient.ping(5000L);
+                if (agentPing) {
+                    agentClient.getDiscoveryAgentService().uninventoryResource(resourceId);
+                }
             } catch (Exception e) {
                 log.warn(" Unable to inform agent of inventory removal for resource [" + resourceId + "]", e);
             }

Wouldn't something like this be better?
One simple case is to have an agent started that can't register (wrong token) and try to uninventory it from the server. The UI will hang. When the agent is shut down again, the system is unstuck. I've seen uninventory cause the database to hang in many cases; for example, when going to the Inventory -> resources/servers/platforms screens, nothing displays at all. Availability reports are then lost and agents are marked as down. This is really problematic.
I've seen this pattern elsewhere in the code, so I'm guessing it is fine here. I have a version (not tested) using JMS/MDB to do the uninventory in a new transaction, which really is preferable (especially if the agent is alive but does hang), but I can probably live with the ping. (JMS is sort of unwieldy but the better design.) One caveat with a ping: If the agent is listening (but with the wrong token) the ping succeeds and then the uninventory hangs.
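For comparison, the "update the database first, notify the agent later" idea can be sketched without JMS. In the sketch below an in-memory set stands in for the database and an ExecutorService stands in for the queue plus MDB. Every name here (AsyncUninventory, agentNotifier, and so on) is hypothetical and only illustrates the ordering, not the untested JMS/MDB version mentioned above.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.function.IntConsumer;

// Hypothetical stand-in for the JMS/MDB design; none of these names are RHQ APIs.
class AsyncUninventory {
    private final Set<Integer> inventory = ConcurrentHashMap.newKeySet(); // stands in for the database
    private final ExecutorService notifier;   // stands in for the JMS queue and MDB
    private final IntConsumer agentNotifier;  // stands in for the remote agent call

    AsyncUninventory(ExecutorService notifier, IntConsumer agentNotifier) {
        this.notifier = notifier;
        this.agentNotifier = agentNotifier;
    }

    void add(int resourceId) {
        inventory.add(resourceId);
    }

    boolean contains(int resourceId) {
        return inventory.contains(resourceId);
    }

    void uninventory(int resourceId) {
        // Step 1: apply the server-side change first, with no agent I/O in this step.
        inventory.remove(resourceId);
        // Step 2: hand the agent notification to another thread; the caller does not wait for it.
        notifier.submit(() -> {
            try {
                agentNotifier.accept(resourceId);
            } catch (RuntimeException e) {
                System.err.println("Unable to inform agent of inventory removal for resource ["
                    + resourceId + "]: " + e);
            }
        });
    }
}
```

The point of the ordering is that a hung agent only ties up a notifier thread, not the transaction holding the row locks. A real implementation would still need a timeout or retry policy on the notification side.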
Created attachment 864860
2 second timeout for uninventory

Not sure if 2 seconds is too long or too short.
I'm not sure we want to put a timeout on the service itself. That timeout then gets applied to the remote calls it makes: if the uninventory was actually working but took more than 2s, it would time out.

I'd rather we go after the other issues here. Make sure the agent kills itself if it has a bad token. And don't use ping() as a check for agent liveness, since it only checks whether the comm layer is up, not that the agent is actually servicing requests.

Perhaps we add something more like:

    if (agentClient.getPingAgentService(2000L).ping()) {
        agentClient.getDiscoveryAgentService().uninventoryResource(resourceId);
    }

That is, use a new service that offers a ping but goes through the proxy code that ensures the agent is really ready to service requests.
The issue with the rogue agent (not killing itself) is Bug 987628, which I have a fix for. The problem is that the ping will actually come back if the agent is running, even though the agent may not in fact be responding to an uninventory. Maybe fixing Bug 987628 will make more sense, then?

commit d6950dfd6cb4b5ac7c77b8248e12eb739bc6895a
Author: Elias Ross <elias_ross>
Date:   Mon Dec 2 14:06:54 2013 -0800

    BZ 918207 - resource manager; sometimes hangs on uninventory

    Set timeout on uninventory method to 5 seconds

diff --git a/modules/enterprise/server/jar/src/main/java/org/rhq/enterprise/server/resource/ResourceManagerBean.java b/modules/enterprise/server/jar/src/main/
index 6bb586e..d6e887d 100644
--- a/modules/enterprise/server/jar/src/main/java/org/rhq/enterprise/server/resource/ResourceManagerBean.java
+++ b/modules/enterprise/server/jar/src/main/java/org/rhq/enterprise/server/resource/ResourceManagerBean.java
@@ -453,7 +453,7 @@ public void uninventoryAllResourcesByAgent(Subject user, Agent doomedAgent) {
         if (agentClient.getAgent() == null || agentClient.getAgent().getName() == null
             || !agentClient.getAgent().getName().startsWith(ResourceHandlerBean.DUMMY_AGENT_NAME_PREFIX)) { // don't do that on "REST-agents"
             try {
-                agentClient.getDiscoveryAgentService().uninventoryResource(resourceId);
+                agentClient.getDiscoveryAgentService(5000L).uninventoryResource(resourceId);
             } catch (Exception e) {
                 log.warn(" Unable to inform agent of inventory removal for resource [" + resourceId + "]", e);
             }
Right, I see the attraction of the change above, but it exposes us to timing out a properly working (albeit slow) uninventory request. The suggested new ping service would let us ensure the agent was actually responding to service requests and didn't just have an active comm link. So, where the current ping would succeed if the agent was running but dead in the water, the new one would time out in 2s. If it returned successfully, we could then call uninventory with no timeout override. I think mazz is reviewing the agent patches for Bug 987628, and I'm going to put this enhanced ping in place...
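A rough sketch of that guard is below. The two functional parameters are hypothetical stand-ins for the service-level ping and for the remote uninventoryResource call, since the enhanced ping does not exist yet at this point in the thread. GuardedUninventory and notifyAgent are illustrative names, not RHQ APIs.

```java
import java.util.function.IntConsumer;
import java.util.function.LongPredicate;

// Hypothetical sketch of "service ping with a short timeout, then uninventory with no timeout".
class GuardedUninventory {
    // Ping timeout taken from the discussion above; the right value is still an open question.
    static final long PING_TIMEOUT_MILLIS = 2000L;

    /**
     * @param servicePing     assumed to return true only if the agent answers a
     *                        service-level ping within the given timeout (milliseconds)
     * @param uninventoryCall stand-in for the remote uninventoryResource call
     * @return true if the agent was notified, false if the notification was skipped or failed
     */
    static boolean notifyAgent(LongPredicate servicePing, IntConsumer uninventoryCall, int resourceId) {
        try {
            if (!servicePing.test(PING_TIMEOUT_MILLIS)) {
                return false; // agent is not servicing requests, so skip the notification
            }
            // The agent answered the service ping, so this call carries no timeout override.
            uninventoryCall.accept(resourceId);
            return true;
        } catch (RuntimeException e) {
            System.err.println("Unable to inform agent of inventory removal for resource ["
                + resourceId + "]: " + e);
            return false;
        }
    }
}
```

This keeps the short timeout on the cheap ping only, so a slow but working uninventory request is not cut off.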
Elias, hopefully these commits for 4.10 do what you're looking for (and possibly anything done for Bug 987628).

commit 32a59632e1eee8b2dc2d3c763416f39d2d1fec4a
Author: Jay Shaughnessy <jshaughn>
Date:   Wed Mar 5 13:44:38 2014 -0500

    There are two underlying issues in this BZ. First, agent notifications for
    uninventory are not protected via a short ping check. And second, our current
    AgentClient.ping() returns true if a comm link can be established, even if
    the agent is not accepting service requests. Since by default we don't
    timeout service requests, service requests can wait indefinitely for an
    agent that is failing to connect.

    This commit adds a new AgentClient.servicePing(long) which returns true only
    if the agent can be reached, is accepting service requests, and returns
    before the supplied timeout expires. It also adds a call to this in the
    uninventory SLSB, as well as converts existing ping() calls to pingService().

commit c8f4c35a4b956fb15ff85ecf6feed74c46be4887
Author: John Mazzitelli <mazz>
Date:   Wed Mar 5 14:31:11 2014 -0500

    since we have the information, let's use it to proactively warn the user
    that an agent's clock is probably skewed
Additional commit in master:

commit d8e07cc0b678e513e97d6321a1f6095f0b3a576c
Author: Thomas Segismont <tsegismo>
Date:   Fri Mar 7 17:24:04 2014 +0100

    Add missing agent service definition in org.rhq.core.pc.PluginContainer#services
Bulk closing of 4.10 issues. If an issue is not solved for you, please open a new BZ (or clone the existing one) with a version designator of 4.10.