918207 – Uninventorying resources for agents, on unreachable networks or waiting for registration, causes database hang

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	RHQ Project
Classification:	Other
Component:	Inventory
Sub Component:
Version:	4.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	high
Target Milestone:	GA
Target Release:	RHQ 4.10
Assignee:	RHQ Project Maintainer
QA Contact:	Mike Foley
Docs Contact:
URL:	https://community.jboss.org/thread/22...
Whiteboard:
Depends On:
Blocks:

Reported:	2013-03-05 17:39 UTC by Elias Ross
Modified:	2014-04-23 12:31 UTC
CC List:	3 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2014-04-23 12:31:17 UTC
Embargoed:

Attachments
2 second timeout for uninventory (1.56 KB, patch) 2014-02-19 00:17 UTC, Elias Ross

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1074465	0	unspecified	CLOSED	Agent availability is unknown and NoSuchMethodException is being thrown to the agent log	2021-02-22 00:41:40 UTC

Internal Links: 1074465

Description Elias Ross 2013-03-05 17:39:36 UTC

Description of problem:

See this thread: https://community.jboss.org/thread/221879?tstart=0

In short, the RHQ server attempts to *synchronously* reach the agent using JBoss remoting even though the agent is down, and in fact the IP address may be unreachable.

It may take minutes before the server gives up and allows the resource to be uninventoried. In the meantime, there are several database updates and row locks being held which could cause issues.


Version-Release number of selected component (if applicable):

4.5.1

Steps to Reproduce:
1. Start an agent on a remote IP address
2. Add it
3. Then shut down that IP (make it unreachable)
4. Attempt to uninventory that resource
  

Additional info:

A couple of fixes from easy to complex:
1) Reduce the communication timeout from minutes to, say, 10 seconds. Yes, there could be inconsistencies but probably just a few more than before.
2) Examine the agent state. If it isn't connected, then simply uninventory the resource without sending a request.
3) First update the database table. Then use an MDB to handle the call to the agent using whatever timeout you prefer. Basically, this is the 'eventually consistent' fix.
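
Option 3 could look roughly like the sketch below. This is not RHQ code: `AgentNotifier` and `DeferredUninventory` are hypothetical names, an in-memory set stands in for the inventory table, and a background executor stands in for the JMS queue and MDB. It only illustrates the ordering (local removal first, agent notification later).

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical stand-in for the remote call to the agent; not an RHQ interface. */
interface AgentNotifier {
    void uninventoryResource(int resourceId) throws Exception;
}

/** Sketch of "update the database first, inform the agent later". */
class DeferredUninventory {
    // In-memory stand-in for the inventory table.
    private final Set<Integer> inventory = ConcurrentHashMap.newKeySet();

    // Stand-in for the JMS queue plus MDB: one daemon worker thread.
    private final ExecutorService worker = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "agent-notifier");
        t.setDaemon(true); // do not keep the JVM alive for a stuck agent call
        return t;
    });

    private final AgentNotifier agent;

    DeferredUninventory(AgentNotifier agent) {
        this.agent = agent;
    }

    void addResource(int resourceId) {
        inventory.add(resourceId);
    }

    boolean isInventoried(int resourceId) {
        return inventory.contains(resourceId);
    }

    /**
     * Removes the resource locally and returns without waiting for the agent.
     * The returned future yields true if the agent was informed, false if the call failed.
     */
    Future<Boolean> uninventory(int resourceId) {
        inventory.remove(resourceId); // step 1: short local update, no remote call involved
        return worker.submit(() -> {  // step 2: best-effort notification in the background
            try {
                agent.uninventoryResource(resourceId);
                return true;
            } catch (Exception e) {
                return false; // agent and server stay inconsistent until a later sync
            }
        });
    }
}
```

In this sketch a hung agent call still occupies the single worker thread, so a real implementation would also bound that call with a timeout, as the option itself suggests.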

Comment 1 Heiko W. Rupp 2013-07-01 07:55:46 UTC

Elias,
(I commented on the thread.)

In 4.8/ master I saw an immediate return and no waiting -- do you have a chance to try that?

Comment 2 Elias Ross 2013-07-01 16:49:00 UTC

The test case would be to have the agent belong to an address no longer reachable, basically an IP that hangs when you ssh to the machine.

I can retest with 4.8.

Comment 3 Elias Ross 2013-07-22 22:20:43 UTC

I've seen this issue with agents that are up but not connecting (like they are no longer registered to the server with the right key.)

2013-07-22 22:15:17,719 ERROR [org.rhq.enterprise.communications.command.client.ClientCommandSenderTask] {ClientCommandSenderTask.send-failed}
Failed to send command [Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.send-throttle=true}]; params=[{invocation=NameBasedInvocation[uninventoryResource], targetInterfaceName=org.rhq.core.clientapi.agent.discovery.DiscoveryAgentService}]]. Cause: java.util.concurrent.TimeoutException:null. Cause: java.util.concurrent.TimeoutException
2013-07-22 22:15:17,720 WARN  [org.rhq.enterprise.server.resource.ResourceManagerBean]  Unable to inform agent of inventory removal for resource [218272]
java.lang.reflect.UndeclaredThrowableException
        at $Proxy22530.uninventoryResource(Unknown Source)
        at org.rhq.enterprise.server.resource.ResourceManagerBean.uninventoryResource(ResourceManagerBean.java:368)
        at org.rhq.enterprise.server.resource.ResourceManagerBean.uninventoryResources(ResourceManagerBean.java:259)

...

Caused by: java.util.concurrent.TimeoutException
        at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:228)
        at java.util.concurrent.FutureTask.get(FutureTask.java:91)
        at org.rhq.enterprise.communications.command.client.ClientCommandSenderTask.run(ClientCommandSenderTask.java:143)
        at org.rhq.enterprise.communications.command.client.ClientCommandSender.sendSynch(ClientCommandSender.java:647)
        at org.rhq.enterprise.communications.command.client.ClientRemotePojoFactory$RemotePojoProxyHandler.invoke(ClientRemotePojoFactory.java:407)
        ... 127 more

Comment 4 Elias Ross 2013-09-16 20:17:41 UTC

diff --git a/modules/enterprise/server/jar/src/main/java/org/rhq/enterprise/server/resource/ResourceManagerBean.java b/modules/enterprise/server/jar/src/main/java/org/rhq/enterpr
index 3b65ded..9e133b4 100644
--- a/modules/enterprise/server/jar/src/main/java/org/rhq/enterprise/server/resource/ResourceManagerBean.java
+++ b/modules/enterprise/server/jar/src/main/java/org/rhq/enterprise/server/resource/ResourceManagerBean.java
@@ -437,7 +437,10 @@ public void uninventoryAllResourcesByAgent(Subject user, Agent doomedAgent) {
                 if (agentClient.getAgent() == null || agentClient.getAgent().getName() == null
                     || !agentClient.getAgent().getName().startsWith(ResourceHandlerBean.DUMMY_AGENT_NAME_PREFIX)) { // don't do that on "REST-agents"
                     try {
-                        agentClient.getDiscoveryAgentService().uninventoryResource(resourceId);
+                        boolean agentPing = agentClient.ping(5000L);
+                        if (agentPing) {
+                            agentClient.getDiscoveryAgentService().uninventoryResource(resourceId);
+                        }
                     } catch (Exception e) {
                         log.warn(" Unable to inform agent of inventory removal for resource [" + resourceId + "]", e);
                     }


Wouldn't something like this be better?

Comment 5 Elias Ross 2013-10-29 21:15:33 UTC

One simple case is to have an agent started that can't register (wrong token) and try to uninventory it from the server. The UI will hang. When the agent is shut down again, the system is unstuck.

I've seen uninventory in many cases causing the database to hang, for example going to Inventory -> resources/servers/platforms screens, nothing will display at all. Then what happens is availability reports are lost and agents are marked as down.

This is really problematic.

Comment 6 Elias Ross 2013-11-14 07:52:40 UTC

I've seen this pattern elsewhere in the code, so I'm guessing it is fine here.

I have a version (not tested) using JMS/MDB to do the uninventory in a new transaction, which really is preferable (especially if the agent is alive but does hang), but I can probably live with the ping. (JMS is sort of unwieldy but the better design.)

One caveat with a ping: If the agent is listening (but with the wrong token) the ping succeeds and then the uninventory hangs.

Comment 7 Elias Ross 2014-02-19 00:17:19 UTC

Created attachment 864860 [details]
2 second timeout for uninventory

Not sure if 2 seconds is too long or too short.

Comment 8 Jay Shaughnessy 2014-02-19 22:57:45 UTC

I'm not sure we want to put a timeout on the service itself. That then gets applied to the remote calls it makes.  If the uninventory actually was working but took more than 2s it would time out.

I'd rather maybe we go after the other issues here: Make sure the agent kills itself if it has a bad token.  And not use ping() as a check for agent live-ness since it's only checking whether the comm layer is up, but not that the agent is actually servicing requests.  Perhaps we add something more like:

if ( agentClient.getPingAgentService(2000L).ping() ) {
  agentClient.getDiscoveryAgentService().uninventoryResource(resourceId);
}

That is, use a new service that offers a ping but goes through the proxy code that ensures the agent is really ready to service requests.
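
To make the distinction concrete, here is a minimal sketch of the guard. `SimpleAgentClient`, `GuardedUninventory` and `UnregisteredAgentStub` are hypothetical names invented for illustration; they are not the RHQ `AgentClient` API.

```java
/** Hypothetical, simplified view of an agent client; not the RHQ AgentClient API. */
interface SimpleAgentClient {
    /** True if the comm layer answers, whether or not requests are being serviced. */
    boolean commPing(long timeoutMillis);

    /** True only if a request sent through the normal service proxy answers in time. */
    boolean servicePing(long timeoutMillis);

    /** Blocking remote call with no timeout of its own. */
    void uninventoryResource(int resourceId);
}

class GuardedUninventory {
    /** Returns true if the agent was informed, false if the notification was skipped. */
    static boolean notifyAgent(SimpleAgentClient client, int resourceId) {
        if (!client.servicePing(2000L)) {
            return false; // agent is down, unreachable, or not servicing requests
        }
        // No timeout override here: a slow but working uninventory is allowed to finish.
        client.uninventoryResource(resourceId);
        return true;
    }
}

/** Stub for an agent that is listening but not registered (for example, wrong token). */
class UnregisteredAgentStub implements SimpleAgentClient {
    public boolean commPing(long timeoutMillis) {
        return true; // the comm link is up
    }

    public boolean servicePing(long timeoutMillis) {
        return false; // but service requests are not answered
    }

    public void uninventoryResource(int resourceId) {
        // Stands in for the call that would block indefinitely.
        throw new IllegalStateException("would hang");
    }
}
```

With the stub, `GuardedUninventory.notifyAgent(new UnregisteredAgentStub(), 218272)` returns false without ever reaching the blocking call, whereas a guard based on `commPing` would have gone on to call it.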

Comment 9 Elias Ross 2014-02-28 21:50:43 UTC

The issue with the rogue agent (not killing itself) is Bug 987628, which I have a fix for.

The problem is the ping will actually come back if the agent is running but may not be in fact responding to an uninventory. Maybe fixing Bug 987628 will make more sense, then?

commit d6950dfd6cb4b5ac7c77b8248e12eb739bc6895a
Author: Elias Ross <elias_ross>
Date:   Mon Dec 2 14:06:54 2013 -0800

    BZ 918207 - resource manager; sometimes hangs on uninventory
    
    Set timeout on uninventory method to 5 seconds

diff --git a/modules/enterprise/server/jar/src/main/java/org/rhq/enterprise/server/resource/ResourceManagerBean.java b/modules/enterprise/server/jar/src/main/
index 6bb586e..d6e887d 100644
--- a/modules/enterprise/server/jar/src/main/java/org/rhq/enterprise/server/resource/ResourceManagerBean.java
+++ b/modules/enterprise/server/jar/src/main/java/org/rhq/enterprise/server/resource/ResourceManagerBean.java
@@ -453,7 +453,7 @@ public void uninventoryAllResourcesByAgent(Subject user, Agent doomedAgent) {
                 if (agentClient.getAgent() == null || agentClient.getAgent().getName() == null
                     || !agentClient.getAgent().getName().startsWith(ResourceHandlerBean.DUMMY_AGENT_NAME_PREFIX)) { // don't do that on "REST-agents"
                     try {
-                        agentClient.getDiscoveryAgentService().uninventoryResource(resourceId);
+                        agentClient.getDiscoveryAgentService(5000L).uninventoryResource(resourceId);
                     } catch (Exception e) {
                         log.warn(" Unable to inform agent of inventory removal for resource [" + resourceId + "]", e);
                     }

Comment 10 Jay Shaughnessy 2014-03-04 22:09:32 UTC

Right, I see the attraction of the change above, but it exposes us to timing out an uninventory request that is working properly (albeit slowly).  The suggested new ping service above would let us ensure the agent is actually responding to service requests and doesn't just have an active comm link.  So, where the current ping would succeed if the agent was running but dead in the water, the one above would time out in 2s. If it returned successfully, we could then call uninventory with no timeout override.

I think mazz is reviewing the agent patches for Bug 987628 and I'm going to put this enhanced ping in place...

Comment 11 Jay Shaughnessy 2014-03-05 20:41:16 UTC

Elias, hopefully these commits for 4.10 do what you're looking for (and possibly anything done for Bug 987628).


commit 32a59632e1eee8b2dc2d3c763416f39d2d1fec4a
Author: Jay Shaughnessy <jshaughn>
Date:   Wed Mar 5 13:44:38 2014 -0500

  There are two underlying issues in this BZ. First, agent notifications
  for uninventory are not protected via a short ping check. And second, our
  current AgentClient.ping() returns true if a comm link can be established,
  even if the agent is not accepting service requests.  Since by default we
  don't timeout service requests, service requests can wait indefinitely for
  an agent that is failing to connect.

  This commit adds a new AgentClient.servicePing(long) which returns true
  only if the agent can be reached, is accepting service requests, and returns
  before the supplied timeout expires.  It also adds a call to this in the
  uninventory SLSB, as well as converts existing ping() calls to pingService().
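
For illustration only, a time-bounded ping of the kind described in this commit message can be sketched with a `Future` and a timeout, which is the same mechanism visible in the stack trace in comment 3. `BoundedPing` and the `remotePing` callable are hypothetical; this is not the code from the commit.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class BoundedPing {
    // Daemon threads so that a ping stuck on an unreachable host cannot block JVM exit.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "bounded-ping");
        t.setDaemon(true);
        return t;
    });

    /**
     * Runs the (hypothetical) remote ping and returns true only if it answers
     * true before the timeout expires. Timeouts and failures both yield false.
     */
    static boolean servicePing(Callable<Boolean> remotePing, long timeoutMillis) {
        Future<Boolean> result = POOL.submit(remotePing);
        try {
            return Boolean.TRUE.equals(result.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            result.cancel(true);
            return false;
        } catch (Exception e) { // TimeoutException or ExecutionException
            result.cancel(true); // ask the stuck ping to stop
            return false;
        }
    }
}
```

A caller would pass the real service-level ping as the callable, for example `BoundedPing.servicePing(() -> pingTheAgent(), 2000L)`, where `pingTheAgent` is whatever request actually goes through the agent's service layer.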

commit c8f4c35a4b956fb15ff85ecf6feed74c46be4887
Author: John Mazzitelli <mazz>
Date:   Wed Mar 5 14:31:11 2014 -0500

  since we have the information, let's use it to proactively warn the user that
  an agent's clock is probably skewed

Comment 12 Thomas Segismont 2014-03-07 16:27:32 UTC

Additional commit in master

commit d8e07cc0b678e513e97d6321a1f6095f0b3a576c
Author: Thomas Segismont <tsegismo>
Date:   Fri Mar 7 17:24:04 2014 +0100

Add missing agent service definition in org.rhq.core.pc.PluginContainer#services

Comment 13 Heiko W. Rupp 2014-04-23 12:31:17 UTC

Bulk closing of 4.10 issues.

If an issue is not solved for you, please open a new BZ (or clone the existing one) with a version designator of 4.10.
