Bug 537396
Summary: | Agent out of memory with deadlock and thread runaway | |
---|---|---|---
Product: | [Other] RHQ Project | Reporter: | Greg Hinkle <ghinkle>
Component: | Agent | Assignee: | John Mazzitelli <mazz>
Status: | CLOSED CURRENTRELEASE | QA Contact: | Sunil Kondkar <skondkar>
Severity: | high | Priority: | medium
Version: | 1.4 | CC: | mazz
Hardware: | All | OS: | All
Keywords: | SubBug | Bug Blocks: | 565628
Fixed In Version: | 2.4 | Doc Type: | Bug Fix
Last Closed: | 2010-08-12 16:45:48 UTC | |
Description
Greg Hinkle
2009-11-13 13:25:56 UTC
Do you see this message in the agent logs (probably a lot of times)?

    The initialize callback has failed. It will be tried again. Cause: {0}

That would mean the "connect" message is failing to go out and everything else will block behind it. This stuff is going to block until that connect message is successful. It's blocked here:

    synchronized (m_needToCallInitializeCallback) {

See org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.invokeInitializeCallbackIfNeeded(Command): "If there is an initialize callback set, this method will block all callers until the callback has been invoked." I guess we need to have a timeout in here and abort after some large amount of time passes? I'll need to refactor this to use some java.util.concurrent class; right now it's just using the synchronized keyword and that's what is blocking, so I'll need another way to block and time out. This is deep in the bowels of this comm stuff, so we will need to be careful mucking with it to make sure it's done correctly.

I still say it is because the server was down or otherwise not accepting messages from the agent. If a server is down for a long time, this will happen. This code has been localized pretty well. We can refactor it to use a java.util.concurrent lock and have it time out after some really long time, just so it avoids blocking forever. I'm thinking of a timeout of about an hour, or something on that order of magnitude: long enough to give the server time to come back up, not so long that it craps out your box like you are seeing, but not so short that it circumvents the "real" timeouts we normally have in our messages. I'll make this long timeout configurable :)

Should be triaged for 1.3.1.

Temporarily adding the keyword "SubBug" so we can be sure we have accounted for all the bugs. keyword: new = Tracking + FutureFeature + SubBug. Making sure we're not missing any bugs in rhq_triage.

Created attachment 395191 [details]
patch to timeout if the lock acquisition takes too long
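To show the shape of the change the patch describes, here is a minimal sketch of swapping the blocking synchronized section for a java.util.concurrent lock that gives up after a long wait. This is an illustrative simplification, not the actual RHQ code; the class name, field names, and the callback placeholder are assumptions.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical, simplified stand-in for the real communicator class.
public class InitializeCallbackGuard {
    // Replaces "synchronized (m_needToCallInitializeCallback)" with a lock
    // whose acquisition can time out instead of blocking callers forever.
    private final ReentrantReadWriteLock m_initCallbackLock = new ReentrantReadWriteLock();
    private volatile boolean m_needToCallInitializeCallback = true;

    /**
     * Invokes the initialize callback if it is still needed. Returns false if the
     * lock could not be acquired within the configured wait, so the caller can
     * report a failure instead of hanging (and piling up threads) indefinitely.
     */
    public boolean invokeInitializeCallbackIfNeeded(long waitMins) throws InterruptedException {
        Lock writeLock = m_initCallbackLock.writeLock();
        if (!writeLock.tryLock(waitMins, TimeUnit.MINUTES)) {
            return false; // timed out: "Initialize callback lock could not be acquired"
        }
        try {
            if (m_needToCallInitializeCallback) {
                // ... send the "connect" message / run the real initialize callback here ...
                m_needToCallInitializeCallback = false; // only clear the flag on success
            }
            return true;
        } finally {
            writeLock.unlock();
        }
    }
}
```

The actual fix lives in JBossRemotingRemoteCommunicator (see the commits listed in the comments below); on a timeout it logs the init-callback-failed error and returns a failed GenericCommandResponse rather than a boolean.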
Oops... need to make sure we ignore the lock failure if the init callback doesn't need to be invoked:

    diff --git a/modules/enterprise/comm/src/main/java/org/rhq/enterprise/communications/command/client/JBossRemotingRemoteCommunicator.java b/modules/enterprise/comm/src/main/java/org/rhq/enterprise/communications/command/client/JBossRemotingRemoteCommunicator.java
    index 762b9b4..8c37b50 100644
    --- a/modules/enterprise/comm/src/main/java/org/rhq/enterprise/communications/command/client/JBossRemotingRemoteCommunicator.java
    +++ b/modules/enterprise/comm/src/main/java/org/rhq/enterprise/communications/command/client/JBossRemotingRemoteCommunicator.java
    @@ -588,10 +588,11 @@ public class JBossRemotingRemoteCommunicator implements RemoteCommunicator {
                     writeLock.unlock();
                 }
             } else {
    -            m_needToCallInitializeCallback = true; // can't invoke callback, we'll want to still call it later
    -            Throwable t = new Throwable("Initialize callback lock could not be acquired");
    -            LOG.error(CommI18NResourceKeys.INITIALIZE_CALLBACK_FAILED, t.getMessage());
    -            return new GenericCommandResponse(command, false, null, t);
    +            if (m_needToCallInitializeCallback) {
    +                Throwable t = new Throwable("Initialize callback lock could not be acquired");
    +                LOG.error(CommI18NResourceKeys.INITIALIZE_CALLBACK_FAILED, t.getMessage());
    +                return new GenericCommandResponse(command, false, null, t);
    +            }
             }
         }
         return null;

Errr... on second thought, perhaps we should always throw an exception on lock acquisition failure.

The following commits to master fix this issue:

    60fa4314138911d607d6e95110987feb5a7b8fb7
    1740b73d4ec9dddf1222fcf99b69a8164c090bbb
    e1c760378fe3f3fdcbee42beef05f9930eb3fd65

The lock acquisition timeout defaults to 60 minutes. However, this is configurable if for some reason that isn't good enough. It is only configurable via a -D sysprop passed to the agent: rhq.communications.initial-callback-lock-wait-mins. If it is not specified, or if the value specified is not a valid number, it will default to 60 minutes. (A sketch of reading this override appears below, after the test instructions.)

qa -> skondkar

This is a difficult one to test. However, if we use Byteman (http://www.jboss.org/byteman) to artificially cause a deadlock in the code, we can force my lock timeout to trigger (if we see the lock timeout trigger, we know we are not going to lock infinitely in the place where we locked before).

1. Download the three attachments (byteman.jar, agent-byteman.rules, and AGENT_SETUP.sh) into a single directory.
2. Edit AGENT_SETUP.sh so the first line (setting RHQ_AGENT_HOME) points to your agent home directory (the rhq-agent directory).
3. Run AGENT_SETUP.sh.
4. When the agent starts, execute the "shutdown" command at the agent prompt, then execute the "start" command. You will see the Byteman rules output things like "CONNECTING TO THE SERVER" and "BLOCKING!".
5. Once you get the agent prompt back (it might take a minute before you see the "started>" prompt), execute the "sender start" command at the agent prompt. This forces the agent to start its internal sender (so it thinks it is connected to the server).
6. Let this run for a couple of minutes.
7. Go into the agent log file and confirm you see a log message like this:

    2010-04-01 14:24:18,476 ERROR [ClientCommandSenderTask Timer Thread #4] (enterprise.communications.command.client.JBossRemotingRemoteCommunicator)- {JBossRemotingRemoteCommunicator.init-callback-failed}The initialize callback has failed. It will be tried again. Cause: Initialize callback lock could not be acquired

"Initialize callback lock could not be acquired" means this fix is working. What is happening is that we waited for the lock to be acquired, but (unlike before) after a certain period of time that wait is aborted and this error message is spit out.
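For reference, reading the rhq.communications.initial-callback-lock-wait-mins override described above could look roughly like the following. This is a hedged sketch only: the class and method names are hypothetical and this is not the actual RHQ parsing code; it just mirrors the documented behavior (missing or invalid values fall back to 60 minutes).

```java
public final class InitialCallbackLockWait {
    // Property name taken from the comment above; class and method names are hypothetical.
    private static final String PROP = "rhq.communications.initial-callback-lock-wait-mins";

    /** Returns the lock wait in minutes: the -D override if it parses, otherwise 60. */
    public static long getWaitMins() {
        String value = System.getProperty(PROP);
        if (value != null) {
            try {
                return Long.parseLong(value.trim());
            } catch (NumberFormatException e) {
                // not a valid number; ignore and use the default
            }
        }
        return 60L; // default documented above
    }
}
```

Starting the agent with -Drhq.communications.initial-callback-lock-wait-mins=120, for example, would raise the wait to two hours.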
Created attachment 404080 [details]
byteman.jar
Created attachment 404081 [details]
AGENT_SETUP.sh
Created attachment 404082 [details]
agent-byteman.rules
Thanks for the detailed steps. Verified on master build #153 (Revision: d5c27afe5b24dca4c64c8e68a13f46a8a5d0dcd0). Followed the steps as provided in comment 10 and got the following log messages in the agent log file:

    2010-04-07 17:49:02,292 ERROR [ClientCommandSenderTask Timer Thread #1] (enterprise.communications.command.client.JBossRemotingRemoteCommunicator)- {JBossRemotingRemoteCommunicator.init-callback-failed}The initialize callback has failed. It will be tried again. Cause: Initialize callback lock could not be acquired
    2010-04-07 17:51:02,329 ERROR [RHQ Server Polling Thread] (enterprise.communications.command.client.JBossRemotingRemoteCommunicator)- {JBossRemotingRemoteCommunicator.init-callback-failed}The initialize callback has failed. It will be tried again. Cause: Initialize callback lock could not be acquired

Mass-closure of verified bugs against JON.