Description of problem:

I ran into this on an EC2 instance. The agent has been running for a couple of weeks and it has hosed the entire box. The JVM process has over 8,000 native threads and the thread dump is 12 megs. They seem to pretty much all be ClientCommandSenderTask threads, and the JVM did detect a deadlock. This is the JON 2.3GA build plus some small unrelated server patches.

The java version is:

java version "1.7.0"
IcedTea Runtime Environment (build 1.7.0-b21)
IcedTea Client VM (build 1.7.0-b21, mixed mode)

An example of the 8k threads:

"ClientCommandSenderTask Timer Thread #33684" daemon prio=10 tid=0x0f292800 nid=0x2f3b waiting for monitor entry [0x12afa000..0x12afae84]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.invokeInitializeCallbackIfNeeded(JBossRemotingRemoteCommunicator.java:539)
    - waiting to lock <0xaca58ac8> (a [Z)
    at org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.send(JBossRemotingRemoteCommunicator.java:466)
    at org.rhq.enterprise.communications.command.client.AbstractCommandClient.invoke(AbstractCommandClient.java:143)
    at org.rhq.enterprise.communications.command.client.ClientCommandSender.send(ClientCommandSender.java:1087)
    at org.rhq.enterprise.communications.command.client.ClientCommandSenderTask.send(ClientCommandSenderTask.java:229)
    at org.rhq.enterprise.communications.command.client.ClientCommandSenderTask.call(ClientCommandSenderTask.java:107)
    at org.rhq.enterprise.communications.command.client.ClientCommandSenderTask.call(ClientCommandSenderTask.java:55)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:675)

This is what the JVM had to say about the deadlock:

JNI global references: 856

Found one Java-level deadlock:
=============================
"ClientCommandSenderTask Timer Thread #34235":
  waiting to lock monitor 0x0ac31118 (object 0xaca58ac8, a [Z),
  which is held by "ClientCommandSenderTask Timer Thread #25705"
"ClientCommandSenderTask Timer Thread #25705":
  waiting to lock monitor 0x0aaa79d0 (object 0xac9fa090, a [J),
  which is held by "ClientCommandSenderTask Timer Thread #25688"
"ClientCommandSenderTask Timer Thread #25688":
  waiting for ownable synchronizer 0xaca18bb8, (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
  which is held by "ClientCommandSenderTask Timer Thread #25705"

Java stack information for the threads listed above:
===================================================
"ClientCommandSenderTask Timer Thread #34235":
    at org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.invokeInitializeCallbackIfNeeded(JBossRemotingRemoteCommunicator.java:539)
    - waiting to lock <0xaca58ac8> (a [Z)
    at org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.send(JBossRemotingRemoteCommunicator.java:466)
    at org.rhq.enterprise.communications.command.client.AbstractCommandClient.invoke(AbstractCommandClient.java:143)
    at org.rhq.enterprise.communications.command.client.ClientCommandSender.send(ClientCommandSender.java:1087)
    at org.rhq.enterprise.communications.command.client.ClientCommandSenderTask.send(ClientCommandSenderTask.java:229)
    at org.rhq.enterprise.communications.command.client.ClientCommandSenderTask.call(ClientCommandSenderTask.java:107)
    at org.rhq.enterprise.communications.command.client.ClientCommandSenderTask.call(ClientCommandSenderTask.java:55)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:675)

"ClientCommandSenderTask Timer Thread #25705":
    at org.rhq.enterprise.agent.AgentMain.failoverToNewServer(AgentMain.java:1823)
    - waiting to lock <0xac9fa090> (a [J)
    at org.rhq.enterprise.agent.FailoverFailureCallback.failureDetected(FailoverFailureCallback.java:104)
    at org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.invokeFailureCallbackIfNeeded(JBossRemotingRemoteCommunicator.java:580)
    at org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.sendWithoutInitializeCallback(JBossRemotingRemoteCommunicator.java:453)
    at org.rhq.enterprise.agent.AgentMain.sendConnectRequestToServer(AgentMain.java:1943)
    at org.rhq.enterprise.agent.ConnectAgentInitializeCallback.sendingInitialCommand(ConnectAgentInitializeCallback.java:43)
    at org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.invokeInitializeCallbackIfNeeded(JBossRemotingRemoteCommunicator.java:542)
    - locked <0xaca58ac8> (a [Z)
    at org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.send(JBossRemotingRemoteCommunicator.java:466)
    at org.rhq.enterprise.communications.command.client.AbstractCommandClient.invoke(AbstractCommandClient.java:143)
    at org.rhq.enterprise.communications.command.client.ClientCommandSender.send(ClientCommandSender.java:1087)
    at org.rhq.enterprise.communications.command.client.ClientCommandSenderTask.send(ClientCommandSenderTask.java:229)
    at org.rhq.enterprise.communications.command.client.ClientCommandSenderTask.call(ClientCommandSenderTask.java:107)
    at org.rhq.enterprise.communications.command.client.ClientCommandSenderTask.call(ClientCommandSenderTask.java:55)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:675)

"ClientCommandSenderTask Timer Thread #25688":
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0xaca18bb8> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireNanos(AbstractQueuedSynchronizer.java:864)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireNanos(AbstractQueuedSynchronizer.java:1178)
    at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.tryLock(ReentrantReadWriteLock.java:1009)
    at org.rhq.enterprise.agent.AgentMain.sendConnectRequestToServer(AgentMain.java:1927)
    at org.rhq.enterprise.agent.AgentMain.switchCommServer(AgentMain.java:1880)
    at org.rhq.enterprise.agent.AgentMain.failoverToNewServer(AgentMain.java:1838)
    - locked <0xac9fa090> (a [J)
    at org.rhq.enterprise.agent.FailoverFailureCallback.failureDetected(FailoverFailureCallback.java:104)
    at org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.invokeFailureCallbackIfNeeded(JBossRemotingRemoteCommunicator.java:580)
    at org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.sendWithoutInitializeCallback(JBossRemotingRemoteCommunicator.java:453)
    at org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.send(JBossRemotingRemoteCommunicator.java:471)
    at org.rhq.enterprise.communications.command.client.AbstractCommandClient.invoke(AbstractCommandClient.java:143)
    at org.rhq.enterprise.communications.command.client.ClientCommandSender.send(ClientCommandSender.java:1087)
    at org.rhq.enterprise.communications.command.client.ClientCommandSenderTask.send(ClientCommandSenderTask.java:229)
    at org.rhq.enterprise.communications.command.client.ClientCommandSenderTask.call(ClientCommandSenderTask.java:107)
    at org.rhq.enterprise.communications.command.client.ClientCommandSenderTask.call(ClientCommandSenderTask.java:55)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:675)

Found 1 deadlock.

Heap
 def new generation   total 9088K, used 1311K [0xabfc0000, 0xac990000, 0xac990000)
  eden space 8128K,   9% used [0xabfc0000, 0xac084fe0, 0xac7b0000)
  from space 960K,   54% used [0xac8a0000, 0xac922dd0, 0xac990000)
  to   space 960K,    0% used [0xac7b0000, 0xac7b0000, 0xac8a0000)
 tenured generation   total 121024K, used 116706K [0xac990000, 0xb3fc0000, 0xb3fc0000)
   the space 121024K,  96% used [0xac990000, 0xb3b88bb8, 0xb3b88c00, 0xb3fc0000)
 compacting perm gen  total 30464K, used 30250K [0xb3fc0000, 0xb5d80000, 0xb7fc0000)
   the space 30464K,  99% used [0xb3fc0000, 0xb5d4aa20, 0xb5d4ac00, 0xb5d80000)
No shared spaces configured.
Do you see this message in the agent logs (probably a lot of times)?

"The initialize callback has failed. It will be tried again. Cause: {0}"

That would mean the "connect" message is failing to go out, and everything else will block behind it. This stuff is going to block until that connect message succeeds. It's blocked here:

    synchronized (m_needToCallInitializeCallback) {

see: org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.invokeInitializeCallbackIfNeeded(Command)

"If there is an initialize callback set, this method will block all callers until the callback has been invoked."

I guess we need to have a timeout in here and abort after some large amount of time passes? I'll need to refactor this to use some java.util.concurrent class; right now it's just using the synchronized keyword, and that's what is blocking, so I'll need another way to block and time out. This is deep in the bowels of the comm code, so we will need to be careful mucking with it to make sure it's done correctly.

I still say this happened because the server was down or otherwise not accepting messages from the agent. If a server is down for a long time, this will happen. This code is localized pretty well, so we can refactor it to use a java.util.concurrent lock and have it time out after some really long time, just so it avoids blocking forever. I'm thinking an hour, or something on that order of magnitude: long enough to give the server time to come back up, not so long that it hoses your box like you are seeing, and not so short that it circumvents the "real" timeouts we normally have in our messages. I'll make this long timeout configurable :)
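Roughly the shape of that refactor, as a minimal sketch only: this is not the actual JBossRemotingRemoteCommunicator code, the class, method, and field names below are illustrative, and it only assumes a ReentrantReadWriteLock guarding the one-time "connect" callback.

import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative sketch, not the real RHQ comm code.
public class InitializeCallbackGate {
    private final ReentrantReadWriteLock gate = new ReentrantReadWriteLock();
    private boolean needToCallInitializeCallback = true; // guarded by the write lock

    /**
     * Runs the initialize callback (e.g. the "connect" message) at most once, but gives up
     * after waitMins instead of blocking the calling sender thread forever.
     * Returns false if the lock could not be acquired in time.
     */
    public boolean runInitializeCallbackIfNeeded(Runnable initializeCallback, long waitMins)
        throws InterruptedException {

        Lock writeLock = gate.writeLock();
        if (!writeLock.tryLock(waitMins, TimeUnit.MINUTES)) {
            return false; // caller would log an error and return a failed command response
        }
        try {
            if (needToCallInitializeCallback) {
                initializeCallback.run();
                needToCallInitializeCallback = false;
            }
            return true;
        } finally {
            writeLock.unlock();
        }
    }
}

The point is just that tryLock with a timeout gives blocked callers a way out, which a bare synchronized block does not; the real change is in the patch attached below.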
Should be triaged for 1.3.1
Temporarily adding the keyword "SubBug" so we can be sure we have accounted for all the bugs. keyword: new = Tracking + FutureFeature + SubBug
making sure we're not missing any bugs in rhq_triage
Created attachment 395191 [details] patch to timeout if the lock acquisition takes too long
Oops... we need to make sure we ignore the lock failure if the init callback doesn't need to be invoked:

diff --git a/modules/enterprise/comm/src/main/java/org/rhq/enterprise/communications/command/client/JBossRemotingRemoteCommunicator.java b/modules/enterprise/comm/src/main/java/org/rhq/enterprise/communications/command/client/JBossRemotingRemoteCommunicator.java
index 762b9b4..8c37b50 100644
--- a/modules/enterprise/comm/src/main/java/org/rhq/enterprise/communications/command/client/JBossRemotingRemoteCommunicator.java
+++ b/modules/enterprise/comm/src/main/java/org/rhq/enterprise/communications/command/client/JBossRemotingRemoteCommunicator.java
@@ -588,10 +588,11 @@ public class JBossRemotingRemoteCommunicator implements RemoteCommunicator {
                 writeLock.unlock();
             }
         } else {
-            m_needToCallInitializeCallback = true; // can't invoke callback, we'll want to still call it later
-            Throwable t = new Throwable("Initialize callback lock could not be acquired");
-            LOG.error(CommI18NResourceKeys.INITIALIZE_CALLBACK_FAILED, t.getMessage());
-            return new GenericCommandResponse(command, false, null, t);
+            if (m_needToCallInitializeCallback) {
+                Throwable t = new Throwable("Initialize callback lock could not be acquired");
+                LOG.error(CommI18NResourceKeys.INITIALIZE_CALLBACK_FAILED, t.getMessage());
+                return new GenericCommandResponse(command, false, null, t);
+            }
         }
     }
     return null;
errr... on second thought, perhaps we should always throw an exception on lock acquisition failure.
The following commits to master fix this issue:

60fa4314138911d607d6e95110987feb5a7b8fb7
1740b73d4ec9dddf1222fcf99b69a8164c090bbb
e1c760378fe3f3fdcbee42beef05f9930eb3fd65

The lock acquisition timeout defaults to 60 minutes. However, this is configurable if for some reason that isn't good enough. It's only configurable via a -D sysprop passed to the agent:

rhq.communications.initial-callback-lock-wait-mins

If it is not specified, or if the value specified is not a valid number, it will default to 60 minutes.
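For reference, the lookup described above has roughly this shape. This is an illustrative sketch only, not the actual RHQ source; only the property name and the 60-minute fallback come from the comment above, and the class and method names are made up.

// Hypothetical sketch of the sysprop lookup described above, not the real RHQ code.
final class InitializeCallbackConfig {
    private static final String LOCK_WAIT_SYSPROP = "rhq.communications.initial-callback-lock-wait-mins";
    private static final long DEFAULT_LOCK_WAIT_MINS = 60L;

    static long getInitializeCallbackLockWaitMins() {
        String value = System.getProperty(LOCK_WAIT_SYSPROP);
        if (value == null) {
            return DEFAULT_LOCK_WAIT_MINS; // sysprop not specified
        }
        try {
            return Long.parseLong(value.trim());
        } catch (NumberFormatException e) {
            return DEFAULT_LOCK_WAIT_MINS; // value is not a valid number, fall back
        }
    }
}

So, for example, starting the agent with -Drhq.communications.initial-callback-lock-wait-mins=120 would stretch the wait to two hours.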
qa -> skondkar
This is a difficult one to test. However, if we use Byteman (http://www.jboss.org/byteman) to artificially cause a deadlock in the code, we can force my lock timeout to trigger (if we see the lock timeout trigger, we know we are not going to block forever in the place where we blocked before).

1. Download the three attachments (byteman.jar, agent-byteman.rules and AGENT_SETUP.sh) into a single directory.
2. Edit AGENT_SETUP.sh so the first line (setting RHQ_AGENT_HOME) points to your agent home directory (the rhq-agent directory).
3. Run AGENT_SETUP.sh.
4. When the agent starts, execute the "shutdown" command at the agent prompt.
5. Then execute the "start" command at the agent prompt. You will see the byteman rules output things like "CONNECTING TO THE SERVER" and "BLOCKING!".
6. Once you get the agent prompt back (it might take a minute before you see the "started>" prompt), execute the "sender start" command at the agent prompt. This forces the agent to start its internal sender (so it thinks it's connected to the server).
7. Let this run for a couple of minutes.
8. Go into the agent log file and confirm you see a log message like this:

2010-04-01 14:24:18,476 ERROR [ClientCommandSenderTask Timer Thread #4] (enterprise.communications.command.client.JBossRemotingRemoteCommunicator)- {JBossRemotingRemoteCommunicator.init-callback-failed}The initialize callback has failed. It will be tried again. Cause: Initialize callback lock could not be acquired

"Initialize callback lock could not be acquired" means this fix is working. What is happening is that we waited for the lock to be acquired, but (unlike before) after a certain period of time that wait is aborted and this error message is spit out.
Created attachment 404080 [details] byteman.jar
Created attachment 404081 [details] AGENT_SETUP.sh
Created attachment 404082 [details] agent-byteman.rules
Thanks for the detailed steps. Verified on master build #153 (Revision: d5c27afe5b24dca4c64c8e68a13f46a8a5d0dcd0). Followed the steps as provided in comment 10 and got the following log messages in the agent log file:

2010-04-07 17:49:02,292 ERROR [ClientCommandSenderTask Timer Thread #1] (enterprise.communications.command.client.JBossRemotingRemoteCommunicator)- {JBossRemotingRemoteCommunicator.init-callback-failed}The initialize callback has failed. It will be tried again. Cause: Initialize callback lock could not be acquired

2010-04-07 17:51:02,329 ERROR [RHQ Server Polling Thread] (enterprise.communications.command.client.JBossRemotingRemoteCommunicator)- {JBossRemotingRemoteCommunicator.init-callback-failed}The initialize callback has failed. It will be tried again. Cause: Initialize callback lock could not be acquired
Mass-closure of verified bugs against JON.