Bug 1120422 - RHQ agent deadlock in RHQ Server Polling Thread
Summary: RHQ agent deadlock in RHQ Server Polling Thread
Keywords:
Status: NEW
Alias: None
Product: RHQ Project
Classification: Other
Component: Agent
Version: 4.12
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: RHQ 4.13
Assignee: RHQ Project Maintainer
QA Contact: Mike Foley
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2014-07-16 22:05 UTC by Elias Ross
Modified: 2014-07-21 22:47 UTC (History)
2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:


Attachments

Description Elias Ross 2014-07-16 22:05:14 UTC
Description of problem:


Found one Java-level deadlock:
=============================
"RHQ Server Polling Thread":
  waiting for ownable synchronizer 0x00000000e01bbc80, (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
  which is held by "ClientCommandSenderTask Timer Thread #10"
"ClientCommandSenderTask Timer Thread #10":
  waiting to lock monitor 0x00007f4b1443ed78 (object 0x00000000e0223d60, a [J),
  which is held by "ClientCommandSenderTask Timer Thread #4"
"ClientCommandSenderTask Timer Thread #4":
  waiting for ownable synchronizer 0x00000000e0223dc0, (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
  which is held by "ClientCommandSenderTask Timer Thread #10"

Java stack information for the threads listed above:
===================================================
"RHQ Server Polling Thread":
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00000000e01bbc80> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:196)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireNanos(AbstractQueuedSynchronizer.java:905)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireNanos(AbstractQueuedSynchronizer.java:1224)
	at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.tryLock(ReentrantReadWriteLock.java:976)
	at org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.invokeInitializeCallbackIfNeeded(JBossRemotingRemoteCommunicator.java:401)
	at org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.send(JBossRemotingRemoteCommunicator.java:323)
	at org.rhq.enterprise.communications.command.client.AbstractCommandClient.invoke(AbstractCommandClient.java:143)
	at org.rhq.enterprise.communications.command.client.ClientCommandSender.send(ClientCommandSender.java:1084)
	at org.rhq.enterprise.communications.command.client.ServerPollingThread.run(ServerPollingThread.java:100)
"ClientCommandSenderTask Timer Thread #10":
	at org.rhq.enterprise.agent.AgentMain.failoverToNewServer(AgentMain.java:2051)
	- waiting to lock <0x00000000e0223d60> (a [J)
	at org.rhq.enterprise.agent.FailoverFailureCallback.failureDetected(FailoverFailureCallback.java:104)
	at org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.invokeFailureCallbackIfNeeded(JBossRemotingRemoteCommunicator.java:457)
	at org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.sendWithoutInitializeCallback(JBossRemotingRemoteCommunicator.java:310)
	at org.rhq.enterprise.agent.AgentMain.sendConnectRequestToServer(AgentMain.java:2171)
	at org.rhq.enterprise.agent.ConnectAgentInitializeCallback.sendingInitialCommand(ConnectAgentInitializeCallback.java:43)
	at org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.invokeInitializeCallbackIfNeeded(JBossRemotingRemoteCommunicator.java:411)
	at org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.send(JBossRemotingRemoteCommunicator.java:323)
	at org.rhq.enterprise.communications.command.client.AbstractCommandClient.invoke(AbstractCommandClient.java:143)
	at org.rhq.enterprise.communications.command.client.ClientCommandSender.send(ClientCommandSender.java:1084)
	at org.rhq.enterprise.communications.command.client.ClientCommandSenderTask.send(ClientCommandSenderTask.java:229)
	at org.rhq.enterprise.communications.command.client.ClientCommandSenderTask.call(ClientCommandSenderTask.java:107)
	at org.rhq.enterprise.communications.command.client.ClientCommandSenderTask.call(ClientCommandSenderTask.java:55)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:662)
"ClientCommandSenderTask Timer Thread #4":
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00000000e0223dc0> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:196)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireNanos(AbstractQueuedSynchronizer.java:905)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireNanos(AbstractQueuedSynchronizer.java:1224)
	at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.tryLock(ReentrantReadWriteLock.java:976)
	at org.rhq.enterprise.agent.AgentMain.sendConnectRequestToServer(AgentMain.java:2155)
	at org.rhq.enterprise.agent.AgentMain.switchCommServer(AgentMain.java:2108)
	at org.rhq.enterprise.agent.AgentMain.failoverToNewServer(AgentMain.java:2066)
	- locked <0x00000000e0223d60> (a [J)
	at org.rhq.enterprise.agent.FailoverFailureCallback.failureDetected(FailoverFailureCallback.java:104)
	at org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.invokeFailureCallbackIfNeeded(JBossRemotingRemoteCommunicator.java:457)
	at org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.sendWithoutInitializeCallback(JBossRemotingRemoteCommunicator.java:310)
	at org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.send(JBossRemotingRemoteCommunicator.java:328)
	at org.rhq.enterprise.communications.command.client.AbstractCommandClient.invoke(AbstractCommandClient.java:143)
	at org.rhq.enterprise.communications.command.client.ClientCommandSender.send(ClientCommandSender.java:1084)
	at org.rhq.enterprise.communications.command.client.ClientCommandSenderTask.send(ClientCommandSenderTask.java:229)
	at org.rhq.enterprise.communications.command.client.ClientCommandSenderTask.call(ClientCommandSenderTask.java:107)
	at org.rhq.enterprise.communications.command.client.ClientCommandSenderTask.call(ClientCommandSenderTask.java:55)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:662)
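The cycle in the dump above is a classic lock-ordering inversion: one thread holds the communicator's ReentrantReadWriteLock and then waits on the failover monitor, while another thread holds the monitor and then parks in WriteLock.tryLock. The following is a minimal standalone sketch of that pattern, not RHQ code; the latches only sequence the threads so the inversion happens deterministically, and the bounded tryLock is what lets the demo terminate instead of hanging:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class LockInversionDemo {
    public static void main(String[] args) throws InterruptedException {
        final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock(); // stands in for the communicator's lock
        final Object monitor = new Object();                                // stands in for the failover monitor (the "[J" object)
        final CountDownLatch aHoldsRwLock = new CountDownLatch(1);
        final CountDownLatch bHoldsMonitor = new CountDownLatch(1);
        final AtomicBoolean bTimedOut = new AtomicBoolean(false);

        // Thread A: takes the write lock first, then wants the monitor
        // (like Timer Thread #10: holds the rwlock, blocks in failoverToNewServer).
        Thread a = new Thread(() -> {
            rwLock.writeLock().lock();
            try {
                aHoldsRwLock.countDown();
                try { bHoldsMonitor.await(); } catch (InterruptedException ignored) { }
                synchronized (monitor) { /* blocks until B releases the monitor */ }
            } finally {
                rwLock.writeLock().unlock();
            }
        });

        // Thread B: takes the monitor first, then tries the write lock
        // (like Timer Thread #4: holds the monitor, parks in WriteLock.tryLock).
        Thread b = new Thread(() -> {
            synchronized (monitor) {
                bHoldsMonitor.countDown();
                try {
                    aHoldsRwLock.await();
                    // With an unbounded wait here, both threads hang forever;
                    // the timeout breaks the cycle and lets the demo finish.
                    if (rwLock.writeLock().tryLock(500, TimeUnit.MILLISECONDS)) {
                        rwLock.writeLock().unlock();
                    } else {
                        bTimedOut.set(true);
                    }
                } catch (InterruptedException ignored) { }
            }
        });

        a.start(); b.start();
        a.join(); b.join();
        System.out.println("B timed out waiting for write lock: " + bTimedOut.get());
    }
}
```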


Version-Release number of selected component (if applicable): 4.12


How reproducible: Unclear; it appears to be caused by client failover combined with server timeouts


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 5 Elias Ross 2014-07-21 22:47:04 UTC
A couple of qualifications:

1) I don't think this is a new problem in this release; it's likely a long-standing one.
2) The server was busy doing plugin updates and the like, and was probably not responsive.
3) It doesn't really feel like a traditional deadlock, in that the locks are acquired out of order.
4) I haven't observed this since the server stabilized.
5) It doesn't seem to be that serious an issue.

If I were to address this as a design problem, I think the following makes sense:

Run failover in a separate thread: when failover is triggered, submit a switchover task to an Executor, so the JBossRemotingRemoteCommunicator does not hold its locks while the switch happens. Rather than having callers 'wait', they would simply throw exceptions until failover completes. I'm not sure the current logic supports this, but it would avoid blocking sender threads while locks are held.
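The design proposed above could be sketched roughly as follows. This is a hypothetical illustration, not RHQ code: the class name, the `switchToNextServer` placeholder, and the fail-fast `checkAvailable` method are all invented for the example. The point is that the failure callback only schedules the switchover and returns, so no sender thread holds communicator locks across the switch, and concurrent senders fail fast instead of waiting:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class AsyncFailover {
    // Single thread so at most one switchover runs at a time.
    private final ExecutorService failoverExecutor = Executors.newSingleThreadExecutor();
    private final AtomicBoolean failoverInProgress = new AtomicBoolean(false);

    /** Called from a failure callback; schedules the switchover and returns immediately. */
    public void requestFailover() {
        // compareAndSet coalesces duplicate requests into one switchover.
        if (failoverInProgress.compareAndSet(false, true)) {
            failoverExecutor.submit(() -> {
                try {
                    switchToNextServer();
                } finally {
                    failoverInProgress.set(false);
                }
            });
        }
    }

    /** Senders call this before sending; they throw rather than wait on a lock. */
    public void checkAvailable() {
        if (failoverInProgress.get()) {
            throw new IllegalStateException("failover in progress; retry later");
        }
    }

    protected void switchToNextServer() {
        // placeholder: the real implementation would re-point the communicator
        // at the next server in the failover list
    }

    public static void main(String[] args) throws InterruptedException {
        AsyncFailover f = new AsyncFailover();
        f.checkAvailable();           // no failover in progress, does not throw
        f.requestFailover();
        f.failoverExecutor.shutdown();
        f.failoverExecutor.awaitTermination(5, TimeUnit.SECONDS);
        f.checkAvailable();           // failover finished, does not throw
        System.out.println("failover completed, sender may proceed");
    }
}
```

A sender hitting the exception would treat it like any other transient send failure and retry, which matches the "throw until failover completes" idea rather than parking on the write lock.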

