Description of problem: This is from 4.9 but I think there is something similar in 4.12. "main" prio=10 tid=0x00007f95f8006800 nid=0x7060 waiting for monitor entry [0x00007f95fe76f000] java.lang.Thread.State: BLOCKED (on object monitor) at org.jboss.remoting.transport.socket.ServerThread.shutdown(ServerThread.java:344) - waiting to lock <0x00000000f6868070> (a org.jboss.remoting.transport.socket.ServerThread) at org.jboss.remoting.transport.socket.SocketServerInvoker.cleanup(SocketServerInvoker.java:473) - locked <0x00000000f6868140> (a org.jboss.remoting.transport.socket.LRUPool) at org.jboss.remoting.transport.socket.SocketServerInvoker.stop(SocketServerInvoker.java:427) - locked <0x00000000f68800f0> (a org.jboss.remoting.transport.socket.SocketServerInvoker) at org.jboss.remoting.transport.Connector.stop(Connector.java:1175) at org.rhq.enterprise.communications.ServiceContainer.shutdown(ServiceContainer.java:570) at org.rhq.enterprise.agent.AgentMain.shutdownCommServices(AgentMain.java:2475) at org.rhq.enterprise.agent.AgentMain.shutdown(AgentMain.java:836) - locked <0x00000000f688a540> (a java.lang.Object) at org.rhq.enterprise.agent.AgentMain.start(AgentMain.java:722) - locked <0x00000000f688a540> (a java.lang.Object) at org.rhq.enterprise.agent.AgentMain.main(AgentMain.java:437) "WorkerThread#0[17.172.21.185:44011]" prio=10 tid=0x00007f9570028800 nid=0x7084 waiting on condition [0x00007f95d6d9d000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x00000000f68929a0> (a java.util.concurrent.CountDownLatch$Sync) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:156) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:969) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1281) at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:207) at org.rhq.enterprise.agent.AgentMain$BootstrapLatchCommandListener.receivedCommand(AgentMain.java:3635) at org.rhq.enterprise.communications.command.server.CommandProcessor.notifyListenersOfReceivedCommand(CommandProcessor.java:402) at org.rhq.enterprise.communications.command.server.CommandProcessor.handleIncomingInvocationRequest(CommandProcessor.java:242) at org.rhq.enterprise.communications.command.server.CommandProcessor.invoke(CommandProcessor.java:184) at org.jboss.remoting.ServerInvoker.invoke(ServerInvoker.java:967) at org.jboss.remoting.transport.socket.ServerThread.completeInvocation(ServerThread.java:791) - locked <0x00000000f6868070> (a org.jboss.remoting.transport.socket.ServerThread) at org.jboss.remoting.transport.socket.ServerThread.processInvocation(ServerThread.java:744) at org.jboss.remoting.transport.socket.ServerThread.dorun(ServerThread.java:586) at org.jboss.remoting.transport.socket.ServerThread.run(ServerThread.java:234) What is going on the WorkerThread is locking the ServerThread object, and 'main' cannot lock the same object. ServerThread: (decompiled) /* */ public synchronized void shutdown() /* */ { /* 344 */ if (trace) log.trace("attempting to shut down " + this); /* 345 */ this.shutdown = true; AgentMain: (simple version) private class BootstrapLatchCommandListener implements CommandListener { private final CountDownLatch m_latch = new CountDownLatch(1); private final String PING_INTERFACE_NAME = Ping.class.getName(); public void allowAllCommands(ServiceContainer container) { m_latch.countDown(); } public void receivedCommand(Command command) { m_latch.await(); // Timeout should go here? } What needs to happen is either the countdown latch needs to time out, or some sort of thread interruption needs to take place for the command listeners. Either way, it is not good to simply block forever. Here are the logs: 2014-08-09 20:03:05,171 ERROR [RHQ Agent Registration Thread] (AgentMain)- {AgentMain.agent-registration-rejected}The server has rejected the agent registration request. Cause: [org.rhq.core.clientapi.server.core.AgentRegistrationException:The agent [xyz] is attempting to re-register without a security token. Please consult an administrator to obtain the agent's proper security token and restart the agent with the option "-Drhq.agent.security-token=<the valid security token>". An administrator can find the agent's security token by navigating to the GUI page "Administration (Topology) > Agents" and drilling down to this specific agent. You will see the long security token string there. For more information, read: https://docs.jboss.org/author/display/RHQ/Agent+Registration] 2014-08-09 20:03:05,172 ERROR [RHQ Agent Registration Thread] (AgentMain)- {AgentMain.agent-registration-retry}Will retry the agent registration request soon... 2014-08-09 20:03:33,037 FATAL [main] (AgentMain)- {AgentMain.startup-error}The agent encountered an error during startup and must abort org.rhq.core.clientapi.server.core.AgentRegistrationException: The agent cannot register with the server. Admin intervention needed! at org.rhq.enterprise.agent.AgentMain.start(AgentMain.java:670) at org.rhq.enterprise.agent.AgentMain.main(AgentMain.java:437) 2014-08-09 20:03:33,039 INFO [main] (AgentMain)- {AgentMain.shutting-down}Agent is being shut down... 2014-08-09 20:03:33,049 INFO [main] (PluginContainer)- Plugin container is already shut down. 2014-08-09 20:03:33,050 WARN [main] (AgentMain)- {AgentMain.failed-to-shutdown-component}Agent failed to shutdown component [Plugin Container]. Cause: java.lang.NullPointerException:null 2014-08-09 20:03:33,202 WARN [main] (AgentMain)- {AgentMain.server-shutdown-notification-failure}Agent failed to notify the server of the pending shutdown. Cause: org.rhq.enterprise.communications.command.server.AuthenticationException:Command failed to be authenticated! This command will be ignored and not processed: Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.agent-name=xyz, rhq.externalizable-strategy=AGENT, rhq.timeout=10000}]; params=[{targetInterfaceName=org.rhq.core.clientapi.server.core.CoreServerService, invocation=NameBasedInvocation[agentIsShuttingDown]}] 2014-08-09 20:03:33,202 INFO [RHQ Server Polling Thread] (ServerPollingThread)- {ServerPollingThread.server-online}The server has come back online; client has been told to start sending commands again 2014-08-09 20:03:33,204 INFO [main] (ServiceContainer)- {ServiceContainer.shutting-down}Service container shutting down... --- Note the Agent is still stuck in shutdown 5 days later -- Version-Release number of selected component (if applicable): 4.9 How reproducible: Unclear, depends on conditions I suppose Steps to Reproduce: (possible?) 1. Start the agent, but the agent fails to register 2. The server sends a command to the agent, which triggers an 'await' 3. Attempt to shutdown Actual results: Hung, similar stacktrace Expected results: Agent stops Additional info: This isn't the exact 4.9 release, I have patched with changes that went into 4.10. But the same code exists in master and 4.12.
Fixed in master, thanks Elias. Branch: refs/heads/master Home: https://github.com/rhq-project/rhq Commit: 27af90bb1aa82733ce022596f8fcf1f4564d11d3 https://github.com/rhq-project/rhq/commit/27af90bb1aa82733ce022596f8fcf1f4564d11d3 Author: Elias Ross <elias_ross> Date: 2014-10-15 (Wed, 15 Oct 2014) Changed paths: M modules/enterprise/agent/src/main/java/org/rhq/enterprise/agent/AgentMain.java M modules/enterprise/comm/src/main/java/org/rhq/enterprise/communications/command/server/CommandProcessor.java Log Message: ----------- BZ 1130341 - Agent can be stuck waiting for startup The agent can block forever in a JBoss remoting callback if it receives a message that triggers a wait on the latch. Change to wait a finite amount of time, then throw an exception. Also, clean up the CommandProcessor to use CopyOnWriteArrayList, no need to copy lists like this. Commit: db607e13a2772f6655399e418875a0a45e373ab9 https://github.com/rhq-project/rhq/commit/db607e13a2772f6655399e418875a0a45e373ab9 Author: Elias Ross <elias_ross> Date: 2014-10-17 (Fri, 17 Oct 2014) Changed paths: M modules/enterprise/agent/src/main/java/org/rhq/enterprise/agent/AgentMain.java Log Message: ----------- For timeout at startup, use 30 seconds (fixed) Commit: a260f33265a26ddf64c3372380c8c09f7b5be82c https://github.com/rhq-project/rhq/commit/a260f33265a26ddf64c3372380c8c09f7b5be82c Author: Elias Ross <elias_ross> Date: 2014-11-03 (Mon, 03 Nov 2014) Changed paths: M modules/enterprise/agent/src/main/java/org/rhq/enterprise/agent/AgentMain.java Log Message: ----------- Use NotProcessedException to indicate startup latch timed out Commit: 288fac0e6a540061a940d0d3be8b69e79e15e8bd https://github.com/rhq-project/rhq/commit/288fac0e6a540061a940d0d3be8b69e79e15e8bd Author: Thomas Segismont <tsegismont> Date: 2014-11-20 (Thu, 20 Nov 2014) Changed paths: M modules/enterprise/agent/src/main/java/org/rhq/enterprise/agent/AgentMain.java M modules/enterprise/comm/src/main/java/org/rhq/enterprise/communications/command/server/CommandProcessor.java Log Message: ----------- Merge pull request #141 from genman/bug/1130341 BZ 1130341 - Agent can be stuck waiting for startup Compare: https://github.com/rhq-project/rhq/compare/9fa2131e6a62...288fac0e6a54