Bug 1130341

Summary: Agent can block indefinitely if it has received a remote command at startup but then asked to shutdown
Product: [Other] RHQ Project Reporter: Elias Ross <genman>
Component: AgentAssignee: Thomas Segismont <tsegismo>
Status: ON_QA --- QA Contact:
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 4.9CC: hrupp, tsegismo
Target Milestone: ---   
Target Release: RHQ 4.13   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Elias Ross 2014-08-15 00:22:13 UTC
Description of problem:

This is from 4.9 but I think there is something similar in 4.12.

"main" prio=10 tid=0x00007f95f8006800 nid=0x7060 waiting for monitor entry [0x00007f95fe76f000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.jboss.remoting.transport.socket.ServerThread.shutdown(ServerThread.java:344)
        - waiting to lock <0x00000000f6868070> (a org.jboss.remoting.transport.socket.ServerThread)
        at org.jboss.remoting.transport.socket.SocketServerInvoker.cleanup(SocketServerInvoker.java:473)
        - locked <0x00000000f6868140> (a org.jboss.remoting.transport.socket.LRUPool)
        at org.jboss.remoting.transport.socket.SocketServerInvoker.stop(SocketServerInvoker.java:427)
        - locked <0x00000000f68800f0> (a org.jboss.remoting.transport.socket.SocketServerInvoker)
        at org.jboss.remoting.transport.Connector.stop(Connector.java:1175)
        at org.rhq.enterprise.communications.ServiceContainer.shutdown(ServiceContainer.java:570)
        at org.rhq.enterprise.agent.AgentMain.shutdownCommServices(AgentMain.java:2475)
        at org.rhq.enterprise.agent.AgentMain.shutdown(AgentMain.java:836)
        - locked <0x00000000f688a540> (a java.lang.Object)
        at org.rhq.enterprise.agent.AgentMain.start(AgentMain.java:722)
        - locked <0x00000000f688a540> (a java.lang.Object)
        at org.rhq.enterprise.agent.AgentMain.main(AgentMain.java:437)


"WorkerThread#0[17.172.21.185:44011]" prio=10 tid=0x00007f9570028800 nid=0x7084 waiting on condition [0x00007f95d6d9d000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000f68929a0> (a java.util.concurrent.CountDownLatch$Sync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:969)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1281)
        at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:207)
        at org.rhq.enterprise.agent.AgentMain$BootstrapLatchCommandListener.receivedCommand(AgentMain.java:3635)
        at org.rhq.enterprise.communications.command.server.CommandProcessor.notifyListenersOfReceivedCommand(CommandProcessor.java:402)
        at org.rhq.enterprise.communications.command.server.CommandProcessor.handleIncomingInvocationRequest(CommandProcessor.java:242)
        at org.rhq.enterprise.communications.command.server.CommandProcessor.invoke(CommandProcessor.java:184)
        at org.jboss.remoting.ServerInvoker.invoke(ServerInvoker.java:967)
        at org.jboss.remoting.transport.socket.ServerThread.completeInvocation(ServerThread.java:791)
        - locked <0x00000000f6868070> (a org.jboss.remoting.transport.socket.ServerThread)
        at org.jboss.remoting.transport.socket.ServerThread.processInvocation(ServerThread.java:744)
        at org.jboss.remoting.transport.socket.ServerThread.dorun(ServerThread.java:586)
        at org.jboss.remoting.transport.socket.ServerThread.run(ServerThread.java:234)


What is going on the WorkerThread is locking the ServerThread object, and 'main' cannot lock the same object.

ServerThread: (decompiled)

/*      */   public synchronized void shutdown()
/*      */   {
/*  344 */     if (trace) log.trace("attempting to shut down " + this);
/*  345 */     this.shutdown = true;

AgentMain: (simple version)

    private class BootstrapLatchCommandListener implements CommandListener {

        private final CountDownLatch m_latch = new CountDownLatch(1);
        private final String PING_INTERFACE_NAME = Ping.class.getName();

        public void allowAllCommands(ServiceContainer container) {
            m_latch.countDown();
        }

        public void receivedCommand(Command command) {
            m_latch.await(); // Timeout should go here?
        }

What needs to happen is either the countdown latch needs to time out, or some sort of thread interruption needs to take place for the command listeners. Either way, it is not good to simply block forever.

Here are the logs:


2014-08-09 20:03:05,171 ERROR [RHQ Agent Registration Thread] (AgentMain)- {AgentMain.agent-registration-rejected}The server has rejected the agent registration request. Cause: [org.rhq.core.clientapi.server.core.AgentRegistrationException:The agent [xyz] is attempting to re-register without a security token. Please consult an administrator to obtain the agent's proper security token and restart the agent with the option "-Drhq.agent.security-token=<the valid security token>". An administrator can find the agent's security token by navigating to the GUI page "Administration (Topology) > Agents" and drilling down to this specific agent. You will see the long security token string there. For more information, read: https://docs.jboss.org/author/display/RHQ/Agent+Registration]
2014-08-09 20:03:05,172 ERROR [RHQ Agent Registration Thread] (AgentMain)- {AgentMain.agent-registration-retry}Will retry the agent registration request soon...
2014-08-09 20:03:33,037 FATAL [main] (AgentMain)- {AgentMain.startup-error}The agent encountered an error during startup and must abort
org.rhq.core.clientapi.server.core.AgentRegistrationException: The agent cannot register with the server. Admin intervention needed!
        at org.rhq.enterprise.agent.AgentMain.start(AgentMain.java:670)
        at org.rhq.enterprise.agent.AgentMain.main(AgentMain.java:437)
2014-08-09 20:03:33,039 INFO  [main] (AgentMain)- {AgentMain.shutting-down}Agent is being shut down...
2014-08-09 20:03:33,049 INFO  [main] (PluginContainer)- Plugin container is already shut down.
2014-08-09 20:03:33,050 WARN  [main] (AgentMain)- {AgentMain.failed-to-shutdown-component}Agent failed to shutdown component [Plugin Container]. Cause: java.lang.NullPointerException:null
2014-08-09 20:03:33,202 WARN  [main] (AgentMain)- {AgentMain.server-shutdown-notification-failure}Agent failed to notify the server of the pending shutdown. Cause: org.rhq.enterprise.communications.command.server.AuthenticationException:Command failed to be authenticated!  This command will be ignored and not processed: Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.agent-name=xyz, rhq.externalizable-strategy=AGENT, rhq.timeout=10000}]; params=[{targetInterfaceName=org.rhq.core.clientapi.server.core.CoreServerService, invocation=NameBasedInvocation[agentIsShuttingDown]}]
2014-08-09 20:03:33,202 INFO  [RHQ Server Polling Thread] (ServerPollingThread)- {ServerPollingThread.server-online}The server has come back online; client has been told to start sending commands again
2014-08-09 20:03:33,204 INFO  [main] (ServiceContainer)- {ServiceContainer.shutting-down}Service container shutting down...

--- Note the Agent is still stuck in shutdown 5 days later --


Version-Release number of selected component (if applicable): 4.9


How reproducible: Unclear, depends on conditions I suppose


Steps to Reproduce: (possible?)
1. Start the agent, but the agent fails to register
2. The server sends a command to the agent, which triggers an 'await'
3. Attempt to shutdown


Actual results: Hung, similar stacktrace


Expected results: Agent stops


Additional info: This isn't the exact 4.9 release, I have patched with changes that went into 4.10. But the same code exists in master and 4.12.

Comment 1 Thomas Segismont 2014-11-20 09:31:22 UTC
Fixed in master, thanks Elias.

  Branch: refs/heads/master
  Home:   https://github.com/rhq-project/rhq
  Commit: 27af90bb1aa82733ce022596f8fcf1f4564d11d3
      https://github.com/rhq-project/rhq/commit/27af90bb1aa82733ce022596f8fcf1f4564d11d3
  Author: Elias Ross <elias_ross>
  Date:   2014-10-15 (Wed, 15 Oct 2014)

  Changed paths:
    M modules/enterprise/agent/src/main/java/org/rhq/enterprise/agent/AgentMain.java
    M modules/enterprise/comm/src/main/java/org/rhq/enterprise/communications/command/server/CommandProcessor.java

  Log Message:
  -----------
  BZ 1130341 - Agent can be stuck waiting for startup

The agent can block forever in a JBoss remoting callback if it receives a
message that triggers a wait on the latch. Change to wait a finite amount of
time, then throw an exception.

Also, clean up the CommandProcessor to use CopyOnWriteArrayList, no need to
copy lists like this.


  Commit: db607e13a2772f6655399e418875a0a45e373ab9
      https://github.com/rhq-project/rhq/commit/db607e13a2772f6655399e418875a0a45e373ab9
  Author: Elias Ross <elias_ross>
  Date:   2014-10-17 (Fri, 17 Oct 2014)

  Changed paths:
    M modules/enterprise/agent/src/main/java/org/rhq/enterprise/agent/AgentMain.java

  Log Message:
  -----------
  For timeout at startup, use 30 seconds (fixed)


  Commit: a260f33265a26ddf64c3372380c8c09f7b5be82c
      https://github.com/rhq-project/rhq/commit/a260f33265a26ddf64c3372380c8c09f7b5be82c
  Author: Elias Ross <elias_ross>
  Date:   2014-11-03 (Mon, 03 Nov 2014)

  Changed paths:
    M modules/enterprise/agent/src/main/java/org/rhq/enterprise/agent/AgentMain.java

  Log Message:
  -----------
  Use NotProcessedException to indicate startup latch timed out


  Commit: 288fac0e6a540061a940d0d3be8b69e79e15e8bd
      https://github.com/rhq-project/rhq/commit/288fac0e6a540061a940d0d3be8b69e79e15e8bd
  Author: Thomas Segismont <tsegismont>
  Date:   2014-11-20 (Thu, 20 Nov 2014)

  Changed paths:
    M modules/enterprise/agent/src/main/java/org/rhq/enterprise/agent/AgentMain.java
    M modules/enterprise/comm/src/main/java/org/rhq/enterprise/communications/command/server/CommandProcessor.java

  Log Message:
  -----------
  Merge pull request #141 from genman/bug/1130341

BZ 1130341 - Agent can be stuck waiting for startup


Compare: https://github.com/rhq-project/rhq/compare/9fa2131e6a62...288fac0e6a54