Description of problem:

I've observed agents hanging at failover. It seems that RHQ does not set any sort of agent timeout.

"ClientCommandSenderTask Timer Thread #4489" daemon prio=10 tid=0x0000000045514000 nid=0x6ba7 runnable [0x0000000046ce3000]
   java.lang.Thread.State: RUNNABLE
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:129)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
    - locked <0x00000000e1389748> (a java.io.BufferedInputStream)
    at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:697)
    at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:640)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195)
    - locked <0x00000000e13897f0> (a sun.net.www.protocol.http.HttpURLConnection)
    at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379)
    at org.jboss.remoting.transport.http.HTTPClientInvoker.getResponseCode(HTTPClientInvoker.java:1325)
    at org.jboss.remoting.transport.http.HTTPClientInvoker.useHttpURLConnection(HTTPClientInvoker.java:372)
    at org.jboss.remoting.transport.http.HTTPClientInvoker.makeInvocation(HTTPClientInvoker.java:253)
    at org.jboss.remoting.transport.http.HTTPClientInvoker.transport(HTTPClientInvoker.java:176)
    at org.jboss.remoting.MicroRemoteClientInvoker.invoke(MicroRemoteClientInvoker.java:169)
    at org.jboss.remoting.Client.invoke(Client.java:2084)
    at org.jboss.remoting.Client.invoke(Client.java:879)
    at org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.rawSend(JBossRemotingRemoteCommunicator.java:514)
    at org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.sendWithoutCallbacks(JBossRemotingRemoteCommunicator.java:456)
    at org.rhq.enterprise.agent.AgentMain.sendConnectRequestToServer(AgentMain.java:2114)
    at org.rhq.enterprise.agent.AgentMain.switchCommServer(AgentMain.java:2049)
    at org.rhq.enterprise.agent.AgentMain.failoverToNewServer(AgentMain.java:2007)
    - locked <0x00000000e0164af8> (a [J)
    at org.rhq.enterprise.agent.FailoverFailureCallback.failureDetected(FailoverFailureCallback.java:104)
    at org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.invokeFailureCallbackIfNeeded(JBossRemotingRemoteCommunicator.java:625)
    at org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.sendWithoutInitializeCallback(JBossRemotingRemoteCommunicator.java:478)
    at org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.send(JBossRemotingRemoteCommunicator.java:496)
    at org.rhq.enterprise.communications.command.client.AbstractCommandClient.invoke(AbstractCommandClient.java:143)
    at org.rhq.enterprise.communications.command.client.ClientCommandSender.send(ClientCommandSender.java:1084)
    at org.rhq.enterprise.communications.command.client.ClientCommandSenderTask.send(ClientCommandSenderTask.java:229)
    at org.rhq.enterprise.communications.command.client.ClientCommandSenderTask.call(ClientCommandSenderTask.java:107)
    at org.rhq.enterprise.communications.command.client.ClientCommandSenderTask.call(ClientCommandSenderTask.java:55)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

...

"ClientCommandSenderTask Timer Thread #4488" daemon prio=10 tid=0x000000004436d000 nid=0x6ba6 waiting for monitor entry [0x0000000046ee6000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at org.rhq.enterprise.agent.AgentMain.failoverToNewServer(AgentMain.java:1992)
    - waiting to lock <0x00000000e0164af8> (a [J)
    at org.rhq.enterprise.agent.FailoverFailureCallback.failureDetected(FailoverFailureCallback.java:104)
    at org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.invokeFailureCallbackIfNeeded(JBossRemotingRemoteCommunicator.java:625)
    at org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.sendWithoutInitializeCallback(JBossRemotingRemoteCommunicator.java:478)
    at org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.send(JBossRemotingRemoteCommunicator.java:496)
    at org.rhq.enterprise.communications.command.client.AbstractCommandClient.invoke(AbstractCommandClient.java:143)
    at org.rhq.enterprise.communications.command.client.ClientCommandSender.send(ClientCommandSender.java:1084)
    at org.rhq.enterprise.communications.command.client.ClientCommandSenderTask.send(ClientCommandSenderTask.java:229)
    at org.rhq.enterprise.communications.command.client.ClientCommandSenderTask.call(ClientCommandSenderTask.java:107)
    at org.rhq.enterprise.communications.command.client.ClientCommandSenderTask.call(ClientCommandSenderTask.java:55)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

"RHQ Server Polling Thread" daemon prio=10 tid=0x00002aaab0d5c000 nid=0x5f56 waiting for monitor entry [0x0000000043de8000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at org.rhq.enterprise.agent.AgentMain.failoverToNewServer(AgentMain.java:1992)
    - waiting to lock <0x00000000e0164af8> (a [J)
    at org.rhq.enterprise.agent.FailoverFailureCallback.failureDetected(FailoverFailureCallback.java:104)
    at org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.invokeFailureCallbackIfNeeded(JBossRemotingRemoteCommunicator.java:625)
    at org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.sendWithoutInitializeCallback(JBossRemotingRemoteCommunicator.java:478)
    at org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.send(JBossRemotingRemoteCommunicator.java:496)
    at org.rhq.enterprise.communications.command.client.AbstractCommandClient.invoke(AbstractCommandClient.java:143)
    at org.rhq.enterprise.communications.command.client.ClientCommandSender.send(ClientCommandSender.java:1084)
    at org.rhq.enterprise.communications.command.client.ServerPollingThread.run(ServerPollingThread.java:100)

Version-Release number of selected component (if applicable): 4.9

How reproducible: sometimes

Steps to Reproduce:
1. Create an HTTP listener that does not accept a socket or simply does not send any response

Actual results:
The agent hangs and will not fail over to the working server.

Expected results:
The agent fails over to the correct server.

Additional info:
It appears JBoss Remoting has a 'timeout' option which can be set. This probably works but hasn't been tested...

diff --git a/modules/enterprise/agent/src/main/java/org/rhq/enterprise/agent/AgentMain.java b/modules/enterprise/agent/src/main/java/org/rhq/enterprise/agent/AgentMain.java
index b46fb4b..3274a8e 100644
--- a/modules/enterprise/agent/src/main/java/org/rhq/enterprise/agent/AgentMain.java
+++ b/modules/enterprise/agent/src/main/java/org/rhq/enterprise/agent/AgentMain.java
@@ -2692,6 +2692,13 @@ private RemoteCommunicator createServerRemoteCommunicator(String uri, boolean wi
             config.put(HTTPSClientInvoker.IGNORE_HTTPS_HOST, "true");
         }
 
+        // The HTTP transport can hang for a long, long time
+        // If for example the server is hung, this ensures we do not wait forever and can failover
+        long timeout = m_configuration.getClientSenderCommandTimeout() / 1000;
+        if (timeout > 0) {
+            config.put("timeout", Long.toString(timeout));
+        }
+
         RemoteCommunicator remote_comm = new JBossRemotingRemoteCommunicator(uri, config);
         if (withFailover) {
             remote_comm.setFailureCallback(new FailoverFailureCallback(this));

Note the default timeout is 10 minutes, which is really too long for a reasonable connect or read timeout. I'm not sure whether it's worth adding a separate setting for this, however.
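
For anyone trying to reproduce the hang outside the agent, here is a minimal sketch using only the plain JDK HttpURLConnection (the same class that is stuck in socketRead0 in the RUNNABLE thread above). It does not go through JBoss Remoting or its 'timeout' option. The class name ReadTimeoutDemo, the method timesOut and the 500 ms value are made up for illustration. The listener binds a port but never accepts or replies, which matches the reproduction step; with a read timeout set, the call should give up with a SocketTimeoutException instead of blocking indefinitely.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.Proxy;
import java.net.ServerSocket;
import java.net.SocketTimeoutException;
import java.net.URI;
import java.net.URL;

// Hypothetical demo class, not part of RHQ or JBoss Remoting.
class ReadTimeoutDemo {

    /**
     * Connects to a local listener that never answers and reports whether
     * the HTTP call gave up with a SocketTimeoutException.
     */
    static boolean timesOut(int timeoutMillis) throws IOException {
        // The OS completes the TCP handshake through the listen backlog, but
        // nothing ever calls accept() or writes a response, so the client's
        // read of the HTTP status line has nothing to return.
        try (ServerSocket silent = new ServerSocket(0)) {
            URL url = URI.create("http://127.0.0.1:" + silent.getLocalPort() + "/").toURL();
            HttpURLConnection conn = (HttpURLConnection) url.openConnection(Proxy.NO_PROXY);
            conn.setConnectTimeout(timeoutMillis);
            conn.setReadTimeout(timeoutMillis); // 0 (the default) means wait forever
            try {
                conn.getResponseCode(); // the call that blocks in the thread dump
                return false;
            } catch (SocketTimeoutException expected) {
                return true;
            } finally {
                conn.disconnect();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("timed out: " + timesOut(500));
    }
}
```

Removing the setReadTimeout call in this sketch leaves the read with no deadline, which is the situation the thread holding the failover lock is in while the other two threads wait on it.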
Created attachment 856768
Patch for RHQ_4_9_0

Note that the patch is different from what is shown in the bug.
Merged in master

commit 9302f90bc0d81905bf62649666b7df0913acdac7
Author: Elias Ross <genman>
Date:   Wed Jan 29 17:02:42 2014 +0100
Bulk closing of 4.10 issues. If an issue is not solved for you, please open a new BZ (or clone the existing one) with a version designator of 4.10.