Description of problem:
The RHQ server and the machine on which the agent is installed are on the same network. When I run the agent, the following errors appear in its log:

2018-04-12 10:39:04,412 INFO [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.identify-version}Version=[RHQ 4.13.1], Build Number=[f37fe58], Build Date=[Jan 16, 2015 7:56 PM]
2018-04-12 10:39:04,771 INFO [main] (org.rhq.enterprise.communications.ServiceContainer)- {ServiceContainer.global-concurrency-limit-disabled}Global concurrency limit has been disabled - there is no limit to the number of incoming commands allowed
2018-04-12 10:39:05,677 INFO [main] (org.rhq.enterprise.communications.ServiceContainer)- {ServiceContainer.started}Service container started - ready to accept incoming commands
2018-04-12 10:39:05,693 INFO [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.no-auto-detect}Server auto-detection is not enabled - starting the poller immediately
2018-04-12 10:40:05,771 WARN [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.startup-registration-failed-retry}A server appears to be up, but the agent is not yet registered. Will wait again. wait counter=[4]
2018-04-12 10:40:05,771 WARN [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.startup-registration-failed-retry}A server appears to be up, but the agent is not yet registered. Will wait again. wait counter=[3]
2018-04-12 10:40:05,771 WARN [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.startup-registration-failed-retry}A server appears to be up, but the agent is not yet registered. Will wait again. wait counter=[2]
2018-04-12 10:40:05,771 WARN [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.startup-registration-failed-retry}A server appears to be up, but the agent is not yet registered. Will wait again. wait counter=[1]
2018-04-12 10:40:05,771 WARN [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.startup-registration-failed-retry}A server appears to be up, but the agent is not yet registered. Will wait again. wait counter=[0]
2018-04-12 10:40:05,771 FATAL [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.startup-error}The agent encountered an error during startup and must abort
org.rhq.core.clientapi.server.core.AgentRegistrationException: The agent cannot register with the server. Admin intervention needed!
    at org.rhq.enterprise.agent.AgentMain.start(AgentMain.java:744)
    at org.rhq.enterprise.agent.AgentMain.main(AgentMain.java:461)
2018-04-12 10:40:05,802 INFO [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.shutting-down}Agent is being shut down...
2018-04-12 10:40:05,865 INFO [main] (org.rhq.core.pc.PluginContainer)- Plugin container is already shut down.
2018-04-12 10:40:05,896 WARN [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.failed-to-shutdown-component}Agent failed to shutdown component [Plugin Container]. Cause: java.lang.NullPointerException:null
2018-04-12 10:40:05,912 WARN [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.server-shutdown-notification-failure}Agent failed to notify the server of the pending shutdown. Cause: org.rhq.enterprise.communications.command.server.AuthenticationException:Command failed to be authenticated! This command will be ignored and not processed: Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.agent-name=stend-db.com, rhq.externalizable-strategy=AGENT, rhq.timeout=10000}]; params=[{targetInterfaceName=org.rhq.core.clientapi.server.core.CoreServerService, invocation=NameBasedInvocation[agentIsShuttingDown]}]
2018-04-12 10:40:05,912 INFO [RHQ Server Polling Thread] (enterprise.communications.command.client.ServerPollingThread)- {ServerPollingThread.server-online}The server has come back online; client has been told to start sending commands again
2018-04-12 10:41:05,927 INFO [main] (org.rhq.enterprise.communications.ServiceContainer)- {ServiceContainer.shutting-down}Service container shutting down...
2018-04-12 10:41:05,927 INFO [main] (org.rhq.enterprise.communications.ServiceContainer)- {ServiceContainer.shutdown}Service container shut down - no longer accepting incoming commands
2018-04-12 10:41:05,927 INFO [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.shut-down}Agent has been shut down
2018-04-12 10:41:05,927 FATAL [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.start-failure}Failed to start the agent
org.rhq.core.clientapi.server.core.AgentRegistrationException: The agent cannot register with the server. Admin intervention needed!
    at org.rhq.enterprise.agent.AgentMain.start(AgentMain.java:744)
    at org.rhq.enterprise.agent.AgentMain.main(AgentMain.java:461)

At the same time, the following errors are logged on the server:

10:39:17,490 ERROR [org.rhq.enterprise.communications.command.client.ClientCommandSenderTask] (http-/0.0.0.0:80-42) {ClientCommandSenderTask.send-failed}Failed to send command [Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.timeout=10000, rhq.send-throttle=true}]; params=[{invocation=NameBasedInvocation[ping], targetInterfaceName=org.rhq.enterprise.communications.Ping}]]. Cause: java.util.concurrent.TimeoutException:null. Cause: java.util.concurrent.TimeoutException
10:40:06,206 WARN [org.rhq.enterprise.communications.command.server.CommandProcessor] (http-/0.0.0.0:80-41) {CommandProcessor.failed-authentication}Command failed to be authenticated! This command will be ignored and not processed: Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.agent-name=stend-db.com, rhq.externalizable-strategy=AGENT, rhq.timeout=10000}]; params=[{targetInterfaceName=org.rhq.core.clientapi.server.core.CoreServerService, invocation=NameBasedInvocation[agentIsShuttingDown]}]
10:40:37,491 ERROR [org.rhq.enterprise.communications.command.client.ClientCommandSenderTask] (http-/0.0.0.0:80-42) {ClientCommandSenderTask.send-failed}Failed to send command [Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.timeout=20000, rhq.send-throttle=true}]; params=[{invocation=NameBasedInvocation[ping], targetInterfaceName=org.rhq.enterprise.communications.Ping}]]. Cause: java.util.concurrent.TimeoutException:null. Cause: java.util.concurrent.TimeoutException
10:41:37,492 WARN [org.rhq.enterprise.server.core.CoreServerServiceImpl] (http-/0.0.0.0:80-42) org.rhq.core.clientapi.server.core.AgentRegistrationException: Server cannot ping the agent's endpoint. The agent's endpoint is probably invalid or there is a firewall preventing the server from connecting to the agent. Endpoint: socket://stend-db.com:16163/?rhq.communications.connector.rhqtype=agent&numAcceptThreads=1&maxPoolSize=303&clientMaxPoolSize=304&socketTimeout=60000&enableTcpNoDelay=true&backlog=200&generalizeSocketException=true

The errors clearly indicate a connection problem. However, the server is reachable from the agent (ping, telnet to port 80), and the agent is reachable from the server (ping stend-db.com, telnet to port 16163 during agent startup).

Version-Release number of selected component (if applicable):
4.13.1 server and 4.13.1 agent.

How reproducible:
I'm not sure how this happened; it simply stopped working one day, after working fine for several years.

Additional info:
I tried to reconfigure the agent with --fullcleanconfig, and also completely reinstalled it, but nothing helped.
On the server side, I deleted the agent's record when reconfiguring it. The network administrators also confirmed that there is no equipment between the two machines that would terminate the connection.

OS server: CentOS Linux release 7.3.1611 (Core), kernel 3.10.0-514.6.1.el7.x86_64
OS agent: Windows Server 2008 R2 EE
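A manual telnet test shows that some address answers on the port, but not which address the name resolved to. As a rough sketch (not part of RHQ; the function name `probe` is made up for illustration), a check run on the server host could report both. Note that this uses the operating system's resolver, which is not necessarily the mapping the RHQ server's Java process ends up using.

```python
import socket


def probe(host, port, timeout=5.0):
    """Try a TCP connection and report which IPv4 address was used.

    Returns (ip, True) on success, (ip, False) if the connection failed,
    or (None, False) if the name could not be resolved at all.
    """
    try:
        # Resolve explicitly so the address is visible in the result.
        infos = socket.getaddrinfo(host, port, socket.AF_INET, socket.SOCK_STREAM)
    except socket.gaierror:
        return None, False
    ip = infos[0][4][0]
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return ip, True
    except OSError:
        return ip, False


if __name__ == "__main__":
    # Host and port are taken from the agent endpoint in the server log.
    print(probe("stend-db.com", 16163))
```

If the reported address differs from the one in the hosts file, name resolution rather than the network path is the first thing to look at.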
Additional info:
We captured traffic to port 16163 on the server with tcpdump and saw something strange when the agent started:

11:27:11.201822 IP rhq.aplana.com.34262 > 90.101.168.191.isp.timbrasil.com.br.16163: Flags [S], seq 751041033, win 29200, options [mss 1460,sackOK,TS val 2225543221 ecr 0,nop,wscale 7], length 0

Three things are odd:
1. The IP address in the hosts file for the agent's domain name is 192.168.101.90, not 191.168.101.90, although a couple of weeks ago it was listed there as 191.168.101.90.
2. With tcpdump -n, the connection goes to IP 191.168.101.90, but pinging 90.101.168.191.isp.timbrasil.com.br does not resolve to 191.168.101.90.
3. 90.101.168.191.isp.timbrasil.com.br is not the agent's legitimate domain name; its domain name is completely different (stend-db.com). It is not clear where this address came from.

The connection is initiated by the RHQ server's Java process.

Adding the following entry to hosts does not help:
192.168.101.90 90.101.168.191.isp.timbrasil.com.br

Most interestingly, with --fullcleanconfig and the agent address specified by IP (192.168.101.90) there are no errors, but with the domain name there are.
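One observation that may help narrow this down: the leading labels of 90.101.168.191.isp.timbrasil.com.br are the octets of 191.168.101.90 in reverse order, which is the usual shape of a reverse-DNS name. tcpdump without -n prints reverse-lookup names, so the hostname is consistent with the server simply connecting to 191.168.101.90, the old hosts entry. Why the Java process would still use that address is not shown by the logs; a stale cached or stored mapping is only a guess. The sketch below uses hypothetical helper names and checks both points: whether a name looks like the reverse form of an address, and whether a hostname currently resolves to the expected IP.

```python
import socket


def ptr_style_prefix(ip):
    """Return the octets of an IPv4 address in reverse order, dot-joined."""
    return ".".join(reversed(ip.split(".")))


def looks_like_ptr_name(name, ip):
    """True if `name` starts with the reversed octets of `ip`."""
    return name.startswith(ptr_style_prefix(ip) + ".")


def check_forward(host, expected_ip, resolve=socket.gethostbyname_ex):
    """Resolve `host` and report whether `expected_ip` is among the results.

    `resolve` is injectable so the check can be exercised without DNS.
    Returns (matches, list_of_resolved_addresses).
    """
    _name, _aliases, addresses = resolve(host)
    return expected_ip in addresses, addresses


if __name__ == "__main__":
    odd_name = "90.101.168.191.isp.timbrasil.com.br"
    print(looks_like_ptr_name(odd_name, "191.168.101.90"))
    print(looks_like_ptr_name(odd_name, "192.168.101.90"))
    # Run on the server host; uses the OS resolver, not the JVM's.
    print(check_forward("stend-db.com", "192.168.101.90"))
```

If `check_forward` reports the correct address on the server host while tcpdump still shows connections to 191.168.101.90, the wrong address is coming from somewhere other than the operating system's current name resolution.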