Bug 1566378 - RHQ server returns an error of no connection, even when there are really no problems with the connection
Keywords:
Status: NEW
Alias: None
Product: RHQ Project
Classification: Other
Component: Agent
Version: 4.13
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: Nobody
QA Contact: Mike Foley
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-04-12 08:08 UTC by destros
Modified: 2020-04-27 01:30 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:



Description destros 2018-04-12 08:08:13 UTC
Description of problem:

The RHQ server and the machine on which the agent is installed are on the same network.
When I run the agent, the following errors appear in its log:

2018-04-12 10:39:04,412 INFO  [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.identify-version}Version=[RHQ 4.13.1], Build Number=[f37fe58], Build Date=[Jan 16, 2015 7:56 PM]
2018-04-12 10:39:04,771 INFO  [main] (org.rhq.enterprise.communications.ServiceContainer)- {ServiceContainer.global-concurrency-limit-disabled}Global concurrency limit has been disabled - there is no limit to the number of incoming commands allowed
2018-04-12 10:39:05,677 INFO  [main] (org.rhq.enterprise.communications.ServiceContainer)- {ServiceContainer.started}Service container started - ready to accept incoming commands
2018-04-12 10:39:05,693 INFO  [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.no-auto-detect}Server auto-detection is not enabled - starting the poller immediately
2018-04-12 10:40:05,771 WARN  [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.startup-registration-failed-retry}A server appears to be up, but the agent is not yet registered. Will wait again. wait counter=[4]
2018-04-12 10:40:05,771 WARN  [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.startup-registration-failed-retry}A server appears to be up, but the agent is not yet registered. Will wait again. wait counter=[3]
2018-04-12 10:40:05,771 WARN  [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.startup-registration-failed-retry}A server appears to be up, but the agent is not yet registered. Will wait again. wait counter=[2]
2018-04-12 10:40:05,771 WARN  [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.startup-registration-failed-retry}A server appears to be up, but the agent is not yet registered. Will wait again. wait counter=[1]
2018-04-12 10:40:05,771 WARN  [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.startup-registration-failed-retry}A server appears to be up, but the agent is not yet registered. Will wait again. wait counter=[0]
2018-04-12 10:40:05,771 FATAL [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.startup-error}The agent encountered an error during startup and must abort
org.rhq.core.clientapi.server.core.AgentRegistrationException: The agent cannot register with the server. Admin intervention needed!
	at org.rhq.enterprise.agent.AgentMain.start(AgentMain.java:744)
	at org.rhq.enterprise.agent.AgentMain.main(AgentMain.java:461)
2018-04-12 10:40:05,802 INFO  [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.shutting-down}Agent is being shut down...
2018-04-12 10:40:05,865 INFO  [main] (org.rhq.core.pc.PluginContainer)- Plugin container is already shut down.
2018-04-12 10:40:05,896 WARN  [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.failed-to-shutdown-component}Agent failed to shutdown component [Plugin Container]. Cause: java.lang.NullPointerException:null
2018-04-12 10:40:05,912 WARN  [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.server-shutdown-notification-failure}Agent failed to notify the server of the pending shutdown. Cause: org.rhq.enterprise.communications.command.server.AuthenticationException:Command failed to be authenticated!  This command will be ignored and not processed: Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.agent-name=stend-db.com, rhq.externalizable-strategy=AGENT, rhq.timeout=10000}]; params=[{targetInterfaceName=org.rhq.core.clientapi.server.core.CoreServerService, invocation=NameBasedInvocation[agentIsShuttingDown]}]
2018-04-12 10:40:05,912 INFO  [RHQ Server Polling Thread] (enterprise.communications.command.client.ServerPollingThread)- {ServerPollingThread.server-online}The server has come back online; client has been told to start sending commands again
2018-04-12 10:41:05,927 INFO  [main] (org.rhq.enterprise.communications.ServiceContainer)- {ServiceContainer.shutting-down}Service container shutting down...
2018-04-12 10:41:05,927 INFO  [main] (org.rhq.enterprise.communications.ServiceContainer)- {ServiceContainer.shutdown}Service container shut down - no longer accepting incoming commands
2018-04-12 10:41:05,927 INFO  [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.shut-down}Agent has been shut down
2018-04-12 10:41:05,927 FATAL [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.start-failure}Failed to start the agent
org.rhq.core.clientapi.server.core.AgentRegistrationException: The agent cannot register with the server. Admin intervention needed!
	at org.rhq.enterprise.agent.AgentMain.start(AgentMain.java:744)
	at org.rhq.enterprise.agent.AgentMain.main(AgentMain.java:461)


At the same time, the following errors are logged on the server:

10:39:17,490 ERROR [org.rhq.enterprise.communications.command.client.ClientCommandSenderTask] (http-/0.0.0.0:80-42) {ClientCommandSenderTask.send-failed}Failed to send command [Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.timeout=10000, rhq.send-throttle=true}]; params=[{invocation=NameBasedInvocation[ping], targetInterfaceName=org.rhq.enterprise.communications.Ping}]]. Cause: java.util.concurrent.TimeoutException:null. Cause: java.util.concurrent.TimeoutException
10:40:06,206 WARN  [org.rhq.enterprise.communications.command.server.CommandProcessor] (http-/0.0.0.0:80-41) {CommandProcessor.failed-authentication}Command failed to be authenticated!  This command will be ignored and not processed: Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.agent-name=stend-db.com, rhq.externalizable-strategy=AGENT, rhq.timeout=10000}]; params=[{targetInterfaceName=org.rhq.core.clientapi.server.core.CoreServerService, invocation=NameBasedInvocation[agentIsShuttingDown]}]
10:40:37,491 ERROR [org.rhq.enterprise.communications.command.client.ClientCommandSenderTask] (http-/0.0.0.0:80-42) {ClientCommandSenderTask.send-failed}Failed to send command [Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.timeout=20000, rhq.send-throttle=true}]; params=[{invocation=NameBasedInvocation[ping], targetInterfaceName=org.rhq.enterprise.communications.Ping}]]. Cause: java.util.concurrent.TimeoutException:null. Cause: java.util.concurrent.TimeoutException
10:41:37,492 WARN  [org.rhq.enterprise.server.core.CoreServerServiceImpl] (http-/0.0.0.0:80-42) org.rhq.core.clientapi.server.core.AgentRegistrationException: Server cannot ping the agent's endpoint. The agent's endpoint is probably invalid or there is a firewall preventing the server from connecting to the agent. Endpoint: socket://stend-db.com:16163/?rhq.communications.connector.rhqtype=agent&numAcceptThreads=1&maxPoolSize=303&clientMaxPoolSize=304&socketTimeout=60000&enableTcpNoDelay=true&backlog=200&generalizeSocketException=true

These errors clearly point to a connection problem. However, the server is reachable from the agent machine (ping, telnet to port 80), and the agent is reachable from the server (ping stend-db.com, telnet to port 16163 during agent startup).
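
The manual telnet checks described above can be scripted so that they are repeatable from both machines. The following is a minimal sketch in Python that tests plain TCP reachability only; the name `can_connect` is made up for illustration and is not part of RHQ. It connects to whatever address the local resolver returns for the host name, which is not necessarily the address the RHQ server's Java process uses.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and tries each returned address.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

For example, `can_connect("stend-db.com", 16163)` could be run on the server while the agent is starting, and `can_connect(<server host>, 80)` on the agent machine.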


Version-Release number of selected component (if applicable):
4.13.1 server and 4.13.1 agent.


How reproducible:
I'm not sure how this happened. It simply stopped working one day; before that, everything had worked fine for a few years.


Additional info:
I tried reconfiguring the agent with --fullcleanconfig and also reinstalling it completely, but nothing helped.
On the server side, I deleted the agent's information each time I reconfigured it.
Network administrators also confirmed that there is no equipment between the two machines that would terminate the connection.
OS server: CentOS Linux release 7.3.1611 (Core) kernel 3.10.0-514.6.1.el7.x86_64
OS agent: Windows Server 2008 R2 EE

Comment 1 destros 2018-04-26 09:57:43 UTC
Additional info:
We captured traffic to port 16163 on the server with tcpdump and saw unexpected outgoing packets when the agent started:
11:27:11.201822 IP rhq.aplana.com.34262 > 90.101.168.191.isp.timbrasil.com.br.16163: Flags [S], seq 751041033, win 29200, options [mss 1460,sackOK,TS val 2225543221 ecr 0,nop,wscale 7], length 0

There are three oddities:
1. The hosts file maps the agent's domain name to 192.168.101.90, not 191.168.101.90. However, a couple of weeks ago it was listed there as 191.168.101.90.
2. With tcpdump -n, the connection goes to IP 191.168.101.90, yet pinging the name 90.101.168.191.isp.timbrasil.com.br does not resolve to 191.168.101.90.
3. 90.101.168.191.isp.timbrasil.com.br is not the agent's legitimate domain name; its domain name is completely different (stend-db.com). It is not clear where this address came from. The connection is initiated by the RHQ server's Java process.
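
One observation about the host name in the capture: its first four labels, read in reverse, spell 191.168.101.90, the same address that tcpdump -n shows. Without -n, tcpdump converts addresses to names through a reverse DNS lookup, so the name may simply be the reverse (PTR) record of that address rather than something the RHQ server resolved. The following is a minimal sketch of that decoding, assuming the name follows the common reversed-octets scheme; `embedded_ipv4` is a hypothetical helper written for this illustration.

```python
import ipaddress
from typing import Optional


def embedded_ipv4(hostname: str) -> Optional[str]:
    """Decode an IPv4 address from a name like '90.101.168.191.isp.example'.

    Returns None if the first four labels are not reversed IPv4 octets.
    """
    labels = hostname.split(".")
    if len(labels) < 5:
        return None
    # Reverse the first four labels to get the octets in normal order.
    candidate = ".".join(reversed(labels[:4]))
    try:
        return str(ipaddress.IPv4Address(candidate))
    except ipaddress.AddressValueError:
        return None


captured = "90.101.168.191.isp.timbrasil.com.br"
print(embedded_ipv4(captured))  # 191.168.101.90
```

If this reading is right, the server is still connecting to the old address 191.168.101.90, and the unfamiliar name is only how tcpdump displays it.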

Adding the following entry to the hosts file does not help:
192.168.101.90 90.101.168.191.isp.timbrasil.com.br

Most interestingly, with --fullcleanconfig and the agent address specified as the IP 192.168.101.90 there are no errors, but when the address is specified by domain name the errors return.
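
Since registration by IP works and registration by name does not, it may be worth checking which addresses the name resolves to on the server host. The following is a minimal sketch that consults the operating system resolver only; the RHQ server's Java process may hold its own cached answer, so a clean result here does not rule out a stale entry inside the JVM. The names `resolve_ipv4`, `unexpected_addresses` and the `resolver` parameter are illustrative.

```python
import socket
from typing import Callable, List


def resolve_ipv4(name: str) -> List[str]:
    """Return the IPv4 addresses the OS resolver reports for name."""
    # gethostbyname_ex returns (canonical_name, aliases, ipv4_addresses).
    return socket.gethostbyname_ex(name)[2]


def unexpected_addresses(
    name: str,
    expected_ip: str,
    resolver: Callable[[str], List[str]] = resolve_ipv4,
) -> List[str]:
    """Return every resolved address for name that differs from expected_ip."""
    return [ip for ip in resolver(name) if ip != expected_ip]
```

Running `unexpected_addresses("stend-db.com", "192.168.101.90")` on the server would list any address other than the expected one. A result that included 191.168.101.90 would suggest that a stale record for the name is still being served somewhere.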

