Created attachment 1248471 [details] agent log Description of problem: Agent is trying to register but fails and is stopped. This is happening only sometimes ~ 2 out of 10 tries. Version-Release number of selected component (if applicable): JON3.3.8.ER02 but probably on earlier versions as well How reproducible: 20% Steps to Reproduce: 1. install and start JON 3.3.8 2. try to register new agent Actual results: Agent is trying to register: A server appears to be up, but the agent is not yet registered. Will wait again. wait counter=[4] . . . but eventually fails: 2017-02-07 11:17:44,890 FATAL [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.startup-error}The agent encountered an error during startup and must abort Corresponding msgs from server.log: 11:17:44,975 WARN [org.rhq.enterprise.communications.command.server.CommandProcessor] (http-0.0.0.0:7080-8) {CommandProcessor.failed-authentication}Command failed to be authenticated! This command will be ignored and not processed: Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.agent-name=fbr-ag-rpm6.bc.jonqe.lab.eng.bos.redhat.com, rhq.externalizable-strategy=AGENT, rhq.timeout=10000}]; params=[{targetInterfaceName=org.rhq.core.clientapi.server.core.CoreServerService, invocation=NameBasedInvocation[agentIsShuttingDown]}] Expected results: Agent is registered correctly Additional info: It seems that the agent is not sleeping correctly ( ignoring 'will sleep for [60,000] milliseconds') and is trying to register again too soon. See timestamp of first try: 2017-02-07 11:17:44,886 DEBUG [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.waiting-for-server}The agent is waiting for the server - will sleep for [60,000] milliseconds and timestamp of the failure: 2017-02-07 11:17:44,890 FATAL [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.startup-error}The agent encountered an error during startup and must abort When I tried to start the agent again it failed because of missing sec token: "is attempting to re-register without a security token" Agent and server.log attached.
Created attachment 1248472 [details] server log
I think the real issue is the socket connection here. 2017-02-07 11:16:46,203 DEBUG [WorkerThread#0[10.16.23.93:35254]] (enterprise.communications.command.server.CommandProcessor)- {CommandProcessor.executed}Executed command - response is [Command Response: isSuccessful=[true]; command=[null]; results=[]; exception=[>> exception was null <<]] 2017-02-07 11:16:46,233 DEBUG [WorkerThread#0[10.16.23.93:35254]] (jboss.remoting.transport.socket.ServerThread)- WorkerThread#0[10.16.23.93:35254] closed socketWrapper: ServerSocketWrapper[Socket[addr=/10.16.23.93,port=35254,localport=16163].5969b696] The sleep does work correctly (it stops sleeping immediately after the client<->server communication is working), but the state of the socket is wrong -> it believes it is still alive.
Isn't this a race condition? On the main thread, the server is know to be up, but the agent needs to register. So it should be waiting for registration, instead it does wait for the server. However the server is up and it triggers the same path again without delay. It only tries 5 times and then stops. Right at the bottom we can see: 2017-02-07 11:19:25,267 DEBUG [RHQ Agent Registration Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.agent-registration-attempt}Agent will now attempt to register with the server [AgentRegistrationRequest: [name=[fbr-ag-rpm6.bc.jonqe.lab.eng.bos.redhat.com]; address=[10.16.23.91]; port=[16163]; remote-endpoint=[socket://10.16.23.91:16163/? So RHQ Agent Registration Thread starts registering after we already decided to stop. (The agent cannot register with the server. Admin intervention needed!). As a side note, it appears like the agent won't be able to register anyway: 2017-02-07 11:19:25,398 ERROR [RHQ Agent Registration Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.agent-registration-rejected}The server has rejected the agent registration request. Cause: [org.rhq.core.clientapi.server.core.AgentRegistrationException:The agent [fbr-ag-rpm6.bc.jonqe.lab.eng.bos.redhat.com] is attempting to re-register without a security token. Please consult an administrator to obtain the agent's proper security token and restart the agent with the option "-Drhq.agent.security-token=<the valid security token>". An administrator can find the agent's security token by navigating to the GUI page "Administration (Topology) > Agents" and drilling down to this specific agent. You will see the long security token string there. For more information, read: https://docs.jboss.org/author/display/RHQ/Agent+Registration]
It looks like server is starting, agent can ping it, but the register mechanism isn't ready yet to accept registration. Main thread doesn't verify if agent is able to reach the registration remote, it only check if is able to reach the server. Main thread will wait at most 60s [1] for registration to happen and continue. If registration takes more than 60s (possibly because registration is waiting for the registration remote to be available) it will continue, reach the "A server appears to be up, but the agent is not yet registered." code [2], as server is on, it won't wait thus stopping with the error after 5 tries. One solution (at this point) on top of my head is to sleep (additionally to waiting for server). [1] https://github.com/josejulio/rhq/blob/7daa7fd320b32127fabc21fe2d596d4df708390c/modules/enterprise/agent/src/main/java/org/rhq/enterprise/agent/AgentMain.java#L3912 [2] https://github.com/josejulio/rhq/blob/7daa7fd320b32127fabc21fe2d596d4df708390c/modules/enterprise/agent/src/main/java/org/rhq/enterprise/agent/AgentMain.java#L745
(In reply to Josejulio Martínez from comment #5) > It looks like server is starting, agent can ping it, but the register > mechanism isn't ready yet to accept registration. > The issue was visible even when the server was fully up. If you try to register more agents at the same time, some of them might succeed some of them might not so it doesn't seem to be related to server starting state.
Thanks Filip. In the logs it seems like the agent can't connect to the server (It waits up to 60 secs to finish registration process). Anyhow is still possible that once the server is available, registration thread is sleeping [1] before doing an attempt to register. If that's the case (i couldn't reproduce it) then this fix [2] should do the trick, waiting between checking if the server is available and the agent registered. [1] https://github.com/josejulio/rhq/blob/6899a0dcb8d1383b6ee227452408b856b21d37c9/modules/enterprise/agent/src/main/java/org/rhq/enterprise/agent/AgentMain.java#L1540 [2] https://github.com/rhq-project/rhq/pull/316
As there's no customer issue linked to this, could we postpone it from 3.3.9 until we can reliably reproduce it or customer hits it? I'm not really fond of adding additional sleeps and guesses to the code as we might actually break more than we fix in that case.
I'm not aware of any reliable reproducer and I'm not able to reproduce this again using our current test environment so it's not blocking our testing now. But it was happening quite often on our previous test env. I was hoping that additional sleep would be safe fix but if you think it's not, lets push it to 3.3.10 and possibly close it then unless I hit this issue again.
I will need to review the issue more in depth as it relates to customer cases. Perhaps this is an issue with concurrency limit at registration time? In any event, it may be safer to move this to 3.3.10 and spend more time investigating. Even if we cannot find customer cases that identify this issue it will still give us more time to properly review the code execution path and alternatives that do not equate to sleeping without knowing why. I'll put it in 3.3.10 and clear the triage flag. If anyone objects, please speak up.
I tried to reproduce this but could not, checked the code but nothing obvious there on what could be wrong. I could only make guesses but they are not safe (sleeps that might break something else) and not able to test if it fix the issue or not. @Filip, have you seen this issue lately?
Last time I saw this was on older infrastructure. I have never seen it on our new test infrastructure. We will discuss it during next triage call to suggest next steps.
Triage: Larry, Simeon, Filip - neither dev or qa are able to reproduce -> closing as works for me