Bug 1420060 - Agent sometimes fails to register
Summary: Agent sometimes fails to register
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: JBoss Operations Network
Classification: JBoss
Component: Agent
Version: JON 3.3.8
Hardware: Unspecified
OS: Unspecified
medium
medium
Target Milestone: ER01
: JON 3.3.10
Assignee: Josejulio Martínez
QA Contact: Mike Foley
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2017-02-07 16:54 UTC by Filip Brychta
Modified: 2017-11-30 15:37 UTC (History)
6 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-11-30 15:37:01 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
agent log (55.30 KB, text/plain)
2017-02-07 16:54 UTC, Filip Brychta
no flags Details
server log (320.54 KB, text/plain)
2017-02-07 16:54 UTC, Filip Brychta
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1152154 0 high CLOSED have option to allow agent to wait forever for registration server to come online 2021-02-22 00:41:40 UTC

Internal Links: 1152154

Description Filip Brychta 2017-02-07 16:54:14 UTC
Created attachment 1248471 [details]
agent log

Description of problem:
Agent is trying to register but fails and is stopped. This is happening only sometimes ~ 2 out of 10 tries.

Version-Release number of selected component (if applicable):
JON3.3.8.ER02 but probably on earlier versions as well

How reproducible:
20%

Steps to Reproduce:
1. install and start JON 3.3.8
2. try to register new agent


Actual results:
Agent is trying to register:
A server appears to be up, but the agent is not yet registered. Will wait again. wait counter=[4]
.
.
.

but eventually fails:
2017-02-07 11:17:44,890 FATAL [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.startup-error}The agent encountered an error during startup and must abort

Corresponding msgs from server.log:
11:17:44,975 WARN  [org.rhq.enterprise.communications.command.server.CommandProcessor] (http-0.0.0.0:7080-8) {CommandProcessor.failed-authentication}Command failed to be authenticated!  This command will be ignored and not processed: Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.agent-name=fbr-ag-rpm6.bc.jonqe.lab.eng.bos.redhat.com, rhq.externalizable-strategy=AGENT, rhq.timeout=10000}]; params=[{targetInterfaceName=org.rhq.core.clientapi.server.core.CoreServerService, invocation=NameBasedInvocation[agentIsShuttingDown]}]



Expected results:
Agent is registered correctly

Additional info:
It seems that the agent is not sleeping correctly ( ignoring 'will sleep for [60,000] milliseconds') and is trying to register again too soon.

See timestamp of first try:
2017-02-07 11:17:44,886 DEBUG [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.waiting-for-server}The agent is waiting for the server - will sleep for [60,000] milliseconds

and timestamp of the failure:
2017-02-07 11:17:44,890 FATAL [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.startup-error}The agent encountered an error during startup and must abort


When I tried to start the agent again it failed because of missing sec token:
"is attempting to re-register without a security token"

Agent and server.log attached.

Comment 1 Filip Brychta 2017-02-07 16:54:47 UTC
Created attachment 1248472 [details]
server log

Comment 2 Michael Burman 2017-04-04 12:59:09 UTC
I think the real issue is the socket connection here.

2017-02-07 11:16:46,203 DEBUG [WorkerThread#0[10.16.23.93:35254]] (enterprise.communications.command.server.CommandProcessor)- {CommandProcessor.executed}Executed command - response is [Command Response: isSuccessful=[true]; command=[null]; results=[]; exception=[>> exception was null <<]]
2017-02-07 11:16:46,233 DEBUG [WorkerThread#0[10.16.23.93:35254]] (jboss.remoting.transport.socket.ServerThread)- WorkerThread#0[10.16.23.93:35254] closed socketWrapper: ServerSocketWrapper[Socket[addr=/10.16.23.93,port=35254,localport=16163].5969b696]

The sleep does work correctly (it stops sleeping immediately after the client<->server communication is working), but the state of the socket is wrong -> it believes it is still alive.

Comment 4 Josejulio Martínez 2017-08-17 22:42:56 UTC
Isn't this a race condition?

On the main thread, the server is know to be up, but the agent needs to register. So it should be waiting for registration, instead it does wait for the server.
However the server is up and it triggers the same path again without delay.
It only tries 5 times and then stops.

Right at the bottom we can see:
2017-02-07 11:19:25,267 DEBUG [RHQ Agent Registration Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.agent-registration-attempt}Agent will now attempt to register with the server [AgentRegistrationRequest: [name=[fbr-ag-rpm6.bc.jonqe.lab.eng.bos.redhat.com]; address=[10.16.23.91]; port=[16163]; remote-endpoint=[socket://10.16.23.91:16163/?

So RHQ Agent Registration Thread starts registering after we already decided to stop. (The agent cannot register with the server. Admin intervention needed!).

As a side note, it appears like the agent won't be able to register anyway:

2017-02-07 11:19:25,398 ERROR [RHQ Agent Registration Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.agent-registration-rejected}The server has rejected the agent registration request. Cause: [org.rhq.core.clientapi.server.core.AgentRegistrationException:The agent [fbr-ag-rpm6.bc.jonqe.lab.eng.bos.redhat.com] is attempting to re-register without a security token. Please consult an administrator to obtain the agent's proper security token and restart the agent with the option "-Drhq.agent.security-token=<the valid security token>". An administrator can find the agent's security token by navigating to the GUI page "Administration (Topology) > Agents" and drilling down to this specific agent. You will see the long security token string there. For more information, read: https://docs.jboss.org/author/display/RHQ/Agent+Registration]

Comment 5 Josejulio Martínez 2017-08-18 16:30:44 UTC
It looks like server is starting, agent can ping it, but the register mechanism isn't ready yet to accept registration.

Main thread doesn't verify if agent is able to reach the registration remote, it only check if is able to reach the server.
Main thread will wait at most 60s [1] for registration to happen and continue.
If registration takes more than 60s (possibly because registration is waiting for the registration remote to be available) it will continue, reach the "A server appears to be up, but the agent is not yet registered." code [2], as server is on, it won't wait thus stopping with the error after 5 tries.

One solution (at this point) on top of my head is to sleep (additionally to waiting for server).

[1] https://github.com/josejulio/rhq/blob/7daa7fd320b32127fabc21fe2d596d4df708390c/modules/enterprise/agent/src/main/java/org/rhq/enterprise/agent/AgentMain.java#L3912
[2] https://github.com/josejulio/rhq/blob/7daa7fd320b32127fabc21fe2d596d4df708390c/modules/enterprise/agent/src/main/java/org/rhq/enterprise/agent/AgentMain.java#L745

Comment 6 Filip Brychta 2017-08-21 08:28:36 UTC
(In reply to Josejulio Martínez from comment #5)
> It looks like server is starting, agent can ping it, but the register
> mechanism isn't ready yet to accept registration.
> 
The issue was visible even when the server was fully up. If you try to register more agents at the same time, some of them might succeed some of them might not so it doesn't seem to be related to server starting state.

Comment 7 Josejulio Martínez 2017-08-21 17:00:32 UTC
Thanks Filip.

In the logs it seems like the agent can't connect to the server (It waits up to 60 secs to finish registration process).
Anyhow is still possible that once the server is available, registration thread is sleeping [1] before doing an attempt to register. 
If that's the case (i couldn't reproduce it) then this fix [2] should do the trick, waiting between checking if the server is available and the agent registered.


[1] https://github.com/josejulio/rhq/blob/6899a0dcb8d1383b6ee227452408b856b21d37c9/modules/enterprise/agent/src/main/java/org/rhq/enterprise/agent/AgentMain.java#L1540
[2] https://github.com/rhq-project/rhq/pull/316

Comment 8 Michael Burman 2017-08-24 15:09:46 UTC
As there's no customer issue linked to this, could we postpone it from 3.3.9 until we can reliably reproduce it or customer hits it? 

I'm not really fond of adding additional sleeps and guesses to the code as we might actually break more than we fix in that case.

Comment 10 Filip Brychta 2017-08-25 11:34:53 UTC
I'm not aware of any reliable reproducer and I'm not able to reproduce this again using our current test environment so it's not blocking our testing now. But it was happening quite often on our previous test env. I was hoping that additional sleep would be safe fix but if you think it's not, lets push it to 3.3.10 and possibly close it then unless I hit this issue again.

Comment 11 Larry O'Leary 2017-08-25 13:04:31 UTC
I will need to review the issue more in depth as it relates to customer cases.

Perhaps this is an issue with concurrency limit at registration time?

In any event, it may be safer to move this to 3.3.10 and spend more time investigating. Even if we cannot find customer cases that identify this issue it will still give us more time to properly review the code execution path and alternatives that do not equate to sleeping without knowing why.

I'll put it in 3.3.10 and clear the triage flag. If anyone objects, please speak up.

Comment 14 Josejulio Martínez 2017-11-17 19:29:51 UTC
I tried to reproduce this but could not, checked the code but nothing obvious there on what could be wrong. 
I could only make guesses but they are not safe (sleeps that might break something else) and not able to test if it fix the issue or not.

@Filip, have you seen this issue lately?

Comment 15 Filip Brychta 2017-11-20 10:02:12 UTC
Last time I saw this was on older infrastructure. I have never seen it on our new test infrastructure. We will discuss it during next triage call to suggest next steps.

Comment 16 Filip Brychta 2017-11-30 15:37:01 UTC
Triage: Larry, Simeon, Filip - neither dev or qa are able to reproduce -> closing as works for me


Note You need to log in before you can comment on or make changes to this bug.