Bug 808648

Summary: Max retries should be configurable in org.rhq.enterprise.agent.AgentMain
Product: [JBoss] JBoss Operations Network
Reporter: David van Balen <dvanbale>
Component: Agent
Assignee: RHQ Project Maintainer <rhq-maint>
Status: CLOSED NOTABUG
QA Contact: Mike Foley <mfoley>
Severity: low
Docs Contact:
Priority: unspecified
Version: JON 3.0.0
CC: dvanbale, mazz, myarboro
Target Milestone: ---   
Target Release: JON 3.2.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-03-04 23:48:49 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description David van Balen 2012-03-31 02:27:59 UTC
Description of problem: When a JON Agent is started while the JON Server is unavailable, the main method of the AgentMain class (located in rhq-agent/lib/rhq-enterprise-agent-4.2.0.JON300.GA.jar) uses a hardcoded while(retries++ < 5) loop with Thread.sleep(60000L). As a result, the agent makes exactly 5 retry attempts over a span of about 5 minutes, and neither the count nor the delay can be changed.


Version-Release number of selected component (if applicable): JON 3.0.0.GA and Agent 4.2.0.JON300.GA


How reproducible: Always


Steps to Reproduce:
1. Make sure any JON server the agent is configured to register with is shut down.
2. Start the agent with java -jar test-rhq-enterprise-agent-4.2.0.JON300.GA.jar --install --launch=true
3. Wait more than 5 minutes before starting JON server
  
Actual results: Agent has given up retrying to register with the server by the time the server becomes available.


Expected results: There should be a way to tell the agent to keep retrying the connection for longer than the current 5 minutes (i.e., 5 tries with 1-minute pauses in between).
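
For illustration, the kind of configurability requested here could look roughly like the sketch below. It is not the actual AgentMain code: the class name, the retry helper, and both system property names are hypothetical, and the defaults simply mirror the 5 tries and 60000 ms sleep described above.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch only; this is NOT the real org.rhq.enterprise.agent.AgentMain.
final class ConfigurableRetry {
    // Made-up system property names, invented for this illustration.
    static final String MAX_RETRIES_PROP = "example.agent.startup.maxRetries";
    static final String RETRY_DELAY_PROP = "example.agent.startup.retryDelayMillis";

    /**
     * Calls attempt until it returns true or maxRetries attempts have been made,
     * sleeping delayMillis between attempts. Returns true if an attempt succeeded.
     */
    static boolean retry(BooleanSupplier attempt, int maxRetries, long delayMillis) {
        int retries = 0;
        while (retries++ < maxRetries) {
            if (attempt.getAsBoolean()) {
                return true;
            }
            if (retries < maxRetries) { // no point sleeping after the last attempt
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    return false;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Defaults mirror the hardcoded values reported in this bug.
        int maxRetries = Integer.getInteger(MAX_RETRIES_PROP, 5);
        long delayMillis = Long.getLong(RETRY_DELAY_PROP, 60000L);

        // Placeholder attempt that succeeds immediately; a real agent would
        // try to start up and reach the server here.
        boolean started = retry(() -> true, maxRetries, delayMillis);
        System.out.println("started=" + started
                + " (maxRetries=" + maxRetries + ", delayMillis=" + delayMillis + ")");
    }
}
```

With these hypothetical properties, passing -Dexample.agent.startup.maxRetries=20 to the JVM would allow 20 attempts instead of 5.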


Additional info: I get different results when starting the agent from the command line or as a service. In those cases, I get something like:

2012-03-30 22:11:11,730 INFO  [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.waiting-to-be-registered-begin}The agent will now wait until it has registered with the server...

Comment 1 David van Balen 2012-03-31 03:49:30 UTC
Looks like there was actually an error in the configuration file I was passing the installer (forgot to mention I was passing a config file by setting RHQ_AGENT_CMDLINE_OPTS="-c path-to-config-file"), although I'm not certain what the problem was. Now that I've corrected it, the agent continues to retry connecting to the server beyond the hardcoded 5 tries. Since the error I was seeing is likely to have been a non-recoverable error, I'm not sure this bug is still valid. I'll leave it up to the RHQ team to decide.

Comment 2 Mike Foley 2012-04-02 16:00:08 UTC
set priority per BZ triage 4/2/2012 (crocuh, loleary, mfoley, asantos)

Comment 3 mark yarborough 2013-03-04 23:13:34 UTC
Mazz, is this configurable?

Comment 4 John Mazzitelli 2013-03-04 23:46:46 UTC
I don't think the agent completely gives up here. Only under rare conditions will the agent ever stop retrying altogether (IIRC, only if it gets a registration error like "missing token, cannot register" or something similarly fatal).

The agent has been designed/implemented to stay running indefinitely - that is, it should wait indefinitely for a server to come online.

So, after your step #3 in the replication procedure, what happens when you DO start the server?

In other words, I would only consider a problem to exist here if you added a step #4 "Start the JON Server" and then the agent NEVER registers and connects.

So, what happens when you try that? And what does your agent log file say after starting the JON Server? You may have to wait a few additional seconds after starting the JON Server before the agent actually registers/connects.

Comment 5 John Mazzitelli 2013-03-04 23:48:49 UTC
(In reply to comment #1)
> Looks like there was actually an error in the configuration file I was
> passing the installer (forgot to mention I was passing a config file by
> setting RHQ_AGENT_CMDLINE_OPTS="-c path-to-config-file"), although I'm not
> certain what the problem was. Now that I've corrected it, the agent
> continues to retry connecting to the server beyond the hardcoded 5 tries.
> Since the error I was seeing is likely to have been a non-recoverable error,
> I'm not sure this bug is still valid. I'll leave it up to the RHQ team to
> decide.

Ahh.. I didn't read this comment.

Right - this is what I was saying in my last comment. The agent will always retry UNLESS there is some fatal error at startup that would prevent the agent from ever being able to register/connect. A bad startup config might be such a case.
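
The behavior described here (retry indefinitely, stop only on a fatal error) can be sketched roughly as follows. This is an illustration under stated assumptions, not the agent's real implementation: RetryUntilFatal, FatalStartupException, and the Attempt interface are all hypothetical names.

```java
// Hypothetical sketch only; none of these types exist in the RHQ agent.
final class RetryUntilFatal {
    /** Made-up marker for errors that retrying cannot fix (e.g. a bad startup config). */
    static final class FatalStartupException extends RuntimeException {
        FatalStartupException(String message) {
            super(message);
        }
    }

    /** Made-up hook standing in for a single registration attempt. */
    interface Attempt {
        /** Returns true once registered; throws FatalStartupException if retrying is pointless. */
        boolean tryOnce();
    }

    /**
     * Retries without an upper bound until an attempt succeeds, and returns the
     * number of attempts made. A FatalStartupException thrown by the attempt
     * propagates to the caller and ends the loop.
     */
    static int registerWithRetries(Attempt attempt, long delayMillis) {
        int attempts = 0;
        while (true) {
            attempts++;
            if (attempt.tryOnce()) {
                return attempts;
            }
            try {
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                throw new IllegalStateException("interrupted while waiting to retry", e);
            }
        }
    }

    public static void main(String[] args) {
        // Simulated server that becomes available on the third attempt.
        int[] calls = {0};
        int attempts = registerWithRetries(() -> ++calls[0] >= 3, 10L);
        System.out.println("registered after " + attempts + " attempts");
    }
}
```

The key design point is that only the attempt itself decides what counts as fatal; an unreachable server is treated as a normal, retryable failure.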

I do not consider this bug to be valid - I would close it.