Bug 534414 - (RHQ-1212) have the agent restart itself if it hits a FATAL error
Product: RHQ Project
Classification: Other
Component: Agent
Hardware: All All
Priority: high Severity: medium
Assigned To: John Mazzitelli
QA Contact: Corey Welton
Keywords: Improvement
Reported: 2008-12-04 01:46 EST by John Mazzitelli
Modified: 2008-12-18 10:28 EST
Fixed In Version: 1.2
Doc Type: Enhancement

Description John Mazzitelli 2008-12-04 01:46:00 EST
Rather than sitting dead in the water when it hits a FATAL error, the agent should at least restart itself after about 1 minute and try again. There is no sense in it sitting there doing nothing; let it try to recover.

In the case of RHQ-1211, I think if it just restarted, and re-updated the plugins, it probably would have continued to operate fine.
Comment 1 John Mazzitelli 2008-12-04 16:33:04 EST
If an agent registration exception occurs, we'll retry up to 5 times, 60 seconds apart. This hopefully helps when the agent's box is loaded down and the agent cannot respond to a ping fast enough. Hopefully the agent box will be stable within 5 minutes. If not, we assume the error is permanent and give up: the agent process remains running (it doesn't die), but it is effectively useless. We might want to consider doing something in this case; even a System.exit() would be appropriate here, because the agent can't do anything unless it is registered.

In addition, if the plugin container (PC) fails to start, we shut down and restart the agent up to 5 times. If we still can't get the PC to start, we assume the error is permanent and stop; the agent will actually exit in this case.
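
For illustration only, the bounded retry policy described in this comment could be sketched roughly as below. This is not the agent's actual code: the class name BoundedRetry, the method retry, and its parameters are hypothetical, and the sketch assumes that "up to 5 times" means 5 attempts in total.

```java
import java.util.concurrent.Callable;

// Hypothetical helper, not part of the RHQ agent: runs a task up to
// maxAttempts times, sleeping delayMillis between failed attempts.
final class BoundedRetry {

    private BoundedRetry() {
    }

    static <T> T retry(Callable<T> task, int maxAttempts, long delayMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (InterruptedException e) {
                // Do not retry when the thread was asked to stop.
                throw e;
            } catch (Exception e) {
                last = e;
                // Wait before the next attempt, but not after the final one.
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMillis);
                }
            }
        }
        // Every attempt failed: treat the error as permanent and surface it.
        throw last;
    }
}
```

Under those assumptions, a registration retry would look something like BoundedRetry.retry(task, 5, 60000L). The caller decides what "giving up" means when the final exception is thrown: leave the process running, or call System.exit() as suggested above.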
Comment 2 Corey Welton 2008-12-18 10:25:23 EST
QA Verified; this occurs correctly.

An easy way to test this: start an agent, and then attempt to run another instance of the same agent. You'll get the fatal error about the address already being in use, and the agent will attempt to reload itself five times before finally giving up.

Comment 3 John Mazzitelli 2008-12-18 10:28:24 EST
And if you kill that first one in the middle of those retries, your second agent should eventually start up successfully during one of those 5 retries.
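
The port conflict behind this test can be reproduced in isolation with plain java.net sockets. The sketch below does not use any agent code; the class PortConflictDemo and its method names are made up for illustration. It assumes only that the second agent fails because its listener port is already bound by the first.

```java
import java.io.IOException;
import java.net.BindException;
import java.net.ServerSocket;

// Hypothetical demo, independent of the RHQ agent.
final class PortConflictDemo {

    private PortConflictDemo() {
    }

    // Returns true if a second bind to the same port fails while the first is held.
    static boolean bindFailsWhileHeld() throws IOException {
        try (ServerSocket first = new ServerSocket(0)) { // port 0: the OS picks a free port
            int port = first.getLocalPort();
            try (ServerSocket second = new ServerSocket(port)) {
                return false; // unexpected: both sockets bound the same port
            } catch (BindException e) {
                return true; // the "address already in use" case
            }
        }
    }

    // Returns true if the port can be bound again once the first holder is closed.
    static boolean bindSucceedsAfterRelease() throws IOException {
        int port;
        try (ServerSocket first = new ServerSocket(0)) {
            port = first.getLocalPort();
        } // first is closed here, much like killing the first agent
        try (ServerSocket second = new ServerSocket(port)) {
            return second.isBound();
        }
    }
}
```

The first method corresponds to starting a second agent while the first is still running; the second corresponds to a retry that happens after the first agent has been killed.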
Comment 4 Red Hat Bugzilla 2009-11-10 15:28:11 EST
This bug was previously known as http://jira.rhq-project.org/browse/RHQ-1212
