Bug 534414 (RHQ-1212) - have the agent restart itself if it hits a FATAL error
Summary: have the agent restart itself if it hits a FATAL error
Keywords:
Status: CLOSED NEXTRELEASE
Alias: RHQ-1212
Product: RHQ Project
Classification: Other
Component: Agent
Version: unspecified
Hardware: All
OS: All
Priority: high
Severity: medium
Target Milestone: ---
Assignee: John Mazzitelli
QA Contact: Corey Welton
URL: http://jira.rhq-project.org/browse/RH...
Whiteboard:
Depends On:
Blocks:
Reported: 2008-12-04 06:46 UTC by John Mazzitelli
Modified: 2008-12-18 15:28 UTC (History)

Fixed In Version: 1.2
Doc Type: Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:



Description John Mazzitelli 2008-12-04 06:46:00 UTC
Rather than sitting dead in the water when it hits a FATAL error, the agent should at least restart itself after about 1 minute to try again. There is no sense in it sitting there doing nothing; let it try to recover.

In the case of RHQ-1211, I think that if the agent had just restarted and re-updated the plugins, it probably would have continued to operate fine.

Comment 1 John Mazzitelli 2008-12-04 21:33:04 UTC
If an agent registration exception occurs, we'll retry up to 5 times, 60 seconds apart. This hopefully helps when the agent's box is loaded down and the agent cannot respond to a ping fast enough. Hopefully, the agent box will be stable within 5 minutes. If not, we assume the error is permanent and give up: the agent process remains running and doesn't die, but it is effectively useless. We might want to consider doing something in this case; even a System.exit() would be appropriate here, because the agent can't do anything unless it is registered.

In addition, if the plugin container (PC) fails to start, we shut down and restart the agent up to 5 times. If we still can't get the PC to start, we assume the error is permanent and stop; the agent actually exits in this case.
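
For illustration only, the bounded retry behavior described in this comment could be sketched roughly as below. This is not the RHQ agent code: the names BoundedRetry, Result, run, maxAttempts and delayMillis are all hypothetical, and the sketch assumes "up to 5 times" means 5 attempts in total, which the comment does not spell out.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch of a bounded retry loop; not taken from the RHQ agent source.
final class BoundedRetry {

    /** Outcome of a run: whether startup eventually succeeded and how many attempts were made. */
    static final class Result {
        final boolean succeeded;
        final int attempts;

        Result(boolean succeeded, int attempts) {
            this.succeeded = succeeded;
            this.attempts = attempts;
        }
    }

    /**
     * Calls startup until it returns true or maxAttempts is reached,
     * sleeping delayMillis between attempts. A thrown exception counts
     * as a failed attempt.
     */
    static Result run(Callable<Boolean> startup, int maxAttempts, long delayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                if (Boolean.TRUE.equals(startup.call())) {
                    return new Result(true, attempt);
                }
            } catch (InterruptedException e) {
                // Someone asked us to stop: restore the flag and give up early.
                Thread.currentThread().interrupt();
                return new Result(false, attempt);
            } catch (Exception e) {
                // Assumption: any other failure is treated as retryable.
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return new Result(false, attempt);
                }
            }
        }
        // All attempts failed; the caller decides whether to idle or exit.
        return new Result(false, maxAttempts);
    }
}
```

With this sketch, the two cases above would differ only in what the caller does when run(startup, 5, 60000L) reports failure: the registration case currently leaves the process running, while the plugin container case exits.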

Comment 2 Corey Welton 2008-12-18 15:25:23 UTC
QA Verified, this occurs correctly:

An easy way to test this: start an agent, and then attempt to run another instance of the same agent. You'll get the fatal error about the address already being in use, and the agent will attempt to reload itself five times before finally giving up.



Comment 3 John Mazzitelli 2008-12-18 15:28:24 UTC
And if you kill that first one in the middle of those retries, your second agent should eventually start up successfully during one of those 5 retries.
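
The test scenario from these two comments can be imitated outside the agent with two plain sockets. The sketch below is only a stand-in under stated assumptions: PortConflictDemo, simulate, maxAttempts and killFirstAfter are hypothetical names, a java.net.ServerSocket on the loopback address plays the role of each agent's listener, and there is no delay between attempts.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.BindException;
import java.net.InetAddress;
import java.net.ServerSocket;

// Hypothetical stand-in for "start a second agent on the same port"; not RHQ code.
final class PortConflictDemo {

    /**
     * Binds a first listener on a free port, then repeatedly tries to bind a
     * second listener on the same port. The first listener is closed after the
     * failed attempt numbered killFirstAfter (use 0 to never close it).
     * Returns the attempt on which the second bind succeeded, or -1 if every
     * attempt failed.
     */
    static int simulate(int maxAttempts, int killFirstAfter) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // Port 0 asks the OS for any free port, standing in for the first agent.
        try (ServerSocket first = new ServerSocket(0, 50, loopback)) {
            int port = first.getLocalPort();
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try (ServerSocket second = new ServerSocket(port, 50, loopback)) {
                    return attempt; // the "second agent" finally started
                } catch (BindException e) {
                    // The address is already in use by the first listener.
                    if (attempt == killFirstAfter) {
                        first.close(); // simulate killing the first agent mid-retry
                    }
                }
            }
            return -1; // gave up after maxAttempts
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling simulate(5, 0) corresponds to leaving the first agent running, so every attempt should fail; simulate(5, 2) corresponds to killing the first agent partway through, after which a later attempt should bind successfully.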

Comment 4 Red Hat Bugzilla 2009-11-10 20:28:11 UTC
This bug was previously known as http://jira.rhq-project.org/browse/RHQ-1212


