Red Hat Bugzilla – Bug 534414
have the agent restart itself if it hits a FATAL error
Last modified: 2008-12-18 10:28:00 EST
Rather than the agent sitting dead in the water when it hits a FATAL error, it should at least restart itself after about 1 minute and try again. There is no sense in it sitting there doing nothing; let it try to recover.
In the case of RHQ-1211, I think if it just restarted, and re-updated the plugins, it probably would have continued to operate fine.
If an agent registration exception occurs, we'll retry up to 5 times, 60 seconds apart. This hopefully helps when the agent's box is loaded down and the agent cannot respond to a ping fast enough. Hopefully, the agent box will be stable within 5 minutes. If not, we assume the error is permanent and give up: the agent process remains running and doesn't die, but it is effectively useless. We might want to consider doing something in this case; even a System.exit() would be appropriate here, because the agent can't do anything unless it is registered.
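A minimal sketch of the bounded retry described above, for illustration only. It is not the agent's actual code: the class name, the `retry` helper, and its parameters are hypothetical, and the registration call is stood in for by a generic `Callable`. The demo uses a 10 ms delay in place of the 60 seconds mentioned in the comment.

```java
import java.util.concurrent.Callable;

class RegistrationRetrySketch {
    /**
     * Runs the attempt up to maxAttempts times, sleeping delayMillis between failures.
     * Returns the 1-based number of the attempt that succeeded, or -1 if every attempt failed.
     * (Hypothetical helper; not part of the RHQ agent API.)
     */
    static int retry(Callable<Boolean> attempt, int maxAttempts, long delayMillis) {
        for (int i = 1; i <= maxAttempts; i++) {
            try {
                if (attempt.call()) {
                    return i;
                }
            } catch (Exception e) {
                // Treat an exception the same as a failed attempt.
                System.err.println("attempt " + i + " failed: " + e.getMessage());
            }
            if (i < maxAttempts) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt(); // preserve the interrupt and give up
                    return -1;
                }
            }
        }
        return -1; // caller decides what "permanent failure" means, e.g. exiting the process
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Stand-in for registration: fails twice, then succeeds.
        int result = retry(() -> ++calls[0] >= 3, 5, 10L);
        System.out.println("succeeded on attempt " + result); // prints "succeeded on attempt 3"
    }
}
```

Returning a sentinel instead of silently idling leaves the give-up decision to the caller, which is where a System.exit() could go if the agent is useless without registration.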
In addition, if the plugin container (PC) fails to start, we shut down and restart the agent up to 5 times. If we still can't get the PC to start, we assume the error is permanent and stop; the agent will actually exit in this case.
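The shutdown-and-restart cycle could look roughly like the sketch below. The `Lifecycle` interface and `startWithRestarts` method are hypothetical names invented for this example, not the agent's real classes, and counting "up to 5 times" as five start cycles in total is an assumption. The method returns an exit status rather than calling System.exit() itself so that it stays testable.

```java
class PluginContainerRestartSketch {
    /** Hypothetical lifecycle interface; not the real RHQ agent API. */
    interface Lifecycle {
        void start() throws Exception;
        void shutdown();
    }

    /**
     * Tries to start the component, shutting it down after each failed start.
     * Returns 0 once a start succeeds, or 1 after maxCycles failed cycles
     * (the status a caller might pass to System.exit).
     */
    static int startWithRestarts(Lifecycle agent, int maxCycles) {
        for (int cycle = 1; cycle <= maxCycles; cycle++) {
            try {
                agent.start();
                return 0;
            } catch (Exception e) {
                System.err.println("start failed on cycle " + cycle + ": " + e.getMessage());
                agent.shutdown(); // release ports, threads, etc. before the next cycle
            }
        }
        return 1; // assume the error is permanent
    }

    public static void main(String[] args) {
        Lifecycle alwaysFails = new Lifecycle() {
            public void start() throws Exception {
                throw new Exception("plugin container failed to start");
            }
            public void shutdown() { }
        };
        int status = startWithRestarts(alwaysFails, 5);
        System.out.println("exit status " + status); // prints "exit status 1"
        // A real agent would call System.exit(status) at this point.
    }
}
```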
QA Verified, this occurs correctly:
Easy way to test this -- start an agent, and then attempt to run another instance of the same agent. You'll get the fatal error about the Address already being in use, and the agent will attempt to reload itself five times before finally giving up.
And if you kill that first one in the middle of those retries, your second agent should eventually start up successfully during one of those 5 retries.
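The "Address already in use" condition behind this test can be reproduced outside the agent with plain sockets. The sketch below does not use any agent code; the class and method names are made up for illustration. It binds an arbitrary free port, then shows that a second bind to the same port fails while the first socket is still open, which is the situation the second agent instance runs into.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.ServerSocket;

class AddressInUseSketch {
    /** Returns true if the port could be bound right now (the probe socket is closed again immediately). */
    static boolean canBind(int port) {
        try (ServerSocket probe = new ServerSocket(port)) {
            return true;
        } catch (IOException e) {
            // Typically a java.net.BindException: "Address already in use".
            return false;
        }
    }

    /** Holds a port open and reports whether a second bind to that same port is refused. */
    static boolean secondBindFails() {
        try (ServerSocket first = new ServerSocket(0)) { // 0 = let the OS pick a free port
            return !canBind(first.getLocalPort());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("second bind refused while first is up: " + secondBindFails());
    }
}
```

Once the first socket is closed (the equivalent of killing the first agent), a later bind attempt can succeed, which is why a retrying second instance eventually comes up.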
This bug was previously known as http://jira.rhq-project.org/browse/RHQ-1212