534414 – (RHQ-1212) have the agent restart itself if it hits a FATAL error

Summary: have the agent restart itself if it hits a FATAL error

Keywords:
Status:	CLOSED NEXTRELEASE
Alias:	RHQ-1212
Product:	RHQ Project
Classification:	Other
Component:	Agent
Sub Component:
Version:	unspecified
Hardware:	All
OS:	All
Priority:	high
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	John Mazzitelli
QA Contact:	Corey Welton
Docs Contact:
URL:	http://jira.rhq-project.org/browse/RH...
Whiteboard:
Depends On:
Blocks:

Reported:	2008-12-04 06:46 UTC by John Mazzitelli
Modified:	2008-12-18 15:28 UTC (History)
CC List:	0 users
Fixed In Version:	1.2
Doc Type:	Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:

Description John Mazzitelli 2008-12-04 06:46:00 UTC

Rather than sitting dead in the water when it hits a FATAL error, the agent should at least restart itself after about 1 minute to try again. There's no sense in it sitting there doing nothing; let it try to recover.

In the case of RHQ-1211, I think if it just restarted, and re-updated the plugins, it probably would have continued to operate fine.

Comment 1 John Mazzitelli 2008-12-04 21:33:04 UTC

If an agent registration exception occurs, we'll retry up to 5 times, 60 seconds apart. This should help when the agent's box is loaded down and the agent cannot respond to a ping fast enough. Hopefully the agent box will be stable within 5 minutes. If not, we assume the error is permanent and give up: the agent process keeps running and does not die, but it is effectively useless. We might want to consider doing something in this case; even a System.exit() would be appropriate here, because the agent can't do anything unless it is registered.

In addition, if the plugin container (PC) fails to start, we shut down and restart the agent up to 5 times. If we still can't get the PC to start, we assume the error is permanent and stop; the agent will actually exit in this case.
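
The bounded retry described above can be sketched roughly as follows. This is only an illustration, not the agent's actual code: the class name BoundedRetry, the method run, and its parameters are hypothetical, and it assumes "up to 5 times" means five attempts in total. The delay is a parameter so the 60 second wait can be shortened when experimenting.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper (not from the RHQ code base): retries a startup step a bounded number of times. */
final class BoundedRetry {
    private BoundedRetry() {}

    /**
     * Runs the step until it succeeds or maxAttempts attempts have been made.
     * Returns the 1-based number of the attempt that succeeded, or -1 if every attempt failed.
     */
    static int run(Callable<Void> step, int maxAttempts, long delayMillis)
            throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                step.call();
                return attempt; // success, stop retrying
            } catch (InterruptedException e) {
                throw e; // do not swallow interruption
            } catch (Exception e) {
                System.err.println("attempt " + attempt + " failed: " + e.getMessage());
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMillis); // wait before the next try
                }
            }
        }
        return -1; // assume the error is permanent and give up
    }
}
```

A caller would pass something like BoundedRetry.run(step, 5, 60000L) and, on a -1 result, decide whether to leave the process running or exit, which is the open question raised for the registration case.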

Comment 2 Corey Welton 2008-12-18 15:25:23 UTC

QA Verified, this occurs correctly:

An easy way to test this: start an agent, then attempt to run another instance of the same agent. You'll get the fatal error about the address already being in use, and the agent will attempt to reload itself five times before finally giving up.

Comment 3 John Mazzitelli 2008-12-18 15:28:24 UTC

And if you kill that first one in the middle of those retries, your second agent should eventually start up successfully during one of those 5 retries.
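
The port conflict behind this test can be reproduced in isolation with plain sockets. The sketch below is an assumption-laden illustration and does not use the agent's own communications layer: PortConflictDemo and tryBind are hypothetical names, and a loopback java.net.ServerSocket stands in for the agent's listener.

```java
import java.io.IOException;
import java.net.BindException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

/** Hypothetical stand-in for two agent instances competing for one listen port. */
final class PortConflictDemo {
    private PortConflictDemo() {}

    /**
     * Tries to bind a listener on the loopback address at the given port
     * (0 asks the OS for any free port). Returns the bound socket, or null
     * if the bind fails, for example because the address is already in use.
     */
    static ServerSocket tryBind(int port) throws IOException {
        ServerSocket socket = new ServerSocket();
        try {
            socket.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), port));
            return socket;
        } catch (BindException e) {
            socket.close(); // release the unbound socket before reporting failure
            return null;
        }
    }
}
```

Binding once with tryBind(0), then calling tryBind again with the port from getLocalPort(), should return null while the first socket is open; after the first socket is closed, the same call should succeed, which mirrors the second agent starting once the first one is killed.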

Comment 4 Red Hat Bugzilla 2009-11-10 20:28:11 UTC

This bug was previously known as http://jira.rhq-project.org/browse/RHQ-1212
