472968 – [RFE] - Ability to define cluster resources as critical or non-critical within a service.

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Cluster Suite
Classification:	Retired
Component:	rgmanager
Sub Component:
Version:	4
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Lon Hohberger
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2008-11-25 20:52 UTC by Stuart R. Kirk
Modified:	2010-01-21 20:18 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Clones:	493660
Environment:
Last Closed:	2010-01-21 20:18:07 UTC
Embargoed:

Description Stuart R. Kirk 2008-11-25 20:52:49 UTC

The current ability to establish a separate recovery policy on a per resource basis is not enough due to limitations associated with the recovery policy options.

For example, given a resource of an “Oracle listener” and another resource as an “Oracle instance”, configure the “Oracle listener” as non-critical and the “Oracle instance” as critical within the same service.  This would have the following desired effect:

If the “Oracle listener” resource fails, it attempts to restart a defined number of times. If unsuccessful, it then fails (stops trying) and alerts the administrator WITHOUT failing or relocating the service that it is a member of, thus allowing the “Oracle instance” to continue operating.

If the “Oracle instance” resource fails, then because it is critical, the cluster immediately tries to restart the entire service and, if this action fails, relocates the service to a different node in the cluster.

The example identifies an issue with the Oracle application "agent" or "resource", but the feature I am requesting is, in general, more granular control of non-critical and critical resources within a cluster, not just for the Oracle resource.

For a critical resource, I would like the resource to attempt to restart itself (the flexibility to define the number of failures over a definable period of time would be a great feature). By restarting itself, I mean that the individual resource with the problem is restarted, not the entire service. If restarting does not resolve the problem (i.e. the resource fails immediately) after x restart attempts, or if the restarts succeed but the critical resource "faults" y times over a period z, the service group is then relocated to another node in an attempt to revive the critical resource.

It seems that at present the only way to define a "non-critical" resource is to give that resource a "restart" recovery policy; the other two available recovery policies would seem to ultimately cause a resource to affect the entire service group it is part of rather than just the individual resource itself. Once again, I am asking for a somewhat tiered approach to the "non-critical" resource definition. If a fault is detected in a non-critical resource, I would like the cluster to attempt to restart that resource. If the restart fails x times, or the resource faults y times over a period z, then, because the resource has been identified as non-critical, the cluster "disables" just the faulty resource, leaving the rest of the service up and running, and alerts the administrator.
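
The tiered behaviour requested above can be summarised as a small decision function. This is only a sketch of the requested semantics, not rgmanager code: the names `Action` and `next_action` and the `threshold_exceeded` flag are hypothetical, and the flag stands in for "failed to restart x times, or faulted y times over period z".

```python
from enum import Enum


class Action(Enum):
    RESTART_RESOURCE = "restart only the failed resource"
    RELOCATE_SERVICE = "relocate the whole service to another node"
    DISABLE_RESOURCE = "disable only the failed resource and alert the administrator"


def next_action(critical: bool, threshold_exceeded: bool) -> Action:
    """Pick the recovery step for a failed resource (hypothetical policy)."""
    if not threshold_exceeded:
        # Both tiers start by restarting just the faulty resource.
        return Action.RESTART_RESOURCE
    if critical:
        # A critical resource that keeps failing takes the service with it.
        return Action.RELOCATE_SERVICE
    # A non-critical resource is disabled; the rest of the service stays up.
    return Action.DISABLE_RESOURCE
```

With this policy, a failing "Oracle listener" marked non-critical would end in `DISABLE_RESOURCE`, while a failing "Oracle instance" marked critical would end in `RELOCATE_SERVICE`.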

Comment 1 Lon Hohberger 2008-12-08 22:17:19 UTC

Certain parts of this can be implemented rather simply.  We already record the last operation of each type internally; it's simply a matter of recording multiple failures (and we already have a general purpose mechanism to do this).

 * Add restart counters to each node in the resource tree,
 * If there is a restart counter structure, follow standard max_restarts and restart_expire_time for that resource.
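
A per-resource restart counter of the kind described in the two points above might look roughly like the following. This is an illustrative sketch in Python, not the rgmanager implementation: the `RestartCounter` class is hypothetical, and it assumes that `max_restarts` is the number of restarts tolerated within a sliding window of `restart_expire_time` seconds.

```python
from collections import deque


class RestartCounter:
    """Hypothetical per-resource restart counter with an expiry window."""

    def __init__(self, max_restarts: int, restart_expire_time: float):
        self.max_restarts = max_restarts
        self.restart_expire_time = restart_expire_time
        self._restarts = deque()  # timestamps of recent restarts, oldest first

    def record_restart(self, now: float) -> bool:
        """Record a restart at time `now` (seconds).

        Returns True when more than max_restarts restarts fall inside the
        window, i.e. when the caller should stop restarting this resource
        and escalate.
        """
        # Forget restarts that are older than the expiry window.
        while self._restarts and now - self._restarts[0] >= self.restart_expire_time:
            self._restarts.popleft()
        self._restarts.append(now)
        return len(self._restarts) > self.max_restarts
```

A resource tree node without such a counter would simply keep its existing behaviour; only nodes that carry one would apply the limit.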

Comment 2 Lon Hohberger 2009-02-27 22:28:45 UTC

Part of the feature request works today using __independent_subtree - what is missing is a defined number of restarts of a resource before giving up and restarting the service.

__independent_subtree treats a node and all of its children as 'non-critical' - meaning they can restart independently of a service restart.
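
The effect of that flag on restart scope can be pictured with a toy resource tree. The sketch below is an assumption-laden model, not rgmanager's data structures: `Node`, `add`, and `restart_scope` are hypothetical names, and the tree layout in the usage note is invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """Hypothetical resource tree node."""
    name: str
    independent_subtree: bool = False
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def restart_scope(failed: Node) -> Node:
    """Return the root of the subtree to restart when `failed` faults."""
    node = failed
    while node.parent is not None:
        if node.independent_subtree:
            # The failed node sits inside an independent subtree:
            # restart that subtree only, not the service.
            return node
        node = node.parent
    # No independent ancestor found: restart the whole service.
    return node
```

For example, with a service containing an "oracle-instance" child and an "oracle-listener" child flagged as independent, a listener fault yields the listener node as the restart scope, while an instance fault yields the service root.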

Comment 4 RHEL Program Management 2010-01-21 20:18:07 UTC

Development Management has reviewed and declined this request.  You may appeal
this decision by reopening this request.