Red Hat Bugzilla – Bug 472968
[RFE] - Ability to define cluster resources as critical or non-critical within a service.
Last modified: 2010-01-21 15:18:07 EST
The current ability to establish a separate recovery policy on a per resource basis is not enough due to limitations associated with the recovery policy options.
For example, given a service containing an “Oracle listener” resource and an “Oracle instance” resource, configure the “Oracle listener” as non-critical and the “Oracle instance” as critical. This would have the following desired effect:
If the “Oracle listener” resource fails, it attempts to restart a defined number of times; if unsuccessful, it fails (stops trying) and alerts the administrator WITHOUT failing or relocating the service it is a member of, thus allowing the “Oracle instance” to continue operating.
If the “Oracle instance” resource fails, because it is critical, the cluster immediately tries to restart the entire service and, if this action fails, relocates the service to a different node in the cluster.
The example identifies an issue with the Oracle application "agent" or "resource", but the feature I am requesting is, in general, more granular control over critical and non-critical resources within a cluster, not just for the Oracle resources.
For a critical resource, I would like to see the resource attempt to restart itself (having the flexibility to define the number of allowed failures over a definable period of time would be a great feature). By "restarting itself" I mean that the individual resource with the problem attempts to restart, not the entire service. If restarting does not resolve the problem, i.e. the resource fails immediately after x restart attempts, or even if the restarts succeed but the critical resource "faults" y times over z period of time, the service group is then relocated to another node in an attempt to revive the critical resource.
It seems at present the only way to define a "non-critical" resource is to give that particular resource a "restart" recovery policy; the other two available recovery policies ultimately cause a failed resource to affect the entire service group it is part of, rather than just the individual resource itself. Once again, I am asking for a somewhat tiered approach to the "non-critical" resource definition. For a non-critical resource, if a fault is detected, the cluster should attempt to restart the resource; if the restart fails x number of times, or the resource faults y number of times over z period of time, then, since the resource has been identified as non-critical, the cluster "disables" just the individual faulty resource, leaving the rest of the service up and running while alerting the administrator.
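The requested escalation path can be sketched in a few lines. This is an illustrative sketch only, not rgmanager code: `handle_fault`, the `restart()`/`relocate()`/`disable()` calls, and `MAX_RESTARTS` (the "x" above) are hypothetical names standing in for the agent machinery.

```python
MAX_RESTARTS = 3  # the "x" from the request: in-place restart attempts per fault

def handle_fault(resource, service):
    """Recover a faulted resource according to its criticality.

    Critical resources escalate to a service relocation when local
    restarts are exhausted; non-critical resources are simply disabled,
    leaving the rest of the service running.
    """
    for _ in range(MAX_RESTARTS):
        if resource.restart():
            return "restarted"        # resource revived in place
    if resource.critical:
        service.relocate()            # give up locally, move the whole service
        return "relocated"
    service.disable(resource)         # drop only this resource, keep service up
    return "disabled"
```

The "y faults over z seconds" part of the request is omitted here for brevity; it would sit in front of this function, counting faults in a sliding time window before escalating.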
Certain parts of this can be implemented rather simply. We already record the last operation of each type internally; it's simply a matter of recording multiple failures (and we already have a general purpose mechanism to do this).
* Add restart counters to each node in the resource tree;
* If a node has a restart counter structure, follow the standard max_restarts and restart_expire_time semantics for that resource.
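The counter described above can be sketched as follows. This is a minimal illustration of the max_restarts / restart_expire_time semantics (count only restarts that occurred within the expiry window), not the internal rgmanager data structure:

```python
import time

class RestartCounter:
    """Per-resource restart accounting with an expiry window.

    A restart "counts" only while it is younger than restart_expire_time;
    once more than max_restarts unexpired restarts have accumulated,
    recovery should escalate (relocate or disable, per criticality).
    """

    def __init__(self, max_restarts, restart_expire_time):
        self.max_restarts = max_restarts
        self.expire = restart_expire_time
        self.events = []  # timestamps of recorded restarts

    def record(self, now=None):
        now = time.time() if now is None else now
        # Drop restarts older than the expiry window, then record this one.
        self.events = [t for t in self.events if now - t < self.expire]
        self.events.append(now)

    def exceeded(self):
        return len(self.events) > self.max_restarts
```

With max_restarts=2 and restart_expire_time=60, three failures in quick succession trip the counter, while two failures 100 seconds apart do not.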
Part of the feature request works today using __independent_subtree - what is missing is a defined number of restarts of a resource before giving up and restarting the service.
__independent_subtree treats a node and all of its children as 'non-critical' - meaning they can restart independently of a service restart.
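For reference, a minimal cluster.conf fragment sketching what works today. The service and script names are illustrative; __independent_subtree="1" is the existing rgmanager attribute, and note there is no per-subtree restart limit here - that is the missing piece:

```xml
<service name="oracle_svc" recovery="restart">
  <!-- Critical part: a fault here recovers (restarts/relocates) the whole service. -->
  <script name="oracle_instance" file="/etc/init.d/oracle-instance"/>
  <!-- Non-critical subtree: restarts independently of the service,
       but with no defined restart limit before escalation. -->
  <script name="oracle_listener" file="/etc/init.d/oracle-listener"
          __independent_subtree="1"/>
</service>
```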
Development Management has reviewed and declined this request. You may appeal
this decision by reopening this request.