+++ This bug was initially created as a clone of Bug #229650 +++

It would be a good idea to restart only the failed resource and its dependents instead of the whole service. For example:

<service>
    <oracle/>
    <ip>
        <script/>
    </ip>
</service>

If ip fails: stop script, stop ip, start ip, start script - without also restarting oracle. This would probably require the ability to disable the implied dependency ordering (I'm not sure about this).

Another question: if a resource's agent has the attribute maxinstances > 1, it can appear multiple times in the same service or in different services, so if it has to be stopped, all of its dependencies need to be calculated (but I think this can be reported in another bug, as it currently looks like this is avoided by not stopping the resource, as happens in clusterfs).

-- Additional comment from lhh on 2007-02-22 10:05 EST --

Ok, this will require additional configuration information, because currently, resources have an "all children alive" requirement. That is, if any parent resource has a child which is not in proper operational status, the parent is considered broken as well. This, of course, makes status checks easy: the service is broken if anything in the service is broken.

What we need is a way to have rgmanager ignore errors if they're immediately correctable, on a per-resource basis. This is like the "recovery" flag - however, the "recovery" operation is not allowed to affect *any* other resources - even if an explicit parent/child or other dependency exists. So, "recovery" will not solve the problem if a resource has children.

So, what we basically need is a special attribute which can be assigned to any/all resources which says (basically): "This subtree is not dependent on its siblings, and can be safely restarted without the parent being considered broken."

So, to expand on your example:

<service>
    <oracle/>
    <ip special_new_attr="1">
        <script/>
    </ip>
</service>

If the IP address winds up missing, the script is stopped and the IP is stopped, then restarted. If oracle winds up broken, *everything* is restarted.

To make them completely independent:

<service>
    <oracle special_new_attr="1"/>
    <ip special_new_attr="1">
        <script/>
    </ip>
</service>

This would work at all levels, too:

<service>
    <fs special_new_attr="1">
        <nfsexport>
            <nfsclient/>
        </nfsexport>
    </fs>
    <fs special_new_attr="1">
        <nfsexport>
            <nfsclient/>
        </nfsexport>
    </fs>
    <ip/>
</service>

The above example is an NFS service. The two file system trees can be restarted independently, but if the IP fails, everything must be restarted. (Note: this might not work for NFS services due to lock reclaim broadcasts, so it is here for illustrative purposes only.)

As for maxinstances: instances of shared resources are expected to be able to operate independently. That is, if one instance fails, it does not imply that they all have failed. If it does, something is broken in the resource agent and/or rgmanager. If it isn't possible to make the resource instances completely independent of one another, then the resource agent should not define maxinstances > 1.

-- Additional comment from lhh on 2007-02-22 10:31 EST --

<service>
    <fs name="one" special_new_attr="1">
        <nfsexport>
            <nfsclient/>
        </nfsexport>
    </fs>
    <fs name="two" special_new_attr="1">
        <nfsexport>
            <nfsclient/>
        </nfsexport>
        <script name="scooby"/>
    </fs>
    <ip/>
</service>

In this example, we add a script named "scooby" to the file system resource named "two". If "two" fails, all of its children must be restarted.
That is, the nfsexport (and nfsclient) and the script named "scooby" are all restarted. Similarly, adhering to current rgmanager behavior, if any of "two"'s children fails, everything up to and including "two" will be restarted. For example, if "scooby" fails, the nfsexport/nfsclient children of "two" will also be restarted - and so will "two" itself. However, the file system "one" will never be affected by any of "two"'s problems.

-- Additional comment from simone.gotti on 2007-02-23 05:06 EST --

Created an attachment (id=148657)
Patch that does not handle the special flag discussed in comments #1 and #2.

-- Additional comment from simone.gotti on 2007-02-23 05:07 EST --

Hi, as requested on IRC I attached an initial patch against CVS HEAD, done before the considerations in comments #1 and #2. This patch will ALWAYS stop resources up to the failed one, and then start again from the failed one. When the status check on a resource fails, the flags RF_NEEDSTOP and RF_NEEDSTART are set on that node. Two new functions, svc_condstart and svc_condstop, are added. The call to svc_stop was moved inside handle_recover_req, so it will first check the recovery policies and, if needed, call svc_condstop instead of svc_stop. Thanks!

-- Additional comment from lhh on 2007-02-23 14:11 EST --

Thanks Simone - I won't be able to look at this in detail for about a week or so. Sorry (in advance) for the delay!

NOTE: This is a true 'requirement' flag on 229650; we can not implement UI support until 229650 is MODIFIED.

As it currently stands, 229650 adds checks for the "__independent_subtree" flag on each node in the resource tree, but it is not fully functional.
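To make the conditional stop/start concrete, here is the tree from the original report, annotated with which nodes I would expect to be flagged when the ip resource fails a status check (the annotations are my reading of the patch description above, not taken from the patch itself):

<service>            <!-- not flagged; the service as a whole keeps running -->
    <oracle/>        <!-- not flagged; untouched by the recovery -->
    <ip>             <!-- RF_NEEDSTOP | RF_NEEDSTART: the failed resource -->
        <script/>    <!-- RF_NEEDSTOP | RF_NEEDSTART: child of the failed resource -->
    </ip>
</service>

svc_condstop would then stop the flagged nodes children-first (script, then ip), and svc_condstart would start them parents-first (ip, then script), matching the "stop script, stop ip, start ip, start script" sequence from the original report.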
A new attribute, __independent_subtree, is now present. It is attached to resources in the *tree* structure, but not in the <resources> section.
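To illustrate the placement, a minimal sketch (hedged: the ref= style follows the usual cluster.conf conventions for referencing globally defined resources, and the elided attributes are placeholders):

<rm>
    <resources>
        <!-- global definitions: __independent_subtree is NOT set here -->
        <fs name="One" .../>
    </resources>
    <service name="example">
        <!-- the attribute goes on the node where the resource appears in the tree -->
        <fs ref="One" __independent_subtree="1"/>
    </service>
</rm>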
Making this bug block the RHEL 5.1 release notes (requires_release_notes was set). Please post the necessary release note content in this bug. If no release note is required for this bug, please clear the requires_release_notes flag and remove BZ#222082 from "Bug 239594 blocks". Thanks!
This attribute has been added to the schema. There is no UI checkbox for this attribute, as it is an advanced option not typically used by the large majority of users. Lon: I think you would be a better candidate to write the release note blurb than me. Marking as NEEDINFO as a signal to you, Lon. Please mark MODIFIED after adding the release note comment.
This feature has very limited use cases, but in reality, the original comment hits it on the head, except for the example:

Normally, the entire service is restarted if any of its child resources fail. For most users, this is the correct behavior. Some, however, would prefer that only a specific set of otherwise independent resources be restarted in the case of a failure. If you place __independent_subtree="1" somewhere in the resource tree, only that resource and its children will be restarted in the case of a failure of that resource (or one of its children). If the independent subtree fails to restart, the entire service is restarted.

Example:

<service name="example">
    <fs name="One" __independent_subtree="1" ...>
        <nfsexport ...>
            <nfsclient .../>
        </nfsexport>
    </fs>
    <fs name="Two" ...>
        <nfsexport ...>
            <nfsclient .../>
        </nfsexport>
        <script name="Database" .../>
    </fs>
    <ip/>
</service>

In this example, we have two file system resources, named One and Two. If One fails, it is restarted. If Two fails, all components of the service are restarted. The important thing here is that neither Two nor any of its children (including Database) may depend on any resource provided by One or its children. In effect, this allows you to split a single, large service into smaller, independent pieces.

NOTES:
* Some resources can not be used in a service with independent subtrees (Samba resources, for example, since they require a specific service structure).
* Restarting a portion of a service is not considered a service restart. Repeated subtree restarts (due to repeated failures) are not acted upon or reported (aside from in system logs), so use this option with care.
* If you are unsure about whether this is a safe option for you to use, do not use it. ;)
Added devel and qa acks to get the errata filed.
This is an attribute within the cluster.conf file, found in /etc/cluster on a running node. The attribute can be added to any resource. To use it, you must add the attribute to the configuration file on one of the nodes, then propagate the file to the cluster - this is a documented procedure. I would imagine that rgmanager would need to be restarted, but Lon can comment on that if he sees this. Anyway, this should be enough for a release note. Hope this helped... Example from above:

<service name="example">
    <fs name="One" __independent_subtree="1" ...>
        <nfsexport ...>
            <nfsclient .../>
        </nfsexport>
    </fs>
    <fs name="Two" ...>
        <nfsexport ...>
            <nfsclient .../>
        </nfsexport>
        <script name="Database" .../>
    </fs>
    <ip/>
</service>
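For completeness, a sketch of the usual propagation step (hedged: the version numbers are placeholders, and the authoritative procedure is in the Cluster Suite documentation). Edit /etc/cluster/cluster.conf on one node and increment config_version so the cluster accepts the new copy:

<!-- /etc/cluster/cluster.conf -->
<cluster name="..." config_version="42">    <!-- was 41 before this edit -->
    ...
</cluster>

Then propagate it, e.g. with ccs_tool update /etc/cluster/cluster.conf, run on the node where the file was edited.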
further edit:

<snip>
Here, two file system resources are used: One and Two. If One fails, it is restarted without interrupting Two. If Two fails, all components (One, Two, and all of their children) are restarted. At no time may Two or its children depend on any resource provided by One.
</snip>
Right, this has a bug in it: it sometimes treats non-__independent_subtree parts of the tree as independent. I am testing a patch.
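To illustrate what that likely means (my reading of the one-line description above, so take this as a hedged sketch):

<service name="example">
    <fs name="One" __independent_subtree="1" .../>
    <script name="plain" .../>    <!-- no __independent_subtree anywhere above -->
</service>

A failure of "plain" should restart the entire service; the bug would sometimes restart only "plain"'s subtree as though it had been flagged.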
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0639.html