Bug 239594 - Restart only the failed resource and its dependencies instead of the whole service.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: system-config-cluster
Version: 5.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Jim Parsons
QA Contact:
URL:
Whiteboard:
Depends On: 229650
Blocks: 222082
 
Reported: 2007-05-09 19:02 UTC by Lon Hohberger
Modified: 2009-04-16 22:34 UTC (History)
3 users

Fixed In Version: RHBA-2007-0639
Doc Type: Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-11-07 16:44:58 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2007:0639 0 normal SHIPPED_LIVE system-config-cluster bug fix update 2007-10-30 16:19:41 UTC

Description Lon Hohberger 2007-05-09 19:02:05 UTC
+++ This bug was initially created as a clone of Bug #229650 +++

It would be a good idea to restart only the failed resource and its dependencies
instead of the whole service.

For example:

<service>
	<oracle/>
	<ip>
		<script/>
	</ip>
</service>

If ip fails: stop script, stop ip, start ip, start script, without also
restarting oracle.

This would probably require the ability to disable the implied dependency
ordering (I'm not sure about this).

Another question: if a resource's agent declares the attribute maxinstances
> 1, the resource can appear multiple times in the same service or in different
services, so when it has to be stopped, all of its dependencies need to be
calculated. (I think this can be filed as a separate bug report, since for now
it looks like the issue is avoided by not stopping the resource, as happens
with clusterfs.)

-- Additional comment from lhh on 2007-02-22 10:05 EST --
Ok, this will require additional configuration information, because currently,
resources have an "all children alive" requirement.  That is, if any parent
resource has a child which is not in proper operational status, the parent is
considered broken as well.

This, of course, makes status checks easy: the service is broken if anything in
the service is broken.

What we need is a way to have rgmanager ignore errors, on a per-resource basis,
if they're immediately correctable.  This is like the "recovery" flag -
however, the "recovery" operation is not allowed to affect *any* other resources
 - even if an explicit parent/child or other dependency exists.  So, "recovery"
will not solve the problem if a resource has children.

So, what we basically need is a special attribute which can be assigned to
any/all resources which says (basically):

"This subtree is not dependent on its siblings, and can be safely restarted
without the parent being considered broken".

So, to expand on your example:

<service>
	<oracle />
	<ip special_new_attr="1">
		<script/>
	</ip>
</service>

If the IP address winds up missing, the script is stopped and the IP is stopped,
then restarted.  If oracle winds up broken, *everything* is restarted.  To make
them completely independent:

<service>
	<oracle special_new_attr="1"/>
	<ip special_new_attr="1">
		<script/>
	</ip>
</service>

This would work at all levels, too:

<service>
        <fs special_new_attr="1">
                <nfsexport>
                        <nfsclient/>
                </nfsexport>
        </fs>
        <fs special_new_attr="1">
                <nfsexport>
                        <nfsclient/>
                </nfsexport>
        </fs>
        <ip/>
</service>

The above example is an NFS service.  The two file system trees can be restarted
independently, but if the IP fails, everything must be restarted.  (Note: this
might not work for NFS services due to lock reclaim broadcasts, so it is just
here for illustrative purposes.)

As for maxinstances, instances of shared resources are expected to be able to
operate independently.  That is, if one instance fails, it does not imply that
they all have failed.  If it does, something is broken in the resource agent
and/or rgmanager.  If it isn't possible to make the resource instances
completely independent of one-another, then the resource agent should not define
maxinstances > 1.

-- Additional comment from lhh on 2007-02-22 10:31 EST --
<service>
        <fs name="one" special_new_attr="1">
                <nfsexport>
                        <nfsclient/>
                </nfsexport>
        </fs>
        <fs name="two" special_new_attr="1">
                <nfsexport>
                        <nfsclient/>
                </nfsexport>
                <script name="scooby"/>
        </fs>
        <ip/>
</service>

In this example, we add a script resource named "scooby" to the fs resource
named "two".  If "two" fails, all of its children must be restarted.  That is,
the nfsexport (and nfsclient) and the script named "scooby" are all restarted.
Similarly, adhering to current rgmanager behavior, if any of "two"'s children
fail, everything up to and including "two" will be restarted.  For example, if
"scooby" fails, the nfsexport/nfsclient children of "two" will also be
restarted - and so will "two" itself.

However, the file system "one" will never be affected by any of "two"'s problems.

-- Additional comment from simone.gotti on 2007-02-23 05:06 EST --
Created an attachment (id=148657)
Patch that will not handle the special flag discussed in comments #1 and #2.


-- Additional comment from simone.gotti on 2007-02-23 05:07 EST --
Hi,

as requested on IRC, I attached an initial patch against CVS HEAD, written
before the considerations in comments #1 and #2.  This patch ALWAYS stops
resources down to the failed one, and then starts again from the failed one.

When the status check on a resource fails, the RF_NEEDSTOP and RF_NEEDSTART
flags of that node are set.
Two new functions, svc_condstart and svc_condstop, are added.

The call to svc_stop was moved inside handle_recover_req, so it will first check
the recovery policies and, if needed, call svc_condstop instead of svc_stop.

Thanks!
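[Editorial note: the stop/start ordering the patch describes can be modeled with a minimal Python sketch. This is NOT the rgmanager C code; the flag and function names (RF_NEEDSTOP, RF_NEEDSTART, svc_condstop, svc_condstart) are borrowed from the patch description above, and the Resource class is an illustrative stand-in for rgmanager's resource tree.]

```python
class Resource:
    """Illustrative stand-in for a node in rgmanager's resource tree."""
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.needstop = self.needstart = False  # models RF_NEEDSTOP / RF_NEEDSTART

def mark_failed(node):
    """A failed status check flags the node and its whole subtree."""
    node.needstop = node.needstart = True
    for child in node.children:
        mark_failed(child)

def condstop(node, order):
    """svc_condstop analogue: stop flagged resources children-first."""
    for child in node.children:
        condstop(child, order)
    if node.needstop:
        order.append(("stop", node.name))
        node.needstop = False

def condstart(node, order):
    """svc_condstart analogue: start flagged resources parent-first."""
    if node.needstart:
        order.append(("start", node.name))
        node.needstart = False
    for child in node.children:
        condstart(child, order)

# The tree from the original report: <service><oracle/><ip><script/></ip></service>
script = Resource("script")
ip = Resource("ip", [script])
oracle = Resource("oracle")
service = Resource("service", [oracle, ip])

mark_failed(ip)          # the status check on "ip" fails
order = []
condstop(service, order)
condstart(service, order)
print(order)  # [('stop', 'script'), ('stop', 'ip'), ('start', 'ip'), ('start', 'script')]
```

Note that "oracle" never appears in the recovery order, matching the desired behavior from the original report: stop script, stop ip, start ip, start script, without also restarting oracle.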

-- Additional comment from lhh on 2007-02-23 14:11 EST --
Thanks Simone - I won't be able to look at this in detail for about a week or
so.  Sorry (in advance) for the delay!


NOTE: This is a true 'requirement' flag on 229650; we cannot implement UI
support until 229650 is MODIFIED.

As it currently stands, 229650 adds checks for the "__independent_subtree" flag
on each node in the resource tree, but it is not fully functional.

Comment 1 Lon Hohberger 2007-05-24 19:59:06 UTC
A new attribute, __independent_subtree, which is attached to resources in the
*tree* structure - but not in the <resources> section - is now present.

Comment 2 Don Domingo 2007-06-14 02:48:27 UTC
making this bug block RHEL5.1 release notes (requires_release_notes was set)

please post the necessary release note content in this bug. if no release note
is required for this bug, please clear the requires_release_notes flag and
remove BZ#222082 from "Bug 239594 blocks".

thanks!

Comment 3 Jim Parsons 2007-06-26 03:19:39 UTC
This attribute has been added to the schema. There is no UI checkbox for this
attribute, as it is an advanced option and not a typically used parameter for
the large majority of users.

Lon: I think you would be a better candidate to write the release note blurb
than myself. Marking as NEEDINFO as a signal to you, Lon. Please mark MODIFIED
after adding release note comment.

Comment 4 Lon Hohberger 2007-06-27 15:19:47 UTC
This feature has very limited use cases, but in reality, the original comment
hits it on the head, except for the example:

Normally, the entire service is restarted if any of its child resources fail.
For most users, this is the correct behavior.  Some, however, would prefer that
only a specific set of otherwise independent resources be restarted in the case
of a failure.

If you place __independent_subtree="1" somewhere in the resource tree, only that
resource and its children will be restarted in the case of a failure of that
resource (or one of its children).  If the independent subtree fails to restart,
then the entire service is restarted.


Example:

<service name="example">
        <fs name="One" __independent_subtree="1" ...>
                <nfsexport ...>
                        <nfsclient .../>
                </nfsexport>
        </fs>
        <fs name="Two" ...>
                <nfsexport ...>
                        <nfsclient .../>
                </nfsexport>
                <script name="Database" .../>
        </fs>
        <ip/>
</service>

In this example, we have two file system resources, named One and Two.  If One
fails, it is restarted.  If Two fails, all components of the service are
restarted.  The important thing here is that neither Two nor any of its children
(including Database) may depend on any resource provided by One or its children.

In effect, this allows you to split a single, large service into smaller bits
and pieces which are independent.  
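[Editorial note: the restart-scope rule described above can be sketched as "walk from the failed resource toward the service root; the first self-or-ancestor carrying __independent_subtree="1" is the restart root, otherwise the whole service restarts." The sketch below applies that rule to the example config with the standard library XML parser; the function name restart_root is hypothetical, and the escalation case (an independent subtree that fails to restart causing a full service restart) is not modeled.]

```python
import xml.etree.ElementTree as ET

# The <service> example from this comment, trimmed to the attributes that matter.
CONF = """
<service name="example">
  <fs name="One" __independent_subtree="1">
    <nfsexport><nfsclient name="c1"/></nfsexport>
  </fs>
  <fs name="Two">
    <nfsexport><nfsclient name="c2"/></nfsexport>
    <script name="Database"/>
  </fs>
  <ip/>
</service>
"""

def restart_root(service, failed):
    """Return the element restarted when `failed` fails: the nearest
    self-or-ancestor with __independent_subtree="1", else the whole service."""
    def walk(node, path):
        path = path + [node]
        if node is failed:
            for ancestor in reversed(path):
                if ancestor.get("__independent_subtree") == "1":
                    return ancestor
            return service  # no independent subtree on the path: full restart
        for child in node:
            found = walk(child, path)
            if found is not None:
                return found
        return None
    return walk(service, [])

service = ET.fromstring(CONF)
c1 = service.find("fs[@name='One']").find(".//nfsclient")
db = service.find(".//script[@name='Database']")

print(restart_root(service, c1).get("name"))  # "One": only that subtree restarts
print(restart_root(service, db).get("name"))  # "example": whole service restarts
```

A failure under fs One stays contained to One's subtree, while a failure of Database (under the non-independent fs Two) escalates to the whole service, as the text above describes.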

NOTES:

* Some resources cannot be used in a service with independent subtrees (Samba
resources, for example, since they require a specific service structure...)

* Restarting a portion of a service is not considered a service restart. 
Repeated subtree restarts (due to repeated failures) are not acted upon or
reported (aside from in system logs), so use this option with care.

* If you are unsure about whether this is a safe option for you to use, do not
use it. ;)

Comment 5 Kiersten (Kerri) Anderson 2007-06-27 20:37:11 UTC
Added devel and qa acks to get the errata filed.

Comment 8 Jim Parsons 2007-06-28 01:17:34 UTC
This is an attribute within the cluster.conf file, found in /etc/cluster on a
running node. The attribute can be added to any resource. Example from above...

To use it, you must add the attribute to the conf file on one of the nodes, then
propagate the conf file to the cluster - this is a documented procedure. I would
imagine that rgmanager would need to be restarted - but Lon can comment on that
if he sees this. Anyway, this should be enough for a release note. Hope this
helped...

<service name="example">
        <fs name="One" __independent_subtree="1" ...>
                <nfsexport ...>
                        <nfsclient .../>
                </nfsexport>
        </fs>
        <fs name="Two" ...>
                <nfsexport ...>
                        <nfsclient .../>
                </nfsexport>
                <script name="Database" .../>
        </fs>
        <ip/>
</service>
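[Editorial note: the edit-then-propagate procedure described above can be sketched programmatically. This is an illustrative Python sketch, not a supported tool: the cluster.conf fragment is trimmed and its names are made up, and the config_version bump reflects the usual requirement that propagation tools see a higher version on every edit.]

```python
import xml.etree.ElementTree as ET

# A trimmed cluster.conf-style fragment (names and version are illustrative).
CONF = """
<cluster name="demo" config_version="41">
  <rm>
    <service name="example">
      <fs name="One"/>
    </service>
  </rm>
</cluster>
"""

cluster = ET.fromstring(CONF)

# Mark the fs resource inside the service tree as an independent subtree.
fs = cluster.find(".//service[@name='example']/fs[@name='One']")
fs.set("__independent_subtree", "1")

# Cluster propagation expects config_version to increase on every edit.
cluster.set("config_version", str(int(cluster.get("config_version")) + 1))

edited = ET.tostring(cluster, encoding="unicode")
print(edited)
```

After writing the edited file back to /etc/cluster/cluster.conf on one node, the "documented procedure" the comment refers to is the standard RHEL 5 propagation step (typically `ccs_tool update /etc/cluster/cluster.conf`; consult the Cluster Administration guide for the exact procedure on your release).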



Comment 12 Don Domingo 2007-08-07 23:17:36 UTC
further edit:

<snip>
Here, two file system resources are used: One and Two. If One fails, it is
restarted without interrupting Two. If Two fails, all components (One, children
of One and children of Two) are restarted. At no given time are Two and its
children dependent on any resource provided by One.
</snip>

Comment 13 Lon Hohberger 2007-08-30 15:47:22 UTC
Right, this has a bug in it: it's sometimes treating non-__independent_subtree
bits as independent.

I am testing a patch.

Comment 17 errata-xmlrpc 2007-11-07 16:44:58 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0639.html


