605733 – RFE: Critical/Non-Critical services & resources

Bug 605733 - RFE: Critical/Non-Critical services & resources

Summary: RFE: Critical/Non-Critical services & resources

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	rgmanager
Sub Component:
Version:	5.5
Hardware:	All
OS:	Linux
Priority:	high
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Lon Hohberger
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	618810 (view as bug list)
Depends On:
Blocks:	637259
TreeView+	depends on / blocked

Reported:	2010-06-18 16:14 UTC by thunt
Modified:	2016-04-26 13:26 UTC (History)
CC List:	5 users (show)
Fixed In Version:	rgmanager-2.0.52-6.17.el5
Doc Type:	Enhancement
Doc Text:
Clone Of:
Clones:	634277 637259 (view as bug list)
Environment:
Last Closed:	2011-01-13 23:26:58 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
Possible patch (952 bytes, patch) 2010-08-09 19:15 UTC, Lon Hohberger	no flags	Details \| Diff
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2011:0134	0	normal	SHIPPED_LIVE	rgmanager bug fix and enhancement update	2011-01-12 19:20:47 UTC

Description thunt 2010-06-18 16:14:29 UTC

Description of problem:

RHCS only defines three recovery options for a failed process:-
- Disable
- Restart and relocate if restart fails
- Relocate

There is not a "Restart but do not relocate" option.

The use case is a configuration running multiple custom/flaky applications using the same storage and IP address. If an individual application fails, the customer wants to attempt restart(s), but if the restart of a individual application fails, there is absolutely no point in relocating, because it's unlikely fix the problem and just mess up the other applications running on the same box.

Note: Veritas defines this as critical vs non-critical applications. Failure of a critical app triggers failover. Failure of a non-critical doesn't.

Comment 1 Perry Myers 2010-06-19 13:44:08 UTC

Lon and I discussed this a bit on IRC.  Today you could do this by setting up a failover domain for a service to be a single host.  If you do this, the service should be restarted if it fails on the single host which makes up its failover domain, but if that host fails it will not be relocated to another host outside of that domain.

So not sure, but it could be that this bug can be closed as CURRENTRELEASE.

thunt, can you let us know if the above configuration suggestion would satisfy your requirements?

Comment 2 thunt 2010-07-01 17:04:34 UTC

That was one of the workarounds considered. Unfortunately, it doesn't handle the case when (say) the host fails, and the whole configuration needs to be relocated.

Incidentally, it helps to think in terms of multiple applications/services running together and using the same resources, for example all running from the same disk partition. The requirement makes a lot less sense for a single application/service since the primary intent is to avoid disrupting the other applications on the host when one of them fails.

Tim

Comment 3 Lon Hohberger 2010-07-29 20:44:07 UTC

There are two levels here:

1) per-resource:

- if you have a service which includes a file system, a
database and a web server, you might say that the
web server is a non-critical resource. This is partially
implemented using __independent_subtree but there is no
facility to give up and stop restarting (or even bothering
to check) the state of the web server. There is also
currently no facility to query which individual resources
have failed, nor a method to restore those resources to
operation in coordination with rgmanager.

2) per-service:

- simply adding a 'restart-only' recovery policy which, if
a restart fails, aborts and marks the service as 'stopped'
instead of relocating the service to another node.

This is because rgmanager does not have collocation dependencies outside of parent/child ordering within a service.

I suspect that the following limitation will have to apply:

* multiply instanced resources (e.g. the same <script> being used
in multiple times either in a single service or multiple services)
will not be allowed to be used as non-critical resources. This is
because it will be nearly impossible for an administrator to restore
a particular instance from the command line.

Brainstorming:

* Might be simplest to simply add a convalesce operation for a service:

clusvcadm -c <service_name> -- Convalesce (restore) service name
to operation. This attempts to restore any, non-critical
resources within the given service to operation.

Comment 4 Lon Hohberger 2010-07-29 20:44:42 UTC

oops.

...any failed, non-critical resources within the given service to operation...

Comment 5 Lon Hohberger 2010-07-29 21:02:27 UTC

It might go without saying but any children of a non-critical resource are also non-critical.

Comment 6 Lon Hohberger 2010-07-29 21:09:03 UTC

Also, because of dependency ordering (children depend on parent), children of a non-critical resource will be stopped when their parent is stopped.

Comment 7 thunt 2010-07-30 03:43:08 UTC

The addition of a restart-only recovery policy addresses the non-critical requirement.

I suspect that 90% (or more) of the use cases for this requirement will be for the service at the bottom of the dependency tree, so there won't be any children to worry about. And for the cases where there are children, it's very reasonable to stop them if the parent can not be restarted.

Comment 8 Lon Hohberger 2010-08-05 21:01:14 UTC

*** Bug 618810 has been marked as a duplicate of this bug. ***

Comment 9 Sayan Saha 2010-08-09 17:55:53 UTC

I think we should have configurable parameter NumRetries (per resource) where NumRetries = number of times a resource can be restarted before rgmanager gives up and stops restarting teh resource flagging it's state as partially_online. By default this should be set to 1. Some customers might want to restart more than once before giving up on a resource.

Comment 10 Sayan Saha 2010-08-09 18:01:35 UTC

The other thing I have seen people do is add a "delaybetweenrestarts" so that in case the first restart fails rgmanager can try restarting with delays induced between the restart attempts. This is sometimes more effective than trying to restart resources with no delay. Again this will also need to be configured on a per resource basis with default value set to 0 secs.

Comment 11 Lon Hohberger 2010-08-09 19:15:07 UTC

Created attachment 437683 [details]
Possible patch

This adds two additional recovery policies:

  restart-stop - after restart threshold is exceeded, place 
                 service into the 'stopped' state
  restart-disable - after restart threshold is exceeded,
                 place service into the 'disabled' state

Magic decoder ring: The 'stopped' state is temporary and rgmanager will trigger a service evaluation after the next member or service transition.  The 'disabled' state remains until either quorum is lost and regained (at which point the service is evaluated according to autostart) or the administrator re-enables it.

Comment 12 Lon Hohberger 2010-08-09 19:23:38 UTC

If this is not needed for subtrees:

  <service name="foo" >
    <script name="web" file="/etc/init.d/httpd" __critical="0" />
    <script name="oracle" />
  </service>

... then it is a far smaller change - since we don't need to track and report the information back to the users.

Comment 13 thunt 2010-08-10 02:37:28 UTC

Isn't this just another recovery option? If a script is marked as non-critical, it can never fail, making all the other recovery option irrelevant.

Also, I think there is an expectation that non-critical services get the same level of control, monitoring and notifications as other services and I'm not sure that making critical an attribute of script element delivers this.

Comment 14 Lon Hohberger 2010-09-15 17:29:19 UTC

There are three main components:

1) a restart-disable policy on the whole service which interacts
with the existing max-restarts / restart-expire-time
2) non-critical independent subtrees:
- the ability to let designated resources fail
- the ability to recover these resources
3) restart threshold policies on independent subtrees
- the ability to define max-restarts / restart-expire-time
on a per subtree basis
- operation with normal independent subtrees:
service goes into recovery when threshold is exceeded
- operation with non-critical independent subtrees:
disable subtree when threshold is exceeded

I'm not sure what else would be required; I believe this satisfies all of the requirements.

Here's the actual working operational outline:

Configuration:

I. 2+ refs to same resource clears independent
subtree/non-critical flag. You may only use non-critical flag
on singly-referenced resources; you may not use non-critical flag
with multiple-instance resources. This limitation is unlikely to
change due to limitations in how rgmanager handles multi-reference
resources.

II. Non-critical flag is applied when the administrator sets
__independent_subtree="2" in cluster.conf for a given resource link
in the resource tree.

III.The non-critical flag works with all resources at all levels of the resource
tree, but should not be used at the top level when defining services or
virtual machines.

IV. Independent subtree per-node max restart thresholds.
You can now set max restarts and restart expirations on a per-node
basis in the resource tree.
A. This implements a sliding window restart tolerance as is done currently
at the service (or top) level of rgmanager's resource trees.
B. Options
1. __max_restarts => Maximum number of tolerated restarts prior to
giving up
2. __restart_expire_time => After this time, a restart is "forgotten".
C. BOTH of IV.B.1 and IV.B.2 must be provided.

Operational:

II. Status failure for __independent_subtree="2" resources
A. Failure
1. resources are stopped and not restarted as a consequence
a. a stop failure during non-critical recovery does not result
in a failed service state, however, a failure during a real
relocation/disable/restart/failover/etc. *does* result in
a failed service state.
b. resource is quiesced and placed into a special internal
state which is not part of the distributed state information
that rgmanager maintains.
c. See II.B.8 for information on interaction with restart tolerance
2. status checks on the resource are disabled (nothing in logs for
status checking or attempting to repair the service after the
resource is stopped).
3. clustat reporting
a. Reports [P] (partial) in normal output next to the service
state
b. reports "partial" in long mode output in the "Flags" line
c. reports "partial" in flags_str in XML output
4. Logs report some parts are stopped and how to fix them
a. e.g. clusvcadm -c service:test
5. A failure of an non-critical resource will cause all descendent
resources to be stopped as well as the non-critical resource
6. A failure of a descendent of a non-critical resource will cause
all descendents of the non-critical resource to be stopped as well
as the non-critical resource.
B. Recovery
1. disable/enable clears [P] flag and restarts all service parts
2. restart operation clears [P] flag and restarts all service parts
3. relocate operation clears [P] flag and restarts all service parts
just like normal
4. convalesce operation clears [P] flag and JUST restarts stopped
parts
a. failure to repair does not cause a failed state.
b. failure to repair returns 'failure' to clusvcadm and leaves
the partial flag alone
5. updating the cluster configuration does not by itself attempt
to implicitly convalesce the service.
6. removing __independent_subtree="2" from cluster.conf does not
cause rgmanager to suddenly care about the previously-failed part
of a given service; you MUST still convalesce it to restore the
broken parts to operation
7. A failure to restart after a convalesce operation will normally
cause a subsequent status check failure and subsequent stop operation
naturally.
8. If a restart tolerance is configured
a. the independent subtree is restarted until the tolerance is
exceeded.
b. once the tolerance is exceeded, if the subtree is non-critical,
the subtree is stopped
c. once the tolerance is exceeded, if the subtree is critical,
the service is restarted
d. Users must restart or convalesce the service in order to clear
any existing restart counters.
e. Restart counters are at the subtree level - any child resource
of an independent subtree will increment the subtree's restart
counter.
f. A failure during a stop of any part of a subtree with a restart
tolerance on a non-critical subtree immediately disables the
subtree; no restart action is performed no matter how the
restart tolerance is specified.
C. Negative testing
1. clusvcadm -c (convalesce) does nothing on:
a. failed
b. disabled
c. stopped
d. transitional states (starting, stopping, recovering, etc.)
III. Regression testing
A. __independent_subtree="1" resources are correctly restarted
without causing service to enter failed state or rest of
service to restart.
B. non-independent subtree resources correctly cause the entire
service to restart
C. Interaction with 'Z' (Frozen) flag:
1. Freezing a partially-failed service prints [ZP] in clustat output
2. Partial is not cleared when unfrozen
3. Disabled parts are not automatically fixed when a service is
unfrozen
IV. rgmanager dump & rg_test output
A. Shows 'NON-CRITICAL' when a resource is tagged with
__independent_subtree="2"
B. State outputs for individual resources
1. rgmanager dumps (created by killing rgmanager with SIGUSR1)
show:
a. S0 for a stopped resource
b. S1 for started
c. S2 for failed
d. S3 for disabled (e.g. when a resource has been quiesced
it moves from S1 -> S3).
2. rg_test output will always show S0

Known Issues:

I. Configuration updates
A. Removing a broken resource from the configuration which previously
was __independent_subtree="2"
1. Does not clear the [P] flag at the service state level
2. Users must issue clusvcadm -c <service> to clear this and/or
restart the service to clear the P flag.
3. Users must ensure that no resources remain allocated which
would prevent the remaining service bits from stopping cleanly

Comment 15 Lon Hohberger 2010-09-24 22:33:54 UTC


http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=977a199f73e70d8527a335841a5188bcaaa99477
http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=0c39fc38a75ab456945b1ac2c73b1fd35dc0cf9f
http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=438eb80aa512526f345863d6b8011b6ac31e688e
http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=d8572889efb66c3540658c3eb9974b8e870d56ed
http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=d826bd7fd129f80601a2b33cadc8f1d32f4f8e69
http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=5a512395a0f54262c874f93ff590ca397b9bfeb6
http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=c1de3c2fc07a756a6c96dbe6ebe17257b1e08d52
http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=f17eaaf6827237cd13d9086e7b1fbd6eaf702db1
http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=06993e7d6253dbb9a0e83c8edeba4d7a99f61954
http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=39919bc11779ca18490548199e977c8a93c627be
http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=390e19b31e8783b4c80b6e044ec06978bdb20c1e
http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=2a7f8f06f369fe255a5eb5e9cd6b2b2d577ed7c5
http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=b05534b22a95d15be95035448a20f6fa321a9a10
http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=4c2014fc4ecd96357881019848a77d25cbe1121b
http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=badfcf9fd75c65285c1a231d989346098ea1c42c

Comment 18 errata-xmlrpc 2011-01-13 23:26:58 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0134.html

Note You need to log in before you can comment on or make changes to this bug.