Description of problem:
RHCS defines only three recovery options for a failed process:
- Restart and relocate if restart fails
There is no "Restart but do not relocate" option.
The use case is a configuration running multiple custom/flaky applications using the same storage and IP address. If an individual application fails, the customer wants to attempt restart(s), but if the restart of an individual application fails, there is absolutely no point in relocating, because relocation is unlikely to fix the problem and will just disrupt the other applications running on the same box.
Note: Veritas defines this as critical vs non-critical applications. Failure of a critical app triggers failover. Failure of a non-critical app doesn't.
Lon and I discussed this a bit on IRC. Today you could do this by setting up a failover domain for a service to be a single host. If you do this, the service should be restarted if it fails on the single host which makes up its failover domain, but if that host fails it will not be relocated to another host outside of that domain.
So not sure, but it could be that this bug can be closed as CURRENTRELEASE.
thunt, can you let us know if the above configuration suggestion would satisfy your requirements?
That was one of the workarounds considered. Unfortunately, it doesn't handle the case when (say) the host fails, and the whole configuration needs to be relocated.
Incidentally, it helps to think in terms of multiple applications/services running together and using the same resources, for example all running from the same disk partition. The requirement makes a lot less sense for a single application/service since the primary intent is to avoid disrupting the other applications on the host when one of them fails.
There are two levels here:
- if you have a service which includes a file system, a
database and a web server, you might say that the
web server is a non-critical resource. This is partially
implemented using __independent_subtree but there is no
facility to give up and stop restarting (or even bothering
to check) the state of the web server. There is also
currently no facility to query which individual resources
have failed, nor a method to restore those resources to
operation in coordination with rgmanager.
- simply adding a 'restart-only' recovery policy which, if
a restart fails, aborts and marks the service as 'stopped'
instead of relocating the service to another node.
This is because rgmanager does not have collocation dependencies outside of parent/child ordering within a service.
I suspect that the following limitation will have to apply:
* multiply instanced resources (e.g. the same <script> being used
multiple times, either in a single service or in multiple services)
will not be allowed to be used as non-critical resources. This is
because it will be nearly impossible for an administrator to restore
a particular instance from the command line.
* It might be simplest to add a convalesce operation for a service:
clusvcadm -c <service_name> -- Convalesce (restore) service name
to operation. This attempts to restore any, non-critical
resources within the given service to operation.
...any failed, non-critical resources within the given service to operation...
It might go without saying but any children of a non-critical resource are also non-critical.
Also, because of dependency ordering (children depend on parent), children of a non-critical resource will be stopped when their parent is stopped.
The addition of a restart-only recovery policy addresses the non-critical requirement.
I suspect that 90% (or more) of the use cases for this requirement will be for the service at the bottom of the dependency tree, so there won't be any children to worry about. And for the cases where there are children, it's very reasonable to stop them if the parent cannot be restarted.
*** Bug 618810 has been marked as a duplicate of this bug. ***
I think we should have a configurable parameter NumRetries (per resource), where NumRetries = the number of times a resource can be restarted before rgmanager gives up, stops restarting the resource, and flags its state as partially_online. By default this should be set to 1. Some customers might want to restart more than once before giving up on a resource.
The other thing I have seen people do is add a "delaybetweenrestarts" so that, if the first restart fails, rgmanager waits between subsequent restart attempts. This is sometimes more effective than trying to restart resources with no delay. Again, this would need to be configurable on a per-resource basis, with the default value set to 0 secs.
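The NumRetries / delaybetweenrestarts proposal above can be sketched roughly as follows. This is only an illustration of the idea, not rgmanager code: the function name, the `start` callable and the injectable `sleep` parameter are all hypothetical, and the defaults (1 retry, 0 seconds) are the ones suggested in the comment.

```python
import time


def restart_with_retries(start, num_retries=1, delay_between_restarts=0.0,
                         sleep=time.sleep):
    """Try to restart a resource at most num_retries times.

    start: hypothetical callable that returns True if the resource came up.
    Returns "online" on success, or "partially_online" once the retries
    are used up (the state name proposed in the comment above).
    """
    for attempt in range(num_retries):
        # Wait only between attempts, never before the first one.
        if attempt > 0 and delay_between_restarts > 0:
            sleep(delay_between_restarts)
        if start():
            return "online"
    return "partially_online"
```

With the defaults, a single failed restart immediately yields "partially_online"; raising num_retries and the delay gives a flaky application more chances to come back.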
Created attachment 437683
This adds two additional recovery policies:
restart-stop - after restart threshold is exceeded, place
service into the 'stopped' state
restart-disable - after restart threshold is exceeded,
place service into the 'disabled' state
Magic decoder ring: The 'stopped' state is temporary and rgmanager will trigger a service evaluation after the next member or service transition. The 'disabled' state remains until either quorum is lost and regained (at which point the service is evaluated according to autostart) or the administrator re-enables it.
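As a rough model of the two policies and the "magic decoder ring" above, the sketch below maps each policy to the state it lands in and lists which events cause that state to be left. The event names are made up for illustration, and only the transitions stated in this comment are modelled.

```python
# State a service is placed in once the restart threshold is exceeded.
THRESHOLD_OUTCOME = {
    "restart-stop": "stopped",
    "restart-disable": "disabled",
}

# Events after which rgmanager evaluates the service again.
# Event names are hypothetical labels, not rgmanager identifiers.
REEVALUATED_ON = {
    "stopped": {"member_transition", "service_transition"},
    "disabled": {"quorum_regained", "admin_enable"},
}


def state_after_threshold(policy):
    """Return the service state for a policy once restarts are exhausted."""
    return THRESHOLD_OUTCOME[policy]


def is_reevaluated(state, event):
    """Return True if the given event causes the service to be evaluated."""
    return event in REEVALUATED_ON[state]
```

For example, a service under restart-disable stays disabled across a member transition, while one under restart-stop is picked up again.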
If this is not needed for subtrees:
<service name="foo">
  <script name="web" file="/etc/init.d/httpd" __critical="0" />
  <script name="oracle" />
</service>
... then it is a far smaller change - since we don't need to track and report the information back to the users.
Isn't this just another recovery option? If a script is marked as non-critical, it can never fail, making all the other recovery options irrelevant.
Also, I think there is an expectation that non-critical services get the same level of control, monitoring and notifications as other services, and I'm not sure that making critical an attribute of the script element delivers this.
There are three main components:
1) a restart-disable policy on the whole service which interacts
with the existing max-restarts / restart-expire-time
2) non-critical independent subtrees:
- the ability to let designated resources fail
- the ability to recover these resources
3) restart threshold policies on independent subtrees
- the ability to define max-restarts / restart-expire-time
on a per subtree basis
- operation with normal independent subtrees:
service goes into recovery when threshold is exceeded
- operation with non-critical independent subtrees:
disable subtree when threshold is exceeded
I'm not sure what else would be required; I believe this satisfies all of the requirements.
Here's the actual working operational outline:
I. 2+ refs to same resource clears independent
subtree/non-critical flag. You may only use non-critical flag
on singly-referenced resources; you may not use non-critical flag
with multiple-instance resources. This limitation is unlikely to
change due to limitations in how rgmanager handles multi-reference
II. Non-critical flag is applied when the administrator sets
__independent_subtree="2" in cluster.conf for a given resource link
in the resource tree.
III. The non-critical flag works with all resources at all levels of the resource
tree, but should not be used at the top level when defining services or
IV. Independent subtree per-node max restart thresholds.
You can now set max restarts and restart expirations on a per-node
basis in the resource tree.
A. This implements a sliding window restart tolerance as is done currently
at the service (or top) level of rgmanager's resource trees.
1. __max_restarts => Maximum number of tolerated restarts prior to
2. __restart_expire_time => After this time, a restart is "forgotten".
C. BOTH of IV.B.1 and IV.B.2 must be provided.
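The sliding window restart tolerance described in IV can be pictured with the sketch below. It is an assumption-laden model, not the rgmanager implementation: the class and method names are hypothetical, times are taken to be in seconds, and the tolerance is treated as exceeded once more than __max_restarts restarts fall inside the window.

```python
from collections import deque


class RestartWindow:
    """Sliding window of recent restarts for one independent subtree."""

    def __init__(self, max_restarts, restart_expire_time):
        # Both values must be provided, as required by the outline.
        self.max_restarts = max_restarts
        self.restart_expire_time = restart_expire_time
        self._restarts = deque()

    def record_restart(self, now):
        """Record a restart at time `now`; return True if tolerance is exceeded."""
        # "Forget" restarts that are older than the expire time.
        while self._restarts and now - self._restarts[0] >= self.restart_expire_time:
            self._restarts.popleft()
        self._restarts.append(now)
        return len(self._restarts) > self.max_restarts

    def clear(self):
        """Drop all remembered restarts."""
        self._restarts.clear()
```

With max_restarts=2 and restart_expire_time=60, three restarts within a minute exceed the tolerance, while restarts spaced further apart than a minute never do.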
II. Status failure for __independent_subtree="2" resources
1. resources are stopped and not restarted as a consequence
a. a stop failure during non-critical recovery does not result
in a failed service state, however, a failure during a real
relocation/disable/restart/failover/etc. *does* result in
a failed service state.
b. resource is quiesced and placed into a special internal
state which is not part of the distributed state information
that rgmanager maintains.
c. See II.B.8 for information on interaction with restart tolerance
2. status checks on the resource are disabled (nothing in logs for
status checking or attempting to repair the service after the
resource is stopped).
3. clustat reporting
a. Reports [P] (partial) in normal output next to the service
b. reports "partial" in long mode output in the "Flags" line
c. reports "partial" in flags_str in XML output
4. Logs report some parts are stopped and how to fix them
a. e.g. clusvcadm -c service:test
5. A failure of a non-critical resource will cause all descendant
resources to be stopped as well as the non-critical resource
6. A failure of a descendant of a non-critical resource will cause
all descendants of the non-critical resource to be stopped as well
as the non-critical resource.
1. disable/enable clears [P] flag and restarts all service parts
2. restart operation clears [P] flag and restarts all service parts
3. relocate operation clears [P] flag and restarts all service parts
just like normal
4. convalesce operation clears [P] flag and JUST restarts stopped
a. failure to repair does not cause a failed state.
b. failure to repair returns 'failure' to clusvcadm and leaves
the partial flag alone
5. updating the cluster configuration does not by itself attempt
to implicitly convalesce the service.
6. removing __independent_subtree="2" from cluster.conf does not
cause rgmanager to suddenly care about the previously-failed part
of a given service; you MUST still convalesce it to restore the
broken parts to operation
7. A failure to restart after a convalesce operation will normally
cause a subsequent status check failure and subsequent stop operation
8. If a restart tolerance is configured
a. the independent subtree is restarted until the tolerance is
b. once the tolerance is exceeded, if the subtree is non-critical,
the subtree is stopped
c. once the tolerance is exceeded, if the subtree is critical,
the service is restarted
d. Users must restart or convalesce the service in order to clear
any existing restart counters.
e. Restart counters are at the subtree level - any child resource
of an independent subtree will increment the subtree's restart
f. A failure during a stop of any part of a subtree with a restart
tolerance on a non-critical subtree immediately disables the
subtree; no restart action is performed no matter how the
restart tolerance is specified.
C. Negative testing
1. clusvcadm -c (convalesce) does nothing on:
d. transitional states (starting, stopping, recovering, etc.)
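The failure handling for non-critical subtrees (items 5 and 6) and the convalesce operation can be modelled with a small resource tree. Everything here is a hypothetical sketch: the Resource class and function names are invented, and picking the nearest enclosing non-critical resource when such resources are nested is an assumption, since the outline does not cover that case.

```python
class Resource:
    """A node in a simplified service resource tree."""

    def __init__(self, name, non_critical=False, children=()):
        self.name = name
        self.non_critical = non_critical
        self.children = list(children)
        self.parent = None
        self.running = True
        for child in self.children:
            child.parent = self


def subtree(resource):
    """Yield the resource and all of its descendants, parents first."""
    yield resource
    for child in resource.children:
        yield from subtree(child)


def handle_failure(failed):
    """Stop the enclosing non-critical subtree, if there is one.

    Returns "partial" when the failure was contained, or "recover" when
    the failed resource is critical and normal service recovery applies.
    """
    node = failed
    while node is not None and not node.non_critical:
        node = node.parent
    if node is None:
        return "recover"
    for res in subtree(node):
        res.running = False
    return "partial"


def convalesce(root):
    """Restart only the stopped resources; return their names in order."""
    restarted = []
    for res in subtree(root):
        if not res.running:
            res.running = True
            restarted.append(res.name)
    return restarted
```

In a tree where a non-critical web server has a child resource, a failure of that child stops both of them and leaves the file system and database untouched; a later convalesce brings back just those two.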
III. Regression testing
A. __independent_subtree="1" resources are correctly restarted
without causing service to enter failed state or rest of
service to restart.
B. non-independent subtree resources correctly cause the entire
service to restart
C. Interaction with 'Z' (Frozen) flag:
1. Freezing a partially-failed service prints [ZP] in clustat output
2. Partial is not cleared when unfrozen
3. Disabled parts are not automatically fixed when a service is
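The interaction between the frozen and partial flags can be illustrated as below. This is a toy model with invented names; it only reproduces the clustat flag strings mentioned in this outline ([P] and [ZP]).

```python
class ServiceFlags:
    """Tracks the frozen (Z) and partial (P) flags of one service."""

    def __init__(self):
        self.frozen = False
        self.partial = False

    def render(self):
        """Return the flag string as shown next to the service name."""
        letters = ("Z" if self.frozen else "") + ("P" if self.partial else "")
        return f"[{letters}]" if letters else ""
```

Freezing and unfreezing only toggles `frozen`, so a partially-failed service shows [ZP] while frozen and goes back to [P] afterwards.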
IV. rgmanager dump & rg_test output
A. Shows 'NON-CRITICAL' when a resource is tagged with
B. State outputs for individual resources
1. rgmanager dumps (created by killing rgmanager with SIGUSR1)
a. S0 for a stopped resource
b. S1 for started
c. S2 for failed
d. S3 for disabled (e.g. when a resource has been quiesced
it moves from S1 -> S3).
2. rg_test output will always show S0
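For anyone post-processing rgmanager dumps, the state codes listed above map naturally onto an enumeration. The enum and helper names are hypothetical; only the S0 to S3 meanings come from this outline.

```python
from enum import IntEnum


class ResourceState(IntEnum):
    """Per-resource state codes as printed in rgmanager dumps."""
    STOPPED = 0
    STARTED = 1
    FAILED = 2
    DISABLED = 3  # e.g. a quiesced non-critical resource


def dump_code(state):
    """Format a state the way the dump shows it, e.g. 'S3'."""
    return f"S{int(state)}"


def parse_dump_code(code):
    """Turn a code such as 'S1' back into a ResourceState."""
    return ResourceState(int(code[1:]))
```

A quiesced resource would therefore move from dump_code(ResourceState.STARTED) to dump_code(ResourceState.DISABLED), i.e. S1 -> S3.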
I. Configuration updates
A. Removing a broken resource from the configuration which previously
1. Does not clear the [P] flag at the service state level
2. Users must issue clusvcadm -c <service> to clear this and/or
restart the service to clear the P flag.
3. Users must ensure that no resources remain allocated which
would prevent the remaining service bits from stopping cleanly
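To summarise the restart tolerance rules in item 8 of the outline above, here is a small decision function. It is a sketch under stated assumptions: the function name and the returned action labels are invented, and a stop failure on a critical subtree is deliberately rejected because item 8 does not describe that case.

```python
def on_subtree_failure(tolerance_exceeded, non_critical, stop_failed=False):
    """Pick the action for a failed independent subtree with a restart tolerance.

    tolerance_exceeded: True once the sliding-window threshold is exceeded.
    non_critical: True for subtrees tagged __independent_subtree="2".
    stop_failed: True if stopping any part of the subtree failed.
    """
    if stop_failed:
        if not non_critical:
            # Not covered by item 8 of the outline, so not modelled here.
            raise ValueError("stop failure on a critical subtree is not modelled")
        # 8.f: disable immediately, regardless of the tolerance.
        return "disable-subtree"
    if not tolerance_exceeded:
        # 8.a: keep restarting just the subtree.
        return "restart-subtree"
    # 8.b / 8.c: outcome depends on whether the subtree is critical.
    return "stop-subtree" if non_critical else "restart-service"
```

Restart counters are only cleared by restarting or convalescing the service (8.d), so callers would keep feeding the same tolerance state into this function until then.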
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.