Description of problem: RHCS only defines three recovery options for a failed process:- - Disable - Restart and relocate if restart fails - Relocate There is not a "Restart but do not relocate" option. The use case is a configuration running multiple custom/flaky applications using the same storage and IP address. If an individual application fails, the customer wants to attempt restart(s), but if the restart of a individual application fails, there is absolutely no point in relocating, because it's unlikely fix the problem and just mess up the other applications running on the same box. Note: Veritas defines this as critical vs non-critical applications. Failure of a critical app triggers failover. Failure of a non-critical doesn't.
Lon and I discussed this a bit on IRC. Today you could do this by setting up a failover domain for a service to be a single host. If you do this, the service should be restarted if it fails on the single host which makes up its failover domain, but if that host fails it will not be relocated to another host outside of that domain. So not sure, but it could be that this bug can be closed as CURRENTRELEASE. thunt, can you let us know if the above configuration suggestion would satisfy your requirements?
That was one of the workarounds considered. Unfortunately, it doesn't handle the case when (say) the host fails, and the whole configuration needs to be relocated. Incidentally, it helps to think in terms of multiple applications/services running together and using the same resources, for example all running from the same disk partition. The requirement makes a lot less sense for a single application/service since the primary intent is to avoid disrupting the other applications on the host when one of them fails. Tim
There are two levels here: 1) per-resource: - if you have a service which includes a file system, a database and a web server, you might say that the web server is a non-critical resource. This is partially implemented using __independent_subtree but there is no facility to give up and stop restarting (or even bothering to check) the state of the web server. There is also currently no facility to query which individual resources have failed, nor a method to restore those resources to operation in coordination with rgmanager. 2) per-service: - simply adding a 'restart-only' recovery policy which, if a restart fails, aborts and marks the service as 'stopped' instead of relocating the service to another node. This is because rgmanager does not have collocation dependencies outside of parent/child ordering within a service. I suspect that the following limitation will have to apply: * multiply instanced resources (e.g. the same <script> being used in multiple times either in a single service or multiple services) will not be allowed to be used as non-critical resources. This is because it will be nearly impossible for an administrator to restore a particular instance from the command line. Brainstorming: * Might be simplest to simply add a convalesce operation for a service: clusvcadm -c <service_name> -- Convalesce (restore) service name to operation. This attempts to restore any, non-critical resources within the given service to operation.
oops. ...any failed, non-critical resources within the given service to operation...
It might go without saying but any children of a non-critical resource are also non-critical.
Also, because of dependency ordering (children depend on parent), children of a non-critical resource will be stopped when their parent is stopped.
The addition of a restart-only recovery policy addresses the non-critical requirement. I suspect that 90% (or more) of the use cases for this requirement will be for the service at the bottom of the dependency tree, so there won't be any children to worry about. And for the cases where there are children, it's very reasonable to stop them if the parent can not be restarted.
*** Bug 618810 has been marked as a duplicate of this bug. ***
I think we should have configurable parameter NumRetries (per resource) where NumRetries = number of times a resource can be restarted before rgmanager gives up and stops restarting teh resource flagging it's state as partially_online. By default this should be set to 1. Some customers might want to restart more than once before giving up on a resource.
The other thing I have seen people do is add a "delaybetweenrestarts" so that in case the first restart fails rgmanager can try restarting with delays induced between the restart attempts. This is sometimes more effective than trying to restart resources with no delay. Again this will also need to be configured on a per resource basis with default value set to 0 secs.
Created attachment 437683 [details] Possible patch This adds two additional recovery policies: restart-stop - after restart threshold is exceeded, place service into the 'stopped' state restart-disable - after restart threshold is exceeded, place service into the 'disabled' state Magic decoder ring: The 'stopped' state is temporary and rgmanager will trigger a service evaluation after the next member or service transition. The 'disabled' state remains until either quorum is lost and regained (at which point the service is evaluated according to autostart) or the administrator re-enables it.
If this is not needed for subtrees: <service name="foo" > <script name="web" file="/etc/init.d/httpd" __critical="0" /> <script name="oracle" /> </service> ... then it is a far smaller change - since we don't need to track and report the information back to the users.
Isn't this just another recovery option? If a script is marked as non-critical, it can never fail, making all the other recovery option irrelevant. Also, I think there is an expectation that non-critical services get the same level of control, monitoring and notifications as other services and I'm not sure that making critical an attribute of script element delivers this.
There are three main components: 1) a restart-disable policy on the whole service which interacts with the existing max-restarts / restart-expire-time 2) non-critical independent subtrees: - the ability to let designated resources fail - the ability to recover these resources 3) restart threshold policies on independent subtrees - the ability to define max-restarts / restart-expire-time on a per subtree basis - operation with normal independent subtrees: service goes into recovery when threshold is exceeded - operation with non-critical independent subtrees: disable subtree when threshold is exceeded I'm not sure what else would be required; I believe this satisfies all of the requirements. Here's the actual working operational outline: Configuration: I. 2+ refs to same resource clears independent subtree/non-critical flag. You may only use non-critical flag on singly-referenced resources; you may not use non-critical flag with multiple-instance resources. This limitation is unlikely to change due to limitations in how rgmanager handles multi-reference resources. II. Non-critical flag is applied when the administrator sets __independent_subtree="2" in cluster.conf for a given resource link in the resource tree. III.The non-critical flag works with all resources at all levels of the resource tree, but should not be used at the top level when defining services or virtual machines. IV. Independent subtree per-node max restart thresholds. You can now set max restarts and restart expirations on a per-node basis in the resource tree. A. This implements a sliding window restart tolerance as is done currently at the service (or top) level of rgmanager's resource trees. B. Options 1. __max_restarts => Maximum number of tolerated restarts prior to giving up 2. __restart_expire_time => After this time, a restart is "forgotten". C. BOTH of IV.B.1 and IV.B.2 must be provided. Operational: II. Status failure for __independent_subtree="2" resources A. Failure 1. resources are stopped and not restarted as a consequence a. a stop failure during non-critical recovery does not result in a failed service state, however, a failure during a real relocation/disable/restart/failover/etc. *does* result in a failed service state. b. resource is quiesced and placed into a special internal state which is not part of the distributed state information that rgmanager maintains. c. See II.B.8 for information on interaction with restart tolerance 2. status checks on the resource are disabled (nothing in logs for status checking or attempting to repair the service after the resource is stopped). 3. clustat reporting a. Reports [P] (partial) in normal output next to the service state b. reports "partial" in long mode output in the "Flags" line c. reports "partial" in flags_str in XML output 4. Logs report some parts are stopped and how to fix them a. e.g. clusvcadm -c service:test 5. A failure of an non-critical resource will cause all descendent resources to be stopped as well as the non-critical resource 6. A failure of a descendent of a non-critical resource will cause all descendents of the non-critical resource to be stopped as well as the non-critical resource. B. Recovery 1. disable/enable clears [P] flag and restarts all service parts 2. restart operation clears [P] flag and restarts all service parts 3. relocate operation clears [P] flag and restarts all service parts just like normal 4. convalesce operation clears [P] flag and JUST restarts stopped parts a. failure to repair does not cause a failed state. b. failure to repair returns 'failure' to clusvcadm and leaves the partial flag alone 5. updating the cluster configuration does not by itself attempt to implicitly convalesce the service. 6. removing __independent_subtree="2" from cluster.conf does not cause rgmanager to suddenly care about the previously-failed part of a given service; you MUST still convalesce it to restore the broken parts to operation 7. A failure to restart after a convalesce operation will normally cause a subsequent status check failure and subsequent stop operation naturally. 8. If a restart tolerance is configured a. the independent subtree is restarted until the tolerance is exceeded. b. once the tolerance is exceeded, if the subtree is non-critical, the subtree is stopped c. once the tolerance is exceeded, if the subtree is critical, the service is restarted d. Users must restart or convalesce the service in order to clear any existing restart counters. e. Restart counters are at the subtree level - any child resource of an independent subtree will increment the subtree's restart counter. f. A failure during a stop of any part of a subtree with a restart tolerance on a non-critical subtree immediately disables the subtree; no restart action is performed no matter how the restart tolerance is specified. C. Negative testing 1. clusvcadm -c (convalesce) does nothing on: a. failed b. disabled c. stopped d. transitional states (starting, stopping, recovering, etc.) III. Regression testing A. __independent_subtree="1" resources are correctly restarted without causing service to enter failed state or rest of service to restart. B. non-independent subtree resources correctly cause the entire service to restart C. Interaction with 'Z' (Frozen) flag: 1. Freezing a partially-failed service prints [ZP] in clustat output 2. Partial is not cleared when unfrozen 3. Disabled parts are not automatically fixed when a service is unfrozen IV. rgmanager dump & rg_test output A. Shows 'NON-CRITICAL' when a resource is tagged with __independent_subtree="2" B. State outputs for individual resources 1. rgmanager dumps (created by killing rgmanager with SIGUSR1) show: a. S0 for a stopped resource b. S1 for started c. S2 for failed d. S3 for disabled (e.g. when a resource has been quiesced it moves from S1 -> S3). 2. rg_test output will always show S0 Known Issues: I. Configuration updates A. Removing a broken resource from the configuration which previously was __independent_subtree="2" 1. Does not clear the [P] flag at the service state level 2. Users must issue clusvcadm -c <service> to clear this and/or restart the service to clear the P flag. 3. Users must ensure that no resources remain allocated which would prevent the remaining service bits from stopping cleanly
http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=977a199f73e70d8527a335841a5188bcaaa99477 http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=0c39fc38a75ab456945b1ac2c73b1fd35dc0cf9f http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=438eb80aa512526f345863d6b8011b6ac31e688e http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=d8572889efb66c3540658c3eb9974b8e870d56ed http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=d826bd7fd129f80601a2b33cadc8f1d32f4f8e69 http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=5a512395a0f54262c874f93ff590ca397b9bfeb6 http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=c1de3c2fc07a756a6c96dbe6ebe17257b1e08d52 http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=f17eaaf6827237cd13d9086e7b1fbd6eaf702db1 http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=06993e7d6253dbb9a0e83c8edeba4d7a99f61954 http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=39919bc11779ca18490548199e977c8a93c627be http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=390e19b31e8783b4c80b6e044ec06978bdb20c1e http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=2a7f8f06f369fe255a5eb5e9cd6b2b2d577ed7c5 http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=b05534b22a95d15be95035448a20f6fa321a9a10 http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=4c2014fc4ecd96357881019848a77d25cbe1121b http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=badfcf9fd75c65285c1a231d989346098ea1c42c
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2011-0134.html