Bug 605733

Summary: RFE: Critical/Non-Critical services & resources
Product: Red Hat Enterprise Linux 5
Component: rgmanager
Version: 5.5
Reporter: thunt
Assignee: Lon Hohberger <lhh>
QA Contact: Cluster QE <mspqa-list>
CC: cluster-maint, edamato, liko, ssaha, tdunnon
Status: CLOSED ERRATA
Severity: medium
Priority: high
Hardware: All
OS: Linux
Target Milestone: rc
Target Release: ---
Keywords: FutureFeature
Doc Type: Enhancement
Fixed In Version: rgmanager-2.0.52-6.17.el5
Clones: 634277, 637259 (view as bug list)
Bug Blocks: 637259
Last Closed: 2011-01-13 23:26:58 UTC
Attachments: Possible patch

Description thunt 2010-06-18 16:14:29 UTC
Description of problem:

RHCS defines only three recovery options for a failed process:
- Disable
- Restart, and relocate if the restart fails
- Relocate

There is no "Restart but do not relocate" option.

The use case is a configuration running multiple custom/flaky applications that share the same storage and IP address. If an individual application fails, the customer wants to attempt restart(s); but if the restart of an individual application fails, there is no point in relocating, because relocation is unlikely to fix the problem and will just disrupt the other applications running on the same box.

Note: Veritas defines this as critical vs. non-critical applications. Failure of a critical app triggers failover; failure of a non-critical app does not.

Comment 1 Perry Myers 2010-06-19 13:44:08 UTC
Lon and I discussed this a bit on IRC.  Today you could do this by restricting a service's failover domain to a single host.  If you do this, the service will be restarted if it fails on the single host that makes up its failover domain, but if that host fails, the service will not be relocated to another host outside of that domain.
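
For reference, the workaround above can be sketched in cluster.conf roughly as follows (the domain, node, and service names here are made up for illustration):

```xml
<rm>
  <failoverdomains>
    <!-- Restricted domain containing a single node: the service may
         only ever run on node1, so it is restarted in place and is
         never relocated elsewhere. -->
    <failoverdomain name="node1-only" restricted="1">
      <failoverdomainnode name="node1.example.com" priority="1"/>
    </failoverdomain>
  </failoverdomains>
  <service name="flaky-app" domain="node1-only" recovery="restart">
    <!-- member resources for the service go here -->
  </service>
</rm>
```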

So I'm not sure, but it could be that this bug can be closed as CURRENTRELEASE.

thunt, can you let us know if the above configuration suggestion would satisfy your requirements?

Comment 2 thunt 2010-07-01 17:04:34 UTC
That was one of the workarounds considered. Unfortunately, it doesn't handle the case when (say) the host fails, and the whole configuration needs to be relocated.

Incidentally, it helps to think in terms of multiple applications/services running together and using the same resources, for example all running from the same disk partition. The requirement makes a lot less sense for a single application/service since the primary intent is to avoid disrupting the other applications on the host when one of them fails.


Comment 3 Lon Hohberger 2010-07-29 20:44:07 UTC
There are two levels here:

1) per-resource:

   - if you have a service which includes a file system, a
     database and a web server, you might say that the
     web server is a non-critical resource.  This is partially 
     implemented using __independent_subtree but there is no
     facility to give up and stop restarting (or even bothering
     to check) the state of the web server.  There is also 
     currently no facility to query which individual resources 
     have failed, nor a method to restore those resources to 
     operation in coordination with rgmanager.

2) per-service:

   - simply adding a 'restart-only' recovery policy which, if
     a restart fails, aborts and marks the service as 'stopped'
     instead of relocating the service to another node.

This is because rgmanager does not have collocation dependencies outside of parent/child ordering within a service.

I suspect that the following limitation will have to apply:

 * multiply instanced resources (e.g. the same <script> being used
   multiple times, either in a single service or in multiple services)
   will not be allowed to be used as non-critical resources.  This is
   because it would be nearly impossible for an administrator to restore
   a particular instance from the command line.


 * It might be simplest to add a convalesce operation for a service:

    clusvcadm -c <service_name>  -- Convalesce (restore) service name
      to operation.  This attempts to restore any, non-critical
      resources within the given service to operation.

Comment 4 Lon Hohberger 2010-07-29 20:44:42 UTC

...any failed, non-critical resources within the given service to operation...

Comment 5 Lon Hohberger 2010-07-29 21:02:27 UTC
It might go without saying but any children of a non-critical resource are also non-critical.

Comment 6 Lon Hohberger 2010-07-29 21:09:03 UTC
Also, because of dependency ordering (children depend on parent), children of a non-critical resource will be stopped when their parent is stopped.

Comment 7 thunt 2010-07-30 03:43:08 UTC
The addition of a restart-only recovery policy addresses the non-critical requirement.

I suspect that 90% (or more) of the use cases for this requirement will be for the service at the bottom of the dependency tree, so there won't be any children to worry about. And for the cases where there are children, it's very reasonable to stop them if the parent cannot be restarted.

Comment 8 Lon Hohberger 2010-08-05 21:01:14 UTC
*** Bug 618810 has been marked as a duplicate of this bug. ***

Comment 9 Sayan Saha 2010-08-09 17:55:53 UTC
I think we should have a configurable parameter NumRetries (per resource), where NumRetries = the number of times a resource can be restarted before rgmanager gives up, stops restarting the resource, and flags its state as partially_online. By default this should be set to 1. Some customers might want to restart more than once before giving up on a resource.

Comment 10 Sayan Saha 2010-08-09 18:01:35 UTC
The other thing I have seen people do is add a "delaybetweenrestarts", so that if the first restart fails, rgmanager can try restarting with delays induced between the restart attempts. This is sometimes more effective than trying to restart resources with no delay. Again, this will also need to be configurable on a per-resource basis, with a default value of 0 seconds.

Comment 11 Lon Hohberger 2010-08-09 19:15:07 UTC
Created attachment 437683 [details]
Possible patch

This adds two additional recovery policies:

  restart-stop - after restart threshold is exceeded, place 
                 service into the 'stopped' state
  restart-disable - after restart threshold is exceeded,
                 place service into the 'disabled' state

Magic decoder ring: The 'stopped' state is temporary and rgmanager will trigger a service evaluation after the next member or service transition.  The 'disabled' state remains until either quorum is lost and regained (at which point the service is evaluated according to autostart) or the administrator re-enables it.
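
As a sketch, the new policy from the patch might look like this in cluster.conf, reusing the existing max_restarts / restart_expire_time attributes (the service name and threshold values here are illustrative assumptions):

```xml
<!-- Attempt up to 3 restarts within a 10-minute sliding window; once
     the threshold is exceeded, place the service into the 'disabled'
     state instead of relocating it to another node. -->
<service name="flaky-app" recovery="restart-disable"
         max_restarts="3" restart_expire_time="600">
  <!-- member resources for the service go here -->
</service>
```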

Comment 12 Lon Hohberger 2010-08-09 19:23:38 UTC
If this is not needed for subtrees:

  <service name="foo" >
    <script name="web" file="/etc/init.d/httpd" __critical="0" />
    <script name="oracle" />

... then it is a far smaller change - since we don't need to track and report the information back to the users.

Comment 13 thunt 2010-08-10 02:37:28 UTC
Isn't this just another recovery option? If a script is marked as non-critical, it can never fail, making all the other recovery options irrelevant.

Also, I think there is an expectation that non-critical services get the same level of control, monitoring, and notifications as other services, and I'm not sure that making "critical" an attribute of the script element delivers this.

Comment 14 Lon Hohberger 2010-09-15 17:29:19 UTC
There are three main components:

1) a restart-disable policy on the whole service which interacts
   with the existing max-restarts / restart-expire-time
2) non-critical independent subtrees: 
   - the ability to let designated resources fail
   - the ability to recover these resources
3) restart threshold policies on independent subtrees
   - the ability to define max-restarts / restart-expire-time
     on a per subtree basis
   - operation with normal independent subtrees:
     service goes into recovery when threshold is exceeded
   - operation with non-critical independent subtrees: 
     disable subtree when threshold is exceeded

I'm not sure what else would be required; I believe this satisfies all of the requirements.

Here's the actual working operational outline:


I.  Two or more references to the same resource clear the independent
    subtree/non-critical flag.  You may only use the non-critical flag
    on singly-referenced resources; you may not use it with
    multiple-instance resources.  This limitation is unlikely to change
    due to limitations in how rgmanager handles multi-reference
    resources.
II.  Non-critical flag is applied when the administrator sets
    __independent_subtree="2" in cluster.conf for a given resource link
    in the resource tree.

III. The non-critical flag works with all resources at all levels of the
    resource tree, but should not be used at the top level when defining
    services or virtual machines.
IV. Independent subtree per-node max restart thresholds.
    You can now set max restarts and restart expirations on a per-node
    basis in the resource tree.
    A. This implements a sliding window restart tolerance as is done currently
       at the service (or top) level of rgmanager's resource trees.
    B. Options
       1. __max_restarts => Maximum number of tolerated restarts prior to
          giving up
       2. __restart_expire_time => After this time, a restart is "forgotten".
    C. BOTH of IV.B.1 and IV.B.2 must be provided.
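
The per-subtree options above might be combined with the non-critical flag along these lines (a sketch only; the resource names, device paths, and threshold values are assumptions):

```xml
<service name="test">
  <fs name="data" device="/dev/vg0/data" mountpoint="/data" fstype="ext3">
    <!-- Non-critical subtree (__independent_subtree="2"): the web
         server may fail and be restarted up to 3 times within a
         5-minute sliding window; once the threshold is exceeded, the
         subtree is quiesced and the service is flagged partial [P]
         rather than being recovered as a whole. -->
    <script name="web" file="/etc/init.d/httpd"
            __independent_subtree="2"
            __max_restarts="3" __restart_expire_time="300"/>
  </fs>
</service>
```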


II.  Status failure for __independent_subtree="2" resources
    A. Failure
       1. resources are stopped and not restarted as a consequence
          a. a stop failure during non-critical recovery does not result
             in a failed service state, however, a failure during a real
             relocation/disable/restart/failover/etc. *does* result in 
             a failed service state.
          b. resource is quiesced and placed into a special internal
             state which is not part of the distributed state information
             that rgmanager maintains.
          c. See II.B.8 for information on interaction with restart tolerance
       2. status checks on the resource are disabled (nothing in logs for
          status checking or attempting to repair the service after the
          resource is stopped).
       3. clustat reporting
          a. Reports [P] (partial) in normal output next to the service
          b. reports "partial" in long mode output in the "Flags" line
          c. reports "partial" in flags_str in XML output
       4. Logs report some parts are stopped and how to fix them
          a. e.g. clusvcadm -c service:test
       5. A failure of a non-critical resource will cause all descendant
          resources to be stopped, as well as the non-critical resource
          itself.
       6. A failure of a descendant of a non-critical resource will cause
          all descendants of the non-critical resource to be stopped, as
          well as the non-critical resource itself.
    B. Recovery
       1. disable/enable clears [P] flag and restarts all service parts
       2. restart operation clears [P] flag and restarts all service parts
       3. relocate operation clears [P] flag and restarts all service parts
          just like normal
       4. convalesce operation clears [P] flag and JUST restarts the
          stopped resources
          a. failure to repair does not cause a failed state. 
          b. failure to repair returns 'failure' to clusvcadm and leaves
             the partial flag alone
       5. updating the cluster configuration does not by itself attempt
          to implicitly convalesce the service.
       6. removing __independent_subtree="2" from cluster.conf does not 
          cause rgmanager to suddenly care about the previously-failed part
          of a given service; you MUST still convalesce it to restore the
          broken parts to operation
       7. A failure to restart after a convalesce operation will normally
          cause a subsequent status check failure and subsequent stop operation
       8. If a restart tolerance is configured
           a. the independent subtree is restarted until the tolerance is
              exceeded
          b. once the tolerance is exceeded, if the subtree is non-critical,
             the subtree is stopped
          c. once the tolerance is exceeded, if the subtree is critical,
             the service is restarted
          d. Users must restart or convalesce the service in order to clear
             any existing restart counters.
           e. Restart counters are at the subtree level - any child resource
              of an independent subtree will increment the subtree's restart
              counter.
          f. A failure during a stop of any part of a subtree with a restart
             tolerance on a non-critical subtree immediately disables the
             subtree; no restart action is performed no matter how the
             restart tolerance is specified.
    C. Negative testing
       1. clusvcadm -c (convalesce) does nothing on:
          a. failed
          b. disabled
          c. stopped
          d. transitional states (starting, stopping, recovering, etc.)
III. Regression testing
    A. __independent_subtree="1" resources are correctly restarted
       without causing service to enter failed state or rest of
       service to restart.
    B. non-independent subtree resources correctly cause the entire
       service to restart
    C. Interaction with 'Z' (Frozen) flag:
       1. Freezing a partially-failed service prints [ZP] in clustat output
       2. Partial is not cleared when unfrozen
        3. Disabled parts are not automatically fixed when a service is
           unfrozen
IV.  rgmanager dump & rg_test output
    A. Shows 'NON-CRITICAL' when a resource is tagged with
       __independent_subtree="2"
    B. State outputs for individual resources
       1. rgmanager dumps (created by killing rgmanager with SIGUSR1) 
          a. S0 for a stopped resource
          b. S1 for started
          c. S2 for failed
          d. S3 for disabled (e.g. when a resource has been quiesced
             it moves from S1 -> S3).
       2. rg_test output will always show S0

Known Issues:

I.  Configuration updates
    A. Removing a broken resource from the configuration which previously
       was __independent_subtree="2"
       1. Does not clear the [P] flag at the service state level
       2. Users must issue clusvcadm -c <service> to clear this and/or
          restart the service to clear the P flag.
       3. Users must ensure that no resources remain allocated which
          would prevent the remaining service bits from stopping cleanly

Comment 15 Lon Hohberger 2010-09-24 22:33:54 UTC


Comment 18 errata-xmlrpc 2011-01-13 23:26:58 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.