Bug 245381

Summary: [RFE] Restart counters before a switch to relocate.
Product: [Retired] Red Hat Cluster Suite Reporter: Charlie Wyse <cwyse>
Component: rgmanager Assignee: Lon Hohberger <lhh>
Status: CLOSED ERRATA QA Contact: Cluster QE <mspqa-list>
Severity: low Docs Contact:
Priority: medium    
Version: 4 CC: cluster-maint
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: RHBA-2008-0791 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2008-07-25 19:15:09 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description: Patch. pass 1. Flags: none
Description: Pass 2; adds support to the resource-agent so that it's picked up via the config Flags: none

Description Charlie Wyse 2007-06-22 18:54:21 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.10) Gecko/20070313 Fedora/1.5.0.10-5.fc6 Firefox/1.5.0.10

Description of problem:
Working with a customer who requires services to restart locally.  If a service is restarted locally X times within time frame Y (for example, 3 times in 1 hour), the service should then be relocated to a different node in the cluster.

Version-Release number of selected component (if applicable):


How reproducible:
Always


Steps to Reproduce:
1. Service has a problem, possibly due to hardware, and is restarted.
2. The problem happens again and the service is restarted again.
3. This creates a loop where the service is restarted over and over again but never relocated to a different machine.

Actual Results:
The service just keeps restarting and never relocates to a new server.  This could be damaging, as it requires manual intervention.

Expected Results:
You should be able to set values in the services tab specifying that if the service is restarted X times within time frame Y, it is relocated to a separate machine.

Additional info:
Both X (the number of restarts) and Y (the time frame) must be configurable: some customers might find that 3 restarts in an hour is a good limit, while others might want only 2 restarts in a 24-hour period, or possibly in a week-long period.

Comment 1 Lon Hohberger 2007-07-11 19:57:26 UTC
The restarts themselves are not tracked currently in rgmanager; that is, a
restart itself is not recorded long-term; it is handled and never worried about
again.

In order to implement a 'time-based' limit on X restarts, we would either need
to store more information in VF (such as an ancillary data block to record
restart histories), store the information locally (other nodes shouldn't care
about this information - since they're not running the service), or alter the
semantics of how parts of the rg_state_t structure are used:

typedef struct {
        char            rs_name[64];    /**< Service name */
        uint64_t        rs_owner;       /**< Member ID running service. */
        uint64_t        rs_last_owner;  /**< Last member to run the service. */
        uint32_t        rs_state;       /**< State of service. */
        uint32_t        rs_restarts;    /**< Number of cluster-induced 
                                             restarts */
        uint64_t        rs_transition;  /**< Last service transition time */
        uint32_t        rs_id;          /**< Service ID */
        uint32_t        rs_pad;         /**< pad to 64-bit boundary */
} rg_state_t;

(and utilize the rs_pad field for something...).  Basically, the size of the
above structure cannot be changed - doing so would break rolling upgrade.
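
The size constraint above can be made explicit at compile time. The following
is only a sketch, not part of any attached patch: it copies the structure
from this comment and assumes a common ABI, on which the fields pack to 104
bytes with rs_id at offset 96. The asserted numbers are derived from the field
widths shown here, not taken from the rgmanager sources.

```c
#include <stddef.h>
#include <stdint.h>

/* Structure as quoted in this comment. */
typedef struct {
        char            rs_name[64];    /**< Service name */
        uint64_t        rs_owner;       /**< Member ID running service. */
        uint64_t        rs_last_owner;  /**< Last member to run the service. */
        uint32_t        rs_state;       /**< State of service. */
        uint32_t        rs_restarts;    /**< Number of cluster-induced
                                             restarts */
        uint64_t        rs_transition;  /**< Last service transition time */
        uint32_t        rs_id;          /**< Service ID */
        uint32_t        rs_pad;         /**< pad to 64-bit boundary */
} rg_state_t;

/* Assumption: 64 + 8 + 8 + 4 + 4 + 8 + 4 + 4 = 104 bytes, no hidden padding.
 * The build fails if a change alters the size or moves the trailing fields. */
_Static_assert(sizeof(rg_state_t) == 104,
               "rg_state_t size changed; rolling upgrade would break");
_Static_assert(offsetof(rg_state_t, rs_id) == 96,
               "rs_id moved; layout is no longer compatible");
_Static_assert(offsetof(rg_state_t, rs_pad) == 100,
               "rs_pad moved; layout is no longer compatible");
```

A check like this lets the meaning of rs_pad change while guarding against an
accidental change to the structure's size or layout.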

Comment 2 Lon Hohberger 2007-07-11 19:59:08 UTC
With a node-local recording of cluster-induced restarts, it is very easy to
throttle restarts based on X in Y time.
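
A minimal sketch of such a node-local throttle follows. It is illustrative
only and is not the code from the attached patches: restart_hist_t,
restart_hist_init, restart_hist_check and MAX_TRACKED_RESTARTS are
hypothetical names. It keeps the timestamps of recent restarts, discards those
older than Y seconds, and reports whether another local restart is allowed or
the service should be relocated.

```c
#include <stddef.h>
#include <time.h>

#define MAX_TRACKED_RESTARTS 16 /* hypothetical upper bound on X */

/* Hypothetical node-local restart history; not an rgmanager structure. */
typedef struct {
        time_t  rh_times[MAX_TRACKED_RESTARTS]; /* restart times, oldest first */
        size_t  rh_count;                       /* entries currently in use */
        size_t  rh_max;                         /* X: restarts allowed ... */
        time_t  rh_window;                      /* Y: ... within this many seconds */
} restart_hist_t;

void restart_hist_init(restart_hist_t *h, size_t max, time_t window)
{
        h->rh_count = 0;
        h->rh_max = (max > MAX_TRACKED_RESTARTS) ? MAX_TRACKED_RESTARTS : max;
        h->rh_window = window;
}

/*
 * Returns 1 if the service may be restarted locally at time 'now' (and
 * records that restart), or 0 if X restarts have already happened within
 * the last Y seconds and the service should be relocated instead.
 */
int restart_hist_check(restart_hist_t *h, time_t now)
{
        size_t i, kept = 0;

        /* Drop restarts that have aged out of the window. */
        for (i = 0; i < h->rh_count; i++) {
                if (now - h->rh_times[i] < h->rh_window)
                        h->rh_times[kept++] = h->rh_times[i];
        }
        h->rh_count = kept;

        if (h->rh_count >= h->rh_max)
                return 0;       /* limit reached: relocate */

        h->rh_times[h->rh_count++] = now;
        return 1;
}
```

With X = 3 and Y = 3600, three restarts in quick succession are allowed, a
fourth within the hour returns 0, and restarts are allowed again once the
oldest entries are more than an hour old. Because the history lives only on
the node running the service, the state shared between nodes is unaffected.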

Comment 5 Lon Hohberger 2007-08-21 20:36:44 UTC
Created attachment 162009 [details]
Patch. pass 1.

Comment 6 Lon Hohberger 2007-08-21 20:41:18 UTC
Created attachment 162010 [details]
Pass 2; adds support to the resource-agent so that it's picked up via the config

Note: does not do time-based throttling; only a hard limit.

Comment 7 Lon Hohberger 2007-08-21 20:43:18 UTC
The rs_id and rs_pad fields are not used by rgmanager.  We could use these
fields as a "first-start" time.
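
One way this could look, as a sketch only: the two 32-bit fields are treated
as the halves of a 64-bit timestamp, so the structure's size and layout stay
unchanged. The helper names rs_set_first_start and rs_get_first_start are
hypothetical, and the choice of rs_id for the high half and rs_pad for the low
half is an assumption, not something taken from the patches.

```c
#include <stdint.h>

/* Structure as quoted in comment 1; layout is left untouched. */
typedef struct {
        char            rs_name[64];    /**< Service name */
        uint64_t        rs_owner;       /**< Member ID running service. */
        uint64_t        rs_last_owner;  /**< Last member to run the service. */
        uint32_t        rs_state;       /**< State of service. */
        uint32_t        rs_restarts;    /**< Number of cluster-induced
                                             restarts */
        uint64_t        rs_transition;  /**< Last service transition time */
        uint32_t        rs_id;          /**< Service ID */
        uint32_t        rs_pad;         /**< pad to 64-bit boundary */
} rg_state_t;

/* Hypothetical helper: store a 64-bit first-start time in rs_id/rs_pad.
 * Assumption: rs_id holds the high 32 bits, rs_pad the low 32 bits. */
void rs_set_first_start(rg_state_t *rs, uint64_t t)
{
        rs->rs_id  = (uint32_t)(t >> 32);
        rs->rs_pad = (uint32_t)(t & 0xffffffffu);
}

/* Hypothetical helper: reassemble the first-start time from the two fields. */
uint64_t rs_get_first_start(const rg_state_t *rs)
{
        return ((uint64_t)rs->rs_id << 32) | (uint64_t)rs->rs_pad;
}
```

Splitting the value by shifts rather than casting the address of rs_id to a
uint64_t pointer avoids relying on the alignment of the two fields or on host
byte order within the pair.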

Comment 8 Lon Hohberger 2007-08-21 20:44:17 UTC
(It's not even endian-swapped in reslist.h)

Comment 11 Lon Hohberger 2008-04-15 15:07:16 UTC
Pushed to RHEL4 git branch

Comment 14 errata-xmlrpc 2008-07-25 19:15:09 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0791.html