Bug 245381 - [RFE] Restart counters before a switch to relocate.
Summary: [RFE] Restart counters before a switch to relocate.
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: rgmanager   
Version: 4
Hardware: All
OS: Linux
Target Milestone: ---
Assignee: Lon Hohberger
QA Contact: Cluster QE
Depends On:
Reported: 2007-06-22 18:54 UTC by Charlie Wyse
Modified: 2009-04-16 20:22 UTC (History)

Fixed In Version: RHBA-2008-0791
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2008-07-25 19:15:09 UTC

Attachments
Patch. pass 1. (2.48 KB, patch)
2007-08-21 20:36 UTC, Lon Hohberger
Pass 2; adds support to the resource-agent so that it's picked up via the config (3.41 KB, patch)
2007-08-21 20:41 UTC, Lon Hohberger

External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2008:0791 normal SHIPPED_LIVE rgmanager bug fix and enhancement update 2008-07-25 19:14:58 UTC

Description Charlie Wyse 2007-06-22 18:54:21 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv: Gecko/20070313 Fedora/ Firefox/

Description of problem:
Working with a customer who requires a service to restart locally.  If the service is restarted locally X times within time frame Y (for example, 3 times in 1 hour), the service should then be relocated to a different node in the cluster.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Service has a problem, possibly due to hardware, and is restarted.
2. The problem happens again and the service is restarted again.
3. This creates a loop where the service is restarted over and over again but never relocated to a different machine.

Actual Results:
The service just keeps restarting and never relocates to a new server.  This could be damaging, as it requires manual intervention.

Expected Results:
You should be able to set values in the services tab specifying that if the service is restarted X times within time frame Y, the service is relocated to a separate machine.

Additional info:
The X (number of restarts) and Y (time frame) values are a must, as some customers might find that 3 restarts in an hour is a good limit, while others might only want 2 restarts in a 24-hour period, or possibly a week-long period.

Comment 1 Lon Hohberger 2007-07-11 19:57:26 UTC
The restarts themselves are not tracked currently in rgmanager; that is, a
restart itself is not recorded long-term; it is handled and never worried about.

In order to implement a 'time-based' limit on X restarts, we would either need
to store more information in VF (such as an ancillary data block to record
restart histories), store the information locally (other nodes shouldn't care
about this information - since they're not running the service), or alter the
semantics of how parts of the rg_state_t structure are used:

typedef struct {
        char            rs_name[64];    /**< Service name */
        uint64_t        rs_owner;       /**< Member ID running service. */
        uint64_t        rs_last_owner;  /**< Last member to run the service. */
        uint32_t        rs_state;       /**< State of service. */
        uint32_t        rs_restarts;    /**< Number of cluster-induced 
                                             restarts */
        uint64_t        rs_transition;  /**< Last service transition time */
        uint32_t        rs_id;          /**< Service ID */
        uint32_t        rs_pad;         /**< pad to 64-bit boundary */
} rg_state_t;

(and utilize the rs_pad field for something...).  Basically, changing the size
of the above structure cannot be done - doing so would break rolling upgrade.
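
To illustrate the size constraint, here is a minimal sketch that repeats the structure from this comment and pins its size at build time. The expected size of 104 bytes is computed from the field widths shown above, and the compile-time checks are an assumption for illustration; nothing here is claimed to exist in the rgmanager source.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy of the structure quoted in this comment. */
typedef struct {
        char            rs_name[64];    /* Service name */
        uint64_t        rs_owner;       /* Member ID running service */
        uint64_t        rs_last_owner;  /* Last member to run the service */
        uint32_t        rs_state;       /* State of service */
        uint32_t        rs_restarts;    /* Number of cluster-induced restarts */
        uint64_t        rs_transition;  /* Last service transition time */
        uint32_t        rs_id;          /* Service ID */
        uint32_t        rs_pad;         /* Pad to 64-bit boundary */
} rg_state_t;

/* 64 + 8 + 8 + 4 + 4 + 8 + 4 + 4 = 104 bytes, with no implicit padding. */
_Static_assert(sizeof(rg_state_t) == 104,
               "rg_state_t size changed; old and new nodes would disagree");

/* rs_id and rs_pad are adjacent and together fill one aligned 64-bit slot. */
_Static_assert(offsetof(rg_state_t, rs_id) == 96, "unexpected rs_id offset");
_Static_assert(offsetof(rg_state_t, rs_pad) == 100, "unexpected rs_pad offset");
```

Any new information therefore has to fit into fields that already exist, which is why rs_pad is the candidate mentioned here.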

Comment 2 Lon Hohberger 2007-07-11 19:59:08 UTC
With a node-local recording of cluster-induced restarts, it is very easy to
throttle restarts based on X in Y time.
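
A minimal sketch of such a node-local throttle, assuming a small fixed-size history of restart timestamps. The type restart_history_t, the MAX_TRACKED cap, and both helper functions are hypothetical names invented for this example; this is not the code from the attached patches.

```c
#include <stddef.h>
#include <time.h>

#define MAX_TRACKED 16  /* hypothetical cap on remembered restarts */

/* Hypothetical node-local restart history; not an rgmanager structure. */
typedef struct {
        time_t stamps[MAX_TRACKED]; /* times of recent restarts, oldest first */
        size_t count;               /* number of valid entries in stamps */
        size_t max_restarts;        /* X: local restarts tolerated in the window */
        time_t window;              /* Y: window length in seconds */
} restart_history_t;

/* Drop entries that have aged out of the window. */
static void history_expire(restart_history_t *h, time_t now)
{
        size_t keep = 0;

        for (size_t i = 0; i < h->count; i++) {
                if (now - h->stamps[i] < h->window)
                        h->stamps[keep++] = h->stamps[i];
        }
        h->count = keep;
}

/*
 * Called when the service fails at time 'now'.
 * Returns 0 if a local restart is still allowed (and records it),
 * or 1 if X restarts already happened within Y and the service
 * should be relocated instead.
 */
static int history_record_restart(restart_history_t *h, time_t now)
{
        history_expire(h, now);
        if (h->count >= h->max_restarts || h->count >= MAX_TRACKED) {
                h->count = 0;   /* the history is no longer needed on this node */
                return 1;
        }
        h->stamps[h->count++] = now;
        return 0;
}
```

With max_restarts set to 3 and window set to 3600, the first three failures inside an hour are restarted locally and the fourth triggers relocation; failures spaced more than an hour apart never accumulate.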

Comment 5 Lon Hohberger 2007-08-21 20:36:44 UTC
Created attachment 162009
Patch. pass 1.

Comment 6 Lon Hohberger 2007-08-21 20:41:18 UTC
Created attachment 162010
Pass 2; adds support to the resource-agent so that it's picked up via the config

Note: does not do time-based throttling; only a hard limit.

Comment 7 Lon Hohberger 2007-08-21 20:43:18 UTC
The rs_id and rs_pad fields are not used by rgmanager.  We could use these
fields as a "first-start" time.

Comment 8 Lon Hohberger 2007-08-21 20:44:17 UTC
(It's not even endian-swapped in reslist.h)
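
One way the two unused 32-bit fields could carry a 64-bit "first-start" time is sketched below. The rg_state_tail_t type and both helpers are hypothetical and only mirror the last two fields of rg_state_t; how the eventual fix stores the value is not stated in this report, and byte-order handling for the wire format is left out.

```c
#include <stdint.h>

/* Hypothetical stand-in for the last two fields of rg_state_t. */
typedef struct {
        uint32_t rs_id;   /* unused by rgmanager; holds the high 32 bits here */
        uint32_t rs_pad;  /* unused padding; holds the low 32 bits here */
} rg_state_tail_t;

/* Store a 64-bit first-start time across the two 32-bit fields. */
static void set_first_start(rg_state_tail_t *t, uint64_t when)
{
        t->rs_id  = (uint32_t)(when >> 32);
        t->rs_pad = (uint32_t)(when & 0xffffffffu);
}

/* Reassemble the 64-bit first-start time from the two fields. */
static uint64_t get_first_start(const rg_state_tail_t *t)
{
        return ((uint64_t)t->rs_id << 32) | (uint64_t)t->rs_pad;
}
```

Because the fields keep their size and position, the structure layout seen by older nodes is unchanged.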

Comment 11 Lon Hohberger 2008-04-15 15:07:16 UTC
Pushed to RHEL4 git branch

Comment 14 errata-xmlrpc 2008-07-25 19:15:09 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

