Bug 245381 - [RFE] Restart counters before a switch to relocate.
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: rgmanager
Hardware: All
OS: Linux
Priority: medium
Severity: low
Assigned To: Lon Hohberger
QA Contact: Cluster QE
Reported: 2007-06-22 14:54 EDT by Charlie Wyse
Modified: 2009-04-16 16:22 EDT
Fixed In Version: RHBA-2008-0791
Doc Type: Bug Fix
Last Closed: 2008-07-25 15:15:09 EDT

Attachments
Patch. pass 1. (2.48 KB, patch)
2007-08-21 16:36 EDT, Lon Hohberger
Pass 2; adds support to the resource-agent so that it's picked up via the config (3.41 KB, patch)
2007-08-21 16:41 EDT, Lon Hohberger

Description Charlie Wyse 2007-06-22 14:54:21 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv: Gecko/20070313 Fedora/ Firefox/

Description of problem:
I am working with a customer who requires a service to restart locally.  If the service is restarted locally X times within a time frame Y (for example, 3 times in 1 hour), the service should then be relocated to a different node in the cluster.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Service has a problem, possibly due to hardware, and is restarted.
2. The problem happens again and the service is restarted again.
3. This creates a loop where the service is restarted over and over but never relocated to a different machine.

Actual Results:
The service just keeps restarting and never relocates to a new server.  This could be damaging, as it requires manual intervention.

Expected Results:
You should be able to set values in the services tab specifying that if the service is restarted X times within time frame Y, the service is relocated to a separate machine.

Additional info:
Both the X (restart count) and Y (time frame) values must be configurable: some customers might find that 3 restarts in an hour is a good limit, while others might want only 2 restarts in a 24-hour period, or possibly a week-long period.
Comment 1 Lon Hohberger 2007-07-11 15:57:26 EDT
The restarts themselves are not tracked currently in rgmanager; that is, a
restart itself is not recorded long-term; it is handled and then forgotten.

In order to implement a 'time-based' limit on X restarts, we would either need
to store more information in VF (such as an ancillary data block to record
restart histories), store the information locally (other nodes shouldn't care
about this information - since they're not running the service), or alter the
semantics of how parts of the rg_state_t structure are used:

typedef struct {
        char            rs_name[64];    /**< Service name */
        uint64_t        rs_owner;       /**< Member ID running service. */
        uint64_t        rs_last_owner;  /**< Last member to run the service. */
        uint32_t        rs_state;       /**< State of service. */
        uint32_t        rs_restarts;    /**< Number of cluster-induced 
                                             restarts */
        uint64_t        rs_transition;  /**< Last service transition time */
        uint32_t        rs_id;          /**< Service ID */
        uint32_t        rs_pad;         /**< pad to 64-bit boundary */
} rg_state_t;

(and utilize the rs_pad field for something...).  Basically, the size of the
above structure cannot be changed; doing so would break rolling upgrades.
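
As an illustration of the size constraint above, a compile-time check along these lines could catch an accidental layout change.  This is only a sketch: it assumes a C11 compiler and the usual natural alignment (no implicit padding between the listed fields), and it is not part of the attached patches.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */
#include <stdint.h>

/* Copy of the structure quoted above, repeated so the sketch stands alone. */
typedef struct {
        char            rs_name[64];    /* Service name */
        uint64_t        rs_owner;       /* Member ID running service. */
        uint64_t        rs_last_owner;  /* Last member to run the service. */
        uint32_t        rs_state;       /* State of service. */
        uint32_t        rs_restarts;    /* Number of cluster-induced restarts */
        uint64_t        rs_transition;  /* Last service transition time */
        uint32_t        rs_id;          /* Service ID */
        uint32_t        rs_pad;         /* pad to 64-bit boundary */
} rg_state_t;

/* 64 + 8 + 8 + 4 + 4 + 8 + 4 + 4 = 104 bytes under the stated assumption. */
static_assert(sizeof(rg_state_t) == 104,
              "rg_state_t size changed; rolling upgrade would break");
static_assert(offsetof(rg_state_t, rs_id) == 96,
              "rs_id moved; on-wire layout changed");
static_assert(offsetof(rg_state_t, rs_pad) == 100,
              "rs_pad moved; on-wire layout changed");
```

Any reuse of rs_pad (or rs_id) that keeps both field widths unchanged would still pass these checks, which is the property a rolling upgrade depends on.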
Comment 2 Lon Hohberger 2007-07-11 15:59:08 EDT
With a node-local recording of cluster-induced restarts, it is very easy to
throttle restarts based on X in Y time.
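
A rough sketch of what such a node-local record might look like follows.  All names here (restart_history_t, restart_history_init, restart_history_record, RESTART_HISTORY_MAX) are hypothetical and are not taken from rgmanager or the attached patches; the sketch only illustrates the "X restarts in Y seconds" policy from the description.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

#define RESTART_HISTORY_MAX 16  /* hypothetical cap on tracked restarts */

/* Node-local restart history for one service (hypothetical type). */
typedef struct {
        time_t  stamps[RESTART_HISTORY_MAX]; /* restart times, oldest first */
        size_t  count;                       /* valid entries in stamps[] */
        size_t  max_restarts;                /* X: local restarts allowed */
        time_t  window;                      /* Y: time frame in seconds */
} restart_history_t;

static void restart_history_init(restart_history_t *h, size_t max_restarts,
                                 time_t window)
{
        assert(h != NULL);
        assert(max_restarts > 0 && max_restarts <= RESTART_HISTORY_MAX);
        h->count = 0;
        h->max_restarts = max_restarts;
        h->window = window;
}

/*
 * Called when the service fails at time 'now'.
 * Returns 0 if a local restart is still allowed (and records it),
 * or 1 if X restarts already happened within the last Y seconds,
 * meaning the service should be relocated instead.
 */
static int restart_history_record(restart_history_t *h, time_t now)
{
        size_t i, kept = 0;

        assert(h != NULL);

        /* Forget restarts that have aged out of the window. */
        for (i = 0; i < h->count; i++) {
                if (now - h->stamps[i] < h->window)
                        h->stamps[kept++] = h->stamps[i];
        }
        h->count = kept;

        if (h->count >= h->max_restarts)
                return 1;       /* limit reached: relocate */

        h->stamps[h->count++] = now;
        return 0;               /* restart locally */
}
```

With X = 3 and Y = 3600, three failures inside an hour are restarted locally and a fourth inside the same hour returns 1.  Because the history lives only on the node running the service, nothing in rg_state_t has to change.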
Comment 5 Lon Hohberger 2007-08-21 16:36:44 EDT
Created attachment 162009 [details]
Patch. pass 1.
Comment 6 Lon Hohberger 2007-08-21 16:41:18 EDT
Created attachment 162010 [details]
Pass 2; adds support to the resource-agent so that it's picked up via the config

Note: does not do time-based throttling; only a hard limit.
Comment 7 Lon Hohberger 2007-08-21 16:43:18 EDT
The rs_id and rs_pad fields are not used by rgmanager.  We could use these
fields as a "first-start" time.
Comment 8 Lon Hohberger 2007-08-21 16:44:17 EDT
(It's not even endian-swapped in reslist.h)
Comment 11 Lon Hohberger 2008-04-15 11:07:16 EDT
Pushed to RHEL4 git branch
Comment 14 errata-xmlrpc 2008-07-25 15:15:09 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

