Bug 245381 - [RFE] Restart counters before a switch to relocate.
Status: CLOSED ERRATA
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: rgmanager
Version: 4
Hardware: All Linux
Priority: medium
Severity: low
Assigned To: Lon Hohberger
QA Contact: Cluster QE
Reported: 2007-06-22 14:54 EDT by Charlie Wyse
Modified: 2009-04-16 16:22 EDT

Fixed In Version: RHBA-2008-0791
Doc Type: Bug Fix
Last Closed: 2008-07-25 15:15:09 EDT

Attachments
Patch. pass 1. (2.48 KB, patch)
2007-08-21 16:36 EDT, Lon Hohberger
Pass 2; adds support to the resource-agent so that it's picked up via the config (3.41 KB, patch)
2007-08-21 16:41 EDT, Lon Hohberger

Description Charlie Wyse 2007-06-22 14:54:21 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.10) Gecko/20070313 Fedora/1.5.0.10-5.fc6 Firefox/1.5.0.10

Description of problem:
Working with a customer that requires a service to be restarted locally when it fails.  If the service is restarted locally X times within a Y time frame (for example, 3 times in 1 hour), the service should then be relocated to a different node in the cluster.

Version-Release number of selected component (if applicable):


How reproducible:
Always


Steps to Reproduce:
1. The service has a problem, possibly due to hardware, and is restarted.
2. The problem happens again and the service is restarted again.
3. This creates a loop where the service is restarted over and over again but never relocated to a different machine.

Actual Results:
The service just keeps restarting and never relocates to a new server.  This could be damaging, as breaking the loop requires manual intervention.

Expected Results:
You should be able to set limits in the services tab so that, if the service is restarted X times within a Y time frame, it is relocated to a separate machine instead of being restarted again.

Additional info:
Both the X (number of restarts) and Y (time frame) values are a must: some customers might find that 3 restarts in an hour is a good limit, while others might only want 2 restarts in a 24-hour period, or possibly in a week-long period.
Comment 1 Lon Hohberger 2007-07-11 15:57:26 EDT
The restarts themselves are not tracked currently in rgmanager; that is, a
restart itself is not recorded long-term; it is handled and never worried about
again.

In order to implement a 'time-based' limit on X restarts, we would either need
to store more information in VF (such as an ancillary data block to record
restart histories), store the information locally (other nodes shouldn't care
about this information - since they're not running the service), or alter the
semantics of how parts of the rg_state_t structure are used:

typedef struct {
        char            rs_name[64];    /**< Service name */
        uint64_t        rs_owner;       /**< Member ID running service. */
        uint64_t        rs_last_owner;  /**< Last member to run the service. */
        uint32_t        rs_state;       /**< State of service. */
        uint32_t        rs_restarts;    /**< Number of cluster-induced 
                                             restarts */
        uint64_t        rs_transition;  /**< Last service transition time */
        uint32_t        rs_id;          /**< Service ID */
        uint32_t        rs_pad;         /**< pad to 64-bit boundary */
} rg_state_t;

(and utilize the rs_pad field for something...).  Basically, changing the size
of the above structure can not be done - it will break rolling upgrade to do so.
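
As a minimal sketch of how the fixed-size constraint could be guarded at build time, assuming a C11 compiler (an assumption; this is not taken from rgmanager): the structure is copied from the comment above, and RG_STATE_WIRE_SIZE is a hypothetical name for the 104 bytes that its fields add up to on common Linux ABIs.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy of the structure quoted above, for illustration only. */
typedef struct {
        char            rs_name[64];    /* Service name */
        uint64_t        rs_owner;       /* Member ID running service */
        uint64_t        rs_last_owner;  /* Last member to run the service */
        uint32_t        rs_state;       /* State of service */
        uint32_t        rs_restarts;    /* Number of cluster-induced restarts */
        uint64_t        rs_transition;  /* Last service transition time */
        uint32_t        rs_id;          /* Service ID */
        uint32_t        rs_pad;         /* pad to 64-bit boundary */
} rg_state_t;

/* Hypothetical constant: 64 + 8 + 8 + 4 + 4 + 8 + 4 + 4 bytes. */
#define RG_STATE_WIRE_SIZE 104

/* Fail the build if someone grows or reorders the structure. */
_Static_assert(sizeof(rg_state_t) == RG_STATE_WIRE_SIZE,
               "rg_state_t size changed; rolling upgrade would break");
_Static_assert(offsetof(rg_state_t, rs_id) == 96,
               "rs_id moved; existing fields must keep their offsets");
```

A check like this only protects the layout; reusing rs_pad for new data would still need agreement on its meaning between old and new nodes.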
Comment 2 Lon Hohberger 2007-07-11 15:59:08 EDT
With a node-local recording of cluster-induced restarts, it is very easy to
throttle restarts based on X in Y time.
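
A rough sketch of what such node-local throttling could look like, under stated assumptions: restart_history_t, restart_history_init() and restart_allowed() are hypothetical names invented for this illustration, and nothing here reflects the attached patches. Timestamps of recent restarts are kept in a small array; a restart is permitted only while fewer than X of them fall inside the last Y seconds.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

#define RESTART_HISTORY_MAX 16  /* hypothetical cap on tracked restarts */

typedef struct {
        time_t  stamps[RESTART_HISTORY_MAX]; /* recent restarts, oldest first */
        size_t  count;                       /* valid entries in stamps */
        size_t  max_restarts;                /* X: restarts tolerated */
        time_t  window;                      /* Y: window length in seconds */
} restart_history_t;

void restart_history_init(restart_history_t *h, size_t max_restarts,
                          time_t window)
{
        h->count = 0;
        /* Clamp X so the history array can never overflow. */
        h->max_restarts = max_restarts < RESTART_HISTORY_MAX ?
                          max_restarts : RESTART_HISTORY_MAX;
        h->window = window;
}

/*
 * Returns 1 if the service may be restarted locally at time 'now'
 * (and records that restart), or 0 if X restarts already happened
 * within the last Y seconds and the service should be relocated.
 */
int restart_allowed(restart_history_t *h, time_t now)
{
        size_t expired = 0;

        /* Drop restarts that have fallen out of the window. */
        while (expired < h->count && now - h->stamps[expired] >= h->window)
                expired++;
        if (expired > 0) {
                memmove(h->stamps, h->stamps + expired,
                        (h->count - expired) * sizeof(h->stamps[0]));
                h->count -= expired;
        }

        if (h->count >= h->max_restarts)
                return 0;       /* budget exhausted: relocate */

        h->stamps[h->count++] = now;
        return 1;               /* restart locally */
}
```

With X = 3 and Y = 3600, three failures inside the same hour are each restarted locally and a fourth inside that hour returns 0. Because the history lives only on the node running the service, it is lost when the service moves, which matches the idea that other nodes do not need this information.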
Comment 5 Lon Hohberger 2007-08-21 16:36:44 EDT
Created attachment 162009
Patch. pass 1.
Comment 6 Lon Hohberger 2007-08-21 16:41:18 EDT
Created attachment 162010
Pass 2; adds support to the resource-agent so that it's picked up via the config

Note: does not do time-based throttling; only a hard limit.
Comment 7 Lon Hohberger 2007-08-21 16:43:18 EDT
The rs_id and rs_pad fields are not used by rgmanager.  We could use these
fields as a "first-start" time.
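
One way the two unused 32-bit fields might carry a 64-bit "first-start" time without changing the structure size is sketched below. This is only an illustration: rs_set_first_start() and rs_get_first_start() are hypothetical helpers, the high/low split is an arbitrary choice, and byte-order handling between nodes is not addressed.

```c
#include <stdint.h>

/* Copy of the structure quoted earlier, for illustration only. */
typedef struct {
        char            rs_name[64];
        uint64_t        rs_owner;
        uint64_t        rs_last_owner;
        uint32_t        rs_state;
        uint32_t        rs_restarts;
        uint64_t        rs_transition;
        uint32_t        rs_id;          /* reused: high 32 bits of time */
        uint32_t        rs_pad;         /* reused: low 32 bits of time */
} rg_state_t;

/* Hypothetical helper: store a 64-bit first-start time in rs_id/rs_pad. */
void rs_set_first_start(rg_state_t *rs, uint64_t t)
{
        rs->rs_id  = (uint32_t)(t >> 32);
        rs->rs_pad = (uint32_t)(t & 0xffffffffu);
}

/* Hypothetical helper: rebuild the 64-bit time from the two halves. */
uint64_t rs_get_first_start(const rg_state_t *rs)
{
        return ((uint64_t)rs->rs_id << 32) | rs->rs_pad;
}
```

Combined with rs_restarts, a stored first-start time would allow a check of the form "more than X restarts since first-start, and first-start is less than Y seconds ago" without any extra state.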
Comment 8 Lon Hohberger 2007-08-21 16:44:17 EDT
(It's not even endian-swapped in reslist.h)
Comment 11 Lon Hohberger 2008-04-15 11:07:16 EDT
Pushed to RHEL4 git branch
Comment 14 errata-xmlrpc 2008-07-25 15:15:09 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0791.html
