245381 – [RFE] Restart counters before a switch to relocate.

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Cluster Suite
Classification:	Retired
Component:	rgmanager
Sub Component:
Version:	4
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	low
Target Milestone:	---
Assignee:	Lon Hohberger
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2007-06-22 18:54 UTC by Charlie Wyse
Modified:	2009-04-16 20:22 UTC
CC List:	1 user
Fixed In Version:	RHBA-2008-0791
Clone Of:
Environment:
Last Closed:	2008-07-25 19:15:09 UTC
Embargoed:

Attachments
Patch. pass 1. (2.48 KB, patch) 2007-08-21 20:36 UTC, Lon Hohberger
Pass 2; adds support to the resource-agent so that it's picked up via the config (3.41 KB, patch) 2007-08-21 20:41 UTC, Lon Hohberger

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2008:0791	0	normal	SHIPPED_LIVE	rgmanager bug fix and enhancement update	2008-07-25 19:14:58 UTC

Description Charlie Wyse 2007-06-22 18:54:21 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.10) Gecko/20070313 Fedora/1.5.0.10-5.fc6 Firefox/1.5.0.10

Description of problem:
Working with a customer who requires a service to be restarted locally first.  If the service is restarted locally X times within time frame Y (for example, 3 times in 1 hour), the service should then be relocated to a different node in the cluster.

Version-Release number of selected component (if applicable):


How reproducible:
Always


Steps to Reproduce:
1. Service has a problem, possibly due to hardware, and is restarted.
2. The problem happens again and the service is restarted again.
3. This creates a loop where the service is restarted over and over again but never relocated to a different machine.

Actual Results:
The service just keeps restarting and never relocates to a new server.  This could be damaging, as it requires manual intervention.

Expected Results:
You should be able to set values in the services tab specifying that if the service is restarted X times within time frame Y, the service is relocated to a separate machine.

Additional info:
The X (number of restarts) and Y (time frame) values are a must, as some customers might find that 3 restarts in an hour is a good limit, while others might want only 2 restarts in a 24-hour period, or possibly a week-long period.

Comment 1 Lon Hohberger 2007-07-11 19:57:26 UTC

The restarts themselves are not tracked currently in rgmanager; that is, a
restart itself is not recorded long-term; it is handled and never worried about
again.

In order to implement a 'time-based' limit on X restarts, we would either need
to store more information in VF (such as an ancillary data block to record
restart histories), store the information locally (other nodes shouldn't care
about this information - since they're not running the service), or alter the
semantics of how parts of the rg_state_t structure are used:

typedef struct {
        char            rs_name[64];    /**< Service name */
        uint64_t        rs_owner;       /**< Member ID running service. */
        uint64_t        rs_last_owner;  /**< Last member to run the service. */
        uint32_t        rs_state;       /**< State of service. */
        uint32_t        rs_restarts;    /**< Number of cluster-induced 
                                             restarts */
        uint64_t        rs_transition;  /**< Last service transition time */
        uint32_t        rs_id;          /**< Service ID */
        uint32_t        rs_pad;         /**< pad to 64-bit boundary */
} rg_state_t;

(and utilize the rs_pad field for something...).  Basically, changing the size
of the above structure cannot be done; doing so would break rolling upgrades.

Comment 2 Lon Hohberger 2007-07-11 19:59:08 UTC

With a node-local recording of cluster-induced restarts, it is very easy to
throttle restarts based on X in Y time.
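
As a rough illustration of the node-local approach, here is a minimal sketch of an "X restarts in Y seconds" throttle that uses a small ring buffer of restart timestamps. It is not taken from the attached patches; restart_history_t, restart_history_init(), restart_history_should_relocate(), and MAX_TRACKED_RESTARTS are hypothetical names, and the policy (relocate on the first failure after X restarts inside the window, then clear the history) is an assumption.

```c
#include <stddef.h>
#include <time.h>

#define MAX_TRACKED_RESTARTS 16 /* hypothetical cap on remembered restarts */

/* Hypothetical node-local restart history; not an rgmanager structure. */
typedef struct {
        time_t  rh_times[MAX_TRACKED_RESTARTS]; /* ring buffer of restart times */
        size_t  rh_head;                        /* index of the oldest entry */
        size_t  rh_count;                       /* entries currently stored */
        size_t  rh_max_restarts;                /* X: restarts tolerated */
        time_t  rh_window;                      /* Y: window length in seconds */
} restart_history_t;

void restart_history_init(restart_history_t *rh, size_t max_restarts,
                          time_t window)
{
        rh->rh_head = 0;
        rh->rh_count = 0;
        /* Clamp X so the ring buffer can never overflow. */
        rh->rh_max_restarts = max_restarts > MAX_TRACKED_RESTARTS ?
                              MAX_TRACKED_RESTARTS : max_restarts;
        rh->rh_window = window;
}

/* Forget restarts that happened at least one full window ago. */
static void restart_history_expire(restart_history_t *rh, time_t now)
{
        while (rh->rh_count > 0 &&
               now - rh->rh_times[rh->rh_head] >= rh->rh_window) {
                rh->rh_head = (rh->rh_head + 1) % MAX_TRACKED_RESTARTS;
                rh->rh_count--;
        }
}

/*
 * Called when the service fails at time `now`.
 * Returns 0 if a local restart is still allowed (and records it),
 * or 1 if X restarts already happened inside the window, in which
 * case the history is cleared and the caller should relocate.
 */
int restart_history_should_relocate(restart_history_t *rh, time_t now)
{
        size_t tail;

        restart_history_expire(rh, now);
        if (rh->rh_count >= rh->rh_max_restarts) {
                rh->rh_head = 0;
                rh->rh_count = 0;
                return 1;
        }
        tail = (rh->rh_head + rh->rh_count) % MAX_TRACKED_RESTARTS;
        rh->rh_times[tail] = now;
        rh->rh_count++;
        return 0;
}
```

With X = 3 and Y = 3600, the first three failures inside an hour are restarted locally and the fourth triggers a relocation. Because the history lives only on the node running the service, nothing in the shared rg_state_t has to change.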

Comment 5 Lon Hohberger 2007-08-21 20:36:44 UTC

Created attachment 162009
Patch. pass 1.

Comment 6 Lon Hohberger 2007-08-21 20:41:18 UTC

Created attachment 162010
Pass 2; adds support to the resource-agent so that it's picked up via the config

Note: does not do time-based throttling; only a hard limit.
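
For contrast with a time-based throttle, a hard limit only needs a counter. The sketch below is not the attached patch; restart_limit_t and restart_limit_exceeded() are hypothetical names, and treating a limit of 0 as "unlimited" is an assumption made for illustration.

```c
#include <stdint.h>

/* Hypothetical hard-limit tracker; not an rgmanager structure. */
typedef struct {
        uint32_t rl_restarts;     /* local restarts performed so far */
        uint32_t rl_max_restarts; /* hard limit; 0 = unlimited (assumption) */
} restart_limit_t;

/*
 * Called when the service fails.
 * Returns 0 if another local restart is allowed (and counts it),
 * or 1 if the limit was already reached; the counter is then reset
 * and the caller should relocate the service.
 */
int restart_limit_exceeded(restart_limit_t *rl)
{
        if (rl->rl_max_restarts == 0)
                return 0;
        if (rl->rl_restarts >= rl->rl_max_restarts) {
                rl->rl_restarts = 0;
                return 1;
        }
        rl->rl_restarts++;
        return 0;
}
```

Unlike a windowed throttle, this counter never ages out old restarts, so two failures a month apart count the same as two failures a minute apart.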

Comment 7 Lon Hohberger 2007-08-21 20:43:18 UTC

The rs_id and rs_pad fields are not used by rgmanager.  We could use these
fields as a "first-start" time.
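
A quick way to sanity-check this idea is to compare the layout quoted in comment 1 with a variant where the two 32-bit fields are merged into one 64-bit timestamp. This is only a sketch: rg_state_alt_t and rs_first_start are hypothetical names, and the checks rely on C11 _Static_assert.

```c
#include <stddef.h>
#include <stdint.h>

/* Layout as quoted in comment 1. */
typedef struct {
        char            rs_name[64];
        uint64_t        rs_owner;
        uint64_t        rs_last_owner;
        uint32_t        rs_state;
        uint32_t        rs_restarts;
        uint64_t        rs_transition;
        uint32_t        rs_id;
        uint32_t        rs_pad;
} rg_state_t;

/* Hypothetical variant: rs_id + rs_pad reused as one 64-bit time. */
typedef struct {
        char            rs_name[64];
        uint64_t        rs_owner;
        uint64_t        rs_last_owner;
        uint32_t        rs_state;
        uint32_t        rs_restarts;
        uint64_t        rs_transition;
        uint64_t        rs_first_start; /* illustrative name only */
} rg_state_alt_t;

/* The structure size must not change, or rolling upgrades break. */
_Static_assert(sizeof(rg_state_alt_t) == sizeof(rg_state_t),
               "reusing rs_id/rs_pad must not change the structure size");

/* The new field must start exactly where rs_id used to start. */
_Static_assert(offsetof(rg_state_alt_t, rs_first_start) ==
               offsetof(rg_state_t, rs_id),
               "rs_first_start must overlay rs_id and rs_pad");
```

If both assertions compile, the variant occupies the same bytes as the original. A new 64-bit field would presumably also need the same byte-order handling as the other 64-bit members when sent between nodes.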

Comment 8 Lon Hohberger 2007-08-21 20:44:17 UTC

(It's not even endian-swapped in reslist.h)

Comment 11 Lon Hohberger 2008-04-15 15:07:16 UTC

Pushed to RHEL4 git branch

Comment 14 errata-xmlrpc 2008-07-25 19:15:09 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0791.html
