Bug 245381

Summary:

[RFE] Restart counters before a switch to relocate.

Product:

[Retired] Red Hat Cluster Suite

Reporter:

Charlie Wyse <cwyse>

Component:

rgmanager

Assignee:

Lon Hohberger <lhh>

Status:

CLOSED ERRATA

QA Contact:

Cluster QE <mspqa-list>

Severity:

low

Docs Contact:

Priority:

medium

Version:

CC:

cluster-maint

Target Milestone:

---

Target Release:

---

Hardware:

All

OS:

Linux

Whiteboard:

Fixed In Version:

RHBA-2008-0791

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2008-07-25 19:15:09 UTC

Attachments:

Description	Flags
Patch. pass 1.	none
Pass 2; adds support to the resource-agent so that it's picked up via the config	none

Description Charlie Wyse 2007-06-22 18:54:21 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.10) Gecko/20070313 Fedora/1.5.0.10-5.fc6 Firefox/1.5.0.10

Description of problem:
Working with a customer who requires a service to be restarted locally first.  If the service is restarted locally X times within time frame Y (for example, 3 times in 1 hour), the service should then be relocated to a different node in the cluster.

Version-Release number of selected component (if applicable):


How reproducible:
Always


Steps to Reproduce:
1. The service has a problem, possibly due to hardware, and is restarted.
2. The problem happens again and the service is restarted again.
3. This creates a loop where the service is restarted over and over again but never relocated to a different machine.

Actual Results:
The service just keeps restarting and never relocates to a new server.  This could be damaging, as it requires manual intervention.

Expected Results:
You should be able to set values in the services tab specifying that, if the service is restarted X times within time frame Y, the service is relocated to a separate machine.

Additional info:
Both X (the number of restarts) and Y (the time frame) must be configurable, as some customers might find that 3 restarts in an hour is a good limit, while others might only want 2 restarts in a 24-hour period, or possibly a week-long period.

Comment 1 Lon Hohberger 2007-07-11 19:57:26 UTC

The restarts themselves are not tracked currently in rgmanager; that is, a
restart itself is not recorded long-term; it is handled and never worried about
again.

In order to implement a 'time-based' limit on X restarts, we would either need
to store more information in VF (such as an ancillary data block to record
restart histories), store the information locally (other nodes shouldn't care
about this information - since they're not running the service), or alter the
semantics of how parts of the rg_state_t structure are used:

typedef struct {
        char            rs_name[64];    /**< Service name */
        uint64_t        rs_owner;       /**< Member ID running service. */
        uint64_t        rs_last_owner;  /**< Last member to run the service. */
        uint32_t        rs_state;       /**< State of service. */
        uint32_t        rs_restarts;    /**< Number of cluster-induced 
                                             restarts */
        uint64_t        rs_transition;  /**< Last service transition time */
        uint32_t        rs_id;          /**< Service ID */
        uint32_t        rs_pad;         /**< pad to 64-bit boundary */
} rg_state_t;

(and utilize the rs_pad field for something...).  Basically, changing the size
of the above structure can not be done - it will break rolling upgrade to do so.
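
The size constraint described above can be checked at compile time. The following is a minimal sketch, assuming the fixed-width fields are laid out with natural alignment and no compiler-inserted padding; the 104-byte total and the offsets are derived from the field sizes in the structure quoted above, not taken from the rgmanager source.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy of the structure quoted in comment 1. */
typedef struct {
        char            rs_name[64];    /* Service name */
        uint64_t        rs_owner;       /* Member ID running service */
        uint64_t        rs_last_owner;  /* Last member to run the service */
        uint32_t        rs_state;       /* State of service */
        uint32_t        rs_restarts;    /* Number of cluster-induced restarts */
        uint64_t        rs_transition;  /* Last service transition time */
        uint32_t        rs_id;          /* Service ID */
        uint32_t        rs_pad;         /* pad to 64-bit boundary */
} rg_state_t;

/* 64 + 8 + 8 + 4 + 4 + 8 + 4 + 4 = 104 bytes (assumes no extra padding). */
_Static_assert(sizeof(rg_state_t) == 104,
               "rg_state_t size changed; rolling upgrade would break");

/* rs_id and rs_pad occupy the last 8 bytes of the structure. */
_Static_assert(offsetof(rg_state_t, rs_id) == 96, "rs_id moved");
_Static_assert(offsetof(rg_state_t, rs_pad) == 100, "rs_pad moved");
```

Any change that reuses rs_pad for new data keeps both assertions true, whereas adding a field would fail the build.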

Comment 2 Lon Hohberger 2007-07-11 19:59:08 UTC

With a node-local recording of cluster-induced restarts, it is very easy to
throttle restarts based on X in Y time.
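
A node-local "X restarts in Y time" throttle could look roughly like the sketch below. It is illustrative only and is not the attached patch: the names restart_window_t, restart_window_init, restart_window_check and RESTART_HISTORY_MAX are hypothetical, and the history is kept in a small fixed-size array on the node that owns the service.

```c
#include <stddef.h>
#include <time.h>

#define RESTART_HISTORY_MAX 16  /* hypothetical cap on tracked restarts */

typedef struct {
        time_t  stamps[RESTART_HISTORY_MAX];    /* times of recent restarts */
        size_t  count;                          /* entries in use */
        size_t  max_restarts;                   /* X: restarts allowed */
        time_t  window;                         /* Y: window, in seconds */
} restart_window_t;

void restart_window_init(restart_window_t *w, size_t max_restarts,
                         time_t window)
{
        w->count = 0;
        /* Clamp X so the history array can never overflow. */
        w->max_restarts = max_restarts > RESTART_HISTORY_MAX ?
                          RESTART_HISTORY_MAX : max_restarts;
        w->window = window;
}

/*
 * Call when the service fails at time 'now'.
 * Returns 0 to restart locally, 1 to relocate.
 */
int restart_window_check(restart_window_t *w, time_t now)
{
        size_t i, kept = 0;

        /* Forget restarts that have aged out of the window. */
        for (i = 0; i < w->count; i++) {
                if (now - w->stamps[i] < w->window)
                        w->stamps[kept++] = w->stamps[i];
        }
        w->count = kept;

        if (w->count >= w->max_restarts)
                return 1;       /* X restarts already used within Y */

        w->stamps[w->count++] = now;
        return 0;
}
```

With X = 3 and Y = 3600, three failures inside an hour are restarted locally and a fourth triggers relocation; once the oldest restart is more than an hour old, a local restart is allowed again.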

Comment 5 Lon Hohberger 2007-08-21 20:36:44 UTC

Created attachment 162009
Patch. pass 1.

Comment 6 Lon Hohberger 2007-08-21 20:41:18 UTC

Created attachment 162010
Pass 2; adds support to the resource-agent so that it's picked up via the config

Note: does not do time-based throttling; only a hard limit.
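
For contrast with time-based throttling, a hard limit needs only a counter. The sketch below is an assumption about what such a check might look like, not the contents of the attachment; recover_action_t and hard_limit_action are hypothetical names, and treating a limit of 0 as "no limit" is likewise an assumption.

```c
#include <stdint.h>

typedef enum {
        RECOVER_RESTART,        /* restart on the current node */
        RECOVER_RELOCATE        /* move to another node */
} recover_action_t;

/*
 * Decide what to do after a failure, given the running restart count
 * and a configured maximum.  The count never expires on its own; it is
 * reset only when the service is relocated.
 */
recover_action_t hard_limit_action(uint32_t *restarts, uint32_t max_restarts)
{
        /* Assumption: max_restarts == 0 means the limit is disabled. */
        if (max_restarts != 0 && *restarts >= max_restarts) {
                *restarts = 0;
                return RECOVER_RELOCATE;
        }

        (*restarts)++;
        return RECOVER_RESTART;
}
```

Because nothing ages out, two failures a month apart count the same as two failures a minute apart, which is the gap the original request asks to close.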

Comment 7 Lon Hohberger 2007-08-21 20:43:18 UTC

The rs_id and rs_pad fields are not used by rgmanager.  We could use these
fields as a "first-start" time.

Comment 8 Lon Hohberger 2007-08-21 20:44:17 UTC

(It's not even endian-swapped in reslist.h)
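
The idea in comments 7 and 8 of reusing rs_id and rs_pad as a "first-start" time could be sketched as follows. This is a hedged illustration: first_start_store, first_start_load and swap32 are hypothetical helpers, the choice of rs_id for the high half is arbitrary, and whether the fields would need byte-swapping between nodes is an assumption based on the remark that they are currently not swapped.

```c
#include <stdint.h>

/* Split a 64-bit timestamp across the two unused 32-bit fields. */
void first_start_store(uint32_t *rs_id, uint32_t *rs_pad, uint64_t t)
{
        *rs_id  = (uint32_t)(t >> 32);          /* high 32 bits */
        *rs_pad = (uint32_t)(t & 0xffffffffu);  /* low 32 bits */
}

/* Reassemble the timestamp from the two fields. */
uint64_t first_start_load(uint32_t rs_id, uint32_t rs_pad)
{
        return ((uint64_t)rs_id << 32) | rs_pad;
}

/*
 * Byte-swap one 32-bit field.  If the fields carried real data between
 * nodes of different byte order, each half would presumably need this.
 */
uint32_t swap32(uint32_t v)
{
        return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
               ((v << 8) & 0x00ff0000u) | (v << 24);
}
```

Since the two fields together are exactly 8 bytes, this keeps the size of rg_state_t unchanged.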

Comment 11 Lon Hohberger 2008-04-15 15:07:16 UTC

Pushed to RHEL4 git branch

Comment 14 errata-xmlrpc 2008-07-25 19:15:09 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0791.html