Bug 149345

Summary: service/device failure leads to uncontrolled fail overs
Product: [Retired] Red Hat Cluster Suite
Reporter: Dean Elling <dean.elling>
Component: clumanager
Assignee: Lon Hohberger <lhh>
Status: CLOSED WONTFIX
QA Contact: Cluster QE <mspqa-list>
Severity: medium
Priority: medium
Version: 3
CC: cluster-maint, rkenna
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2006-02-02 16:05:47 UTC
Bug Blocks: 169576    

Description Dean Elling 2005-02-22 17:10:03 UTC
Description of problem:

We are using the RH cluster manager to maintain a highly available
database. The database service has a device associated with it that
holds the database table space files, index files, log files, and
everything else associated with the database. On our system, the
database files became corrupt and the status check of the database
service returned a failure. The RH cluster manager attempted to restart
the service, but the database was still corrupt, so this resulted in a
fail over to another node. Since the corrupt database on the service
device moved along with the service, the service failed on the new
node as well, which resulted in yet another fail over to another node.
This continued until the clumanager was stopped on the cluster nodes.

We would like a mechanism within the clumanager for throttling fail
overs, such that a configurable number of fail overs within a
configurable amount of time triggers a chkconfig 'off' of the
clumanager.

This is not a request for the resolution of the problem that a service
is encountering. This is a request for a means for controlling the
number of continuous fail overs before stopping the clumanager.
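
The requested throttle could look roughly like the following sliding-window counter. This is only a sketch of the idea, not clumanager code: the struct, the function names, and the fixed history size are all hypothetical, and the sketch only reports whether another fail over is allowed. What to do when the limit is hit (for example, stopping the clumanager) is left to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

#define THROTTLE_MAX_EVENTS 16  /* assumed upper bound on the configurable limit */

/* Hypothetical throttle state; none of these names exist in clumanager. */
struct failover_throttle {
    unsigned max_failovers;               /* fail overs allowed per window */
    time_t   window_secs;                 /* window length in seconds */
    time_t   stamps[THROTTLE_MAX_EVENTS]; /* times of recent fail overs */
    size_t   count;                       /* valid entries in stamps */
};

void throttle_init(struct failover_throttle *t,
                   unsigned max_failovers, time_t window_secs)
{
    assert(t != NULL);
    assert(max_failovers > 0 && max_failovers <= THROTTLE_MAX_EVENTS);
    t->max_failovers = max_failovers;
    t->window_secs = window_secs;
    t->count = 0;
}

/*
 * Ask whether a fail over at time `now` may proceed.
 * Returns true and records the event if the limit has not been reached,
 * false if max_failovers have already happened within the window.
 */
bool throttle_allow_failover(struct failover_throttle *t, time_t now)
{
    size_t kept = 0;

    assert(t != NULL);

    /* Forget fail overs that have aged out of the window. */
    for (size_t i = 0; i < t->count; i++) {
        if (now - t->stamps[i] < t->window_secs)
            t->stamps[kept++] = t->stamps[i];
    }
    t->count = kept;

    if (t->count >= t->max_failovers)
        return false;           /* limit reached: stop ping-ponging */

    t->stamps[t->count++] = now;
    return true;
}
```

With a limit of 2 fail overs per 60 seconds, the third attempt inside the window is refused, and attempts are allowed again once the oldest recorded fail over is more than 60 seconds old. In a real cluster this state would have to be shared between nodes, since each fail over lands on a different member.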

Version-Release number of selected component (if applicable):

We are currently running clumanager-1.2.24.

How reproducible:

Define a service device for a highly-available database service. Cause
some corruption in the files on the service device such that the
status of the database returns failure and initiates fail over in the
cluster manager.

Steps to Reproduce:
1. Configure highly-available service (database) in the RH cluster manager
2. Corrupt the database administrative files on the database service
   device
3. The database service status returns failure
4. Restart of the database service fails
5. Fail over is initiated by cluster manager
6. The database service device is mounted on another node and the
   service is started but returns failure

Actual results:

The fail over of the service ping-pongs among nodes

Expected results:

A configurable number of fail overs are attempted in a configurable
amount of time before the clumanager is chkconfig'ed off

Additional info:

Comment 1 Lon Hohberger 2005-02-22 19:21:21 UTC
Please file a ticket with Red Hat Support so this is properly tracked.  

It sounds like a generally useful feature.

Comment 3 Lon Hohberger 2005-10-04 21:16:26 UTC
This is possible if we split up the serviceblock structure's two uint16_t around
restarts / checks into two uint8_t.

We could add a max_faults, which would disable the service after so many
restarts (unlike max_restarts, which merely relocates the service to another
node after the count is exceeded).

Comment 4 Lon Hohberger 2006-02-02 16:05:47 UTC
This would be difficult to implement while preserving rolling upgrade and not
breaking the on-disk format of service states.