Bug 149345 - service/device failure leads to uncontrolled fail overs
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: clumanager
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Assigned To: Lon Hohberger
QA Contact: Cluster QE
Depends On:
Blocks: 169576
Reported: 2005-02-22 12:10 EST by Dean Elling
Modified: 2009-04-16 16:16 EDT (History)
See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Last Closed: 2006-02-02 11:05:47 EST

Description Dean Elling 2005-02-22 12:10:03 EST
Description of problem:

We are using the RH cluster manager to maintain a highly-available
database. The database service has a device associated with it which
holds the database table space files, index files, log files and all
else associated with the database. On our system, the database files
became corrupt and the status of the database service returned a
failure. The RH cluster manager attempted to restart the service but
the database was still corrupt so this resulted in a fail over to
another node. Since the corrupt database on the service device moved
along with the service to another node, the service failed on the new
node as well and this then resulted in yet another fail over to
another node. This continued until the clumanager was stopped on the
cluster nodes.

We would like to have a mechanism within the clumanager for the
throttling of fail overs such that a configurable number of fail overs
in a configurable amount of time would trigger the chkconfig 'off' of
the clumanager.

This is not a request for the resolution of the problem that a service
is encountering. This is a request for a means for controlling the
number of continuous fail overs before stopping the clumanager.
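
The requested throttle could be sketched as a sliding-window counter. This is only an illustration of the idea, not clumanager code: the structure `failover_throttle`, the functions `throttle_init` and `throttle_record`, and the constant `THROTTLE_MAX_EVENTS` are all hypothetical names, and the caller is assumed to take the "stop" action (for example, turning clumanager off) when `throttle_record` returns false.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Capacity of the timestamp history (assumption for this sketch). */
#define THROTTLE_MAX_EVENTS 16

/* Hypothetical throttle state; not an existing clumanager structure. */
struct failover_throttle {
    time_t events[THROTTLE_MAX_EVENTS]; /* times of recent fail overs */
    size_t count;                       /* valid entries in events[] */
    size_t max_failovers;               /* configurable limit */
    time_t window_secs;                 /* configurable time window */
};

static void throttle_init(struct failover_throttle *t,
                          size_t max_failovers, time_t window_secs)
{
    assert(max_failovers > 0 && max_failovers <= THROTTLE_MAX_EVENTS);
    t->count = 0;
    t->max_failovers = max_failovers;
    t->window_secs = window_secs;
}

/*
 * Ask permission for a fail over at time `now`.
 * Returns true and records the event if fewer than max_failovers
 * fail overs happened within the last window_secs seconds.
 * Returns false if the limit is already reached, in which case the
 * caller should stop failing over.
 */
static bool throttle_record(struct failover_throttle *t, time_t now)
{
    size_t kept = 0;

    /* Forget fail overs that have aged out of the window. */
    for (size_t i = 0; i < t->count; i++) {
        if (now - t->events[i] < t->window_secs)
            t->events[kept++] = t->events[i];
    }
    t->count = kept;

    if (t->count >= t->max_failovers)
        return false;

    t->events[t->count++] = now;
    return true;
}
```

With a limit of 3 fail overs per 60 seconds, the fourth attempt inside the window is refused, and attempts are allowed again once the earlier events have aged out.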

Version-Release number of selected component (if applicable):

We are currently running clumanager-1.2.24.

How reproducible:

Define a service device for a highly-available database service. Cause
some corruption in the files on the service device such that the
status of the database returns failure and initiates fail over in the
cluster manager.

Steps to Reproduce:
1. Configure highly-available service (database) in the RH cluster manager
2. Corrupt the database administrative files on the database service
3. The database service status returns failure
4. Restart of the database service fails
5. Fail over is initiated by cluster manager
6. The database service device is mounted on another node and the
service is started but returns failure

Actual results:

The fail over of the service ping-pongs among nodes

Expected results:

A configurable number of fail overs are attempted in a configurable
amount of time before the clumanager is chkconfig'ed off

Additional info:

Comment 1 Lon Hohberger 2005-02-22 14:21:21 EST
Please file a ticket with Red Hat Support so this is properly tracked.  

It sounds like a generally useful feature.

Comment 3 Lon Hohberger 2005-10-04 17:16:26 EDT
This is possible if we split the serviceblock structure's two uint16_t fields
(restarts and checks) into uint8_t fields, which frees room for another counter.

We could add a max_faults, which would disable the service after so many
restarts (unlike max_restarts, which merely relocates the service to another
node after the count is exceeded).
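
One possible reading of this proposal, as a rough sketch: the real serviceblock layout is not shown in this report, so the structures `counters_old` and `counters_new`, the `faults` and `reserved` fields, and the function `record_fault` are hypothetical stand-ins. The sketch only illustrates how narrowing two 16-bit counters to 8 bits leaves room for a fault counter without changing the structure's size.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical "before" layout: two 16-bit counters. */
struct counters_old {
    uint16_t restarts;
    uint16_t checks;
};

/* Hypothetical "after" layout: same 4 bytes, narrower counters. */
struct counters_new {
    uint8_t restarts;
    uint8_t faults;   /* new: failed restarts counted toward max_faults */
    uint8_t checks;
    uint8_t reserved; /* unused in this sketch */
};

/* The overall size is unchanged in this sketch. */
static_assert(sizeof(struct counters_old) == sizeof(struct counters_new),
              "counter block must keep its size");

/*
 * Count one fault. Returns 1 if the service should be disabled
 * (fault count reached max_faults), 0 otherwise. A max_faults of 0
 * is treated here as "no limit" (an assumption of this sketch).
 */
static int record_fault(struct counters_new *c, uint8_t max_faults)
{
    if (c->faults < UINT8_MAX)
        c->faults++; /* saturate instead of wrapping around */
    return max_faults != 0 && c->faults >= max_faults;
}
```

Even though the size matches, a node that still expects the old 16-bit fields would misread the new bytes, so a change like this alters how existing stored state is interpreted.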

Comment 4 Lon Hohberger 2006-02-02 11:05:47 EST
This would be difficult to implement while preserving rolling upgrade and not
breaking the on-disk format of service states.
