Red Hat Bugzilla – Bug 149345
service/device failure leads to uncontrolled fail overs
Last modified: 2009-04-16 16:16:23 EDT
Description of problem:
We are using the RH cluster manager to maintain a highly-available
database. The database service has a device associated with it which
holds the database table space files, index files, log files and all
else associated with the database. On our system, the database files
became corrupt and the status of the database service returned a
failure. The RH cluster manager attempted to restart the service but
the database was still corrupt so this resulted in a fail over to
another node. Since the corrupt database on the service device moved
along with the service to another node, the service failed on the new
node as well and this then resulted in yet another fail over to
another node. This continued until the clumanager was stopped on the
We would like to have a mechanism within the clumanager for the
throttling of fail overs such that a configurable number of fail overs
in a configurable amount of time would trigger the chkconfig 'off' of
This is not a request for the resolution of the problem that a service
is encountering. This is a request for a means for controlling the
number of continuous fail overs before stopping the clumanager.
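The throttling requested above could be sketched as a sliding-window counter. This is only an illustration under stated assumptions: `struct failover_throttle`, `throttle_init`, `throttle_record_failover` and `THROTTLE_MAX_EVENTS` are hypothetical names and do not exist in clumanager. What the caller does once the limit is reached (for example, stopping clumanager) is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

#define THROTTLE_MAX_EVENTS 16  /* hypothetical cap on remembered fail overs */

/* Hypothetical throttle state; not part of clumanager. */
struct failover_throttle {
    size_t max_failovers;                /* configurable limit */
    time_t window_secs;                  /* configurable time window */
    time_t events[THROTTLE_MAX_EVENTS];  /* timestamps of recent fail overs */
    size_t count;                        /* valid entries in events[] */
};

void throttle_init(struct failover_throttle *t, size_t max_failovers,
                   time_t window_secs)
{
    assert(t != NULL);
    /* Clamp the limit to the history capacity. */
    if (max_failovers > THROTTLE_MAX_EVENTS)
        max_failovers = THROTTLE_MAX_EVENTS;
    t->max_failovers = max_failovers;
    t->window_secs = window_secs;
    t->count = 0;
}

/*
 * Record one fail over at time `now`. Returns true when the number of
 * fail overs inside the window has reached the configured limit, i.e.
 * the caller should stop failing the service over.
 */
bool throttle_record_failover(struct failover_throttle *t, time_t now)
{
    size_t kept = 0;

    /* Forget fail overs that have aged out of the window. */
    for (size_t i = 0; i < t->count; i++) {
        if (now - t->events[i] < t->window_secs)
            t->events[kept++] = t->events[i];
    }
    t->count = kept;

    /* Remember this fail over if there is room left. */
    if (t->count < THROTTLE_MAX_EVENTS)
        t->events[t->count++] = now;

    return t->count >= t->max_failovers;
}
```

With a limit of 3 fail overs in 60 seconds, fail overs at t=0, t=10 and t=20 would trip the throttle on the third call, while fail overs spaced more than 60 seconds apart would never trip it.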
Version-Release number of selected component (if applicable):
We are currently running clumanager-1.2.24.
How reproducible:
Define a service device for a highly-available database service. Cause
some corruption in the files on the service device such that the
status of the database returns failure and initiates fail over in the
Steps to Reproduce:
1. Configure a highly-available service (database) in the RH cluster manager
2. Corrupt the database administrative files on the database service
3. The database service status returns failure
4. Restart of the database service fails
5. Fail over is initiated by the cluster manager
6. The database service device is mounted on another node and the
service is started but returns failure
Actual results:
The fail over of the service ping-pongs among nodes
Expected results:
A configurable number of fail overs are attempted in a configurable
amount of time before the clumanager is chkconfig'ed off
Please file a ticket with Red Hat Support so this is properly tracked.
It sounds like a generally useful feature.
This is possible if we split each of the serviceblock structure's two uint16_t
fields (restarts / checks) into two uint8_t fields.
We could add a max_faults setting, which would disable the service after so many
restarts (unlike max_restarts, which merely relocates the service to another node
after the count is exceeded).
This would be difficult to implement while preserving rolling upgrade and not
breaking the on-disk format of service states.
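As a rough illustration of the compatibility concern, the sketch below packs a fault counter into the high byte of an existing 16-bit restart field. It assumes the split means carving each 16-bit counter into two 8-bit halves. The function names are hypothetical and the real serviceblock layout is not shown in this report.

```c
#include <stdint.h>

/*
 * Hypothetical helpers, not clumanager code: keep the restart count in
 * the low byte of the old 16-bit field and put a new fault count in the
 * high byte, so the size of the on-disk record does not change.
 */
uint16_t pack_restarts_faults(uint8_t restarts, uint8_t faults)
{
    return (uint16_t)(restarts | ((uint16_t)faults << 8));
}

/* Extract the restart count (low byte) from the packed field. */
uint8_t unpack_restarts(uint16_t field)
{
    return (uint8_t)(field & 0xff);
}

/* Extract the fault count (high byte) from the packed field. */
uint8_t unpack_faults(uint16_t field)
{
    return (uint8_t)(field >> 8);
}
```

A node still running the old code would read the whole 16-bit field as a restart count. With 2 restarts and 1 fault it would see 258, which is the kind of mismatch that makes a rolling upgrade hard. The packed value only matches the old meaning while the fault count is zero.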