Description of problem:
We are using the RH cluster manager to maintain a highly available database. The database service has a device associated with it that holds the database table space files, index files, log files, and everything else associated with the database. On our system, the database files became corrupt and the status check for the database service returned a failure. The RH cluster manager attempted to restart the service, but the database was still corrupt, so this resulted in a failover to another node. Since the corrupt database on the service device moved along with the service, the service failed on the new node as well, which triggered yet another failover. This continued until clumanager was stopped on the cluster nodes.

We would like a mechanism within clumanager for throttling failovers, such that a configurable number of failovers within a configurable amount of time would trigger a chkconfig 'off' of clumanager. This is not a request to resolve the problem the service itself is encountering; it is a request for a means of limiting the number of consecutive failovers before stopping clumanager.

Version-Release number of selected component (if applicable):
clumanager-1.2.24

How reproducible:
Define a service device for a highly available database service. Corrupt the files on the service device such that the database status check returns failure and initiates a failover in the cluster manager.

Steps to Reproduce:
1. Configure a highly available service (database) in the RH cluster manager
2. Corrupt the database administrative files on the database service device
3. The database service status check returns failure
4. Restart of the database service fails
5. Failover is initiated by the cluster manager
6. The database service device is mounted on another node and the service is started, but it returns failure again

Actual results:
The service's failover ping-pongs among the nodes.

Expected results:
A configurable number of failovers are attempted within a configurable amount of time before clumanager is chkconfig'ed off.

Additional info:
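To make the requested behavior concrete, here is a minimal sketch of the kind of fault throttling being asked for. The names (fault_throttle, record_failover, max_faults, fault_window) are hypothetical and not from clumanager; this is an illustration of the policy, not an implementation for it.

#include <time.h>

/* Hypothetical throttle: if more than max_faults failovers occur
 * within fault_window seconds, stop managing the service (the
 * equivalent of chkconfig'ing clumanager off) instead of relocating
 * it to yet another node. */
struct fault_throttle {
    unsigned max_faults;      /* configurable failover limit */
    time_t   fault_window;    /* configurable window, in seconds */
    unsigned faults;          /* failovers seen in the current window */
    time_t   window_start;    /* when the current window began */
};

/* Returns 1 if the failover should proceed, 0 if the limit was hit
 * and the cluster manager should disable itself instead. */
int record_failover(struct fault_throttle *ft, time_t now)
{
    if (now - ft->window_start > ft->fault_window) {
        ft->window_start = now;   /* window expired: start counting over */
        ft->faults = 0;
    }
    if (++ft->faults > ft->max_faults)
        return 0;                 /* throttle: stop failing over */
    return 1;
}

A time-windowed counter like this distinguishes the ping-pong case (many failovers in minutes) from occasional, legitimate failovers spread over days, which should not disable the cluster manager.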
Please file a ticket with Red Hat Support so this is properly tracked. It sounds like a generally useful feature.
This is possible if we split each of the serviceblock structure's two uint16_t counters (restarts and checks) into two uint8_t fields. We could then add a max_faults field, which would disable the service after that many faults, unlike max_restarts, which merely relocates the service to another node once its count is exceeded.
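A rough sketch of the split, with hypothetical field names; the actual serviceblock layout in clumanager's source may differ:

#include <stdint.h>

/* Before: two 16-bit counters. */
struct serviceblock_old {
    uint16_t sb_restarts;    /* restart count */
    uint16_t sb_checks;      /* status-check count */
};

/* After: each 16-bit counter split into two 8-bit fields, making
 * room for per-service fault accounting without growing the struct. */
struct serviceblock_new {
    uint8_t sb_restarts;     /* restarts on the current node */
    uint8_t sb_max_restarts; /* relocate after this many restarts */
    uint8_t sb_faults;       /* failovers counted across nodes */
    uint8_t sb_max_faults;   /* disable the service after this many */
};

Keeping the struct the same size is what makes the change even conceivable for the shared service-state format.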
This would be difficult to implement while preserving rolling-upgrade support and not breaking the on-disk format of service states.
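One illustration of the compatibility hazard (my own sketch, not clumanager code): a uint16_t counter written to shared state by an old node is reinterpreted as two uint8_t fields by a new node, and the result depends on host byte order, so a mixed-version cluster could misread counters mid-upgrade.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    uint16_t old_counter = 3;   /* as written by an old-format node */
    uint8_t  split[2];

    memcpy(split, &old_counter, sizeof split);

    /* On a little-endian host this prints byte0=3 byte1=0; a new node
     * that treats byte 0 as "restarts" and byte 1 as "max_restarts"
     * would see max_restarts == 0 until the state is reinitialized. */
    printf("byte0=%u byte1=%u\n", split[0], split[1]);
    return 0;
}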