Description of problem:
We are using the RH cluster manager to maintain a highly available database. The database service has a device associated with it that holds the database table space files, index files, log files, and everything else associated with the database. On our system, the database files became corrupt and the status check for the database service returned a failure. The RH cluster manager attempted to restart the service, but the database was still corrupt, so this resulted in a failover to another node. Since the corrupt database on the service device moved along with the service, the service failed on the new node as well, which triggered yet another failover. This continued until clumanager was stopped on the cluster nodes.

We would like a mechanism within clumanager for throttling failovers, such that a configurable number of failovers within a configurable amount of time would trigger a chkconfig 'off' of clumanager. This is not a request to resolve the problem the service itself is encountering; it is a request for a means of limiting the number of consecutive failovers before stopping clumanager.

Version-Release number of selected component (if applicable):
clumanager-1.2.24

How reproducible:
Define a service device for a highly available database service. Corrupt the files on the service device such that the database status check returns failure and initiates a failover in the cluster manager.

Steps to Reproduce:
1. Configure a highly available service (database) in the RH cluster manager
2. Corrupt the database administrative files on the database service device
3. The database service status check returns failure
4. Restart of the database service fails
5. Failover is initiated by the cluster manager
6. The database service device is mounted on another node and the service is started, but it returns failure again

Actual results:
The service's failover ping-pongs among the nodes.

Expected results:
A configurable number of failovers are attempted within a configurable amount of time before clumanager is chkconfig'ed off.

Additional info:
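To make the requested behavior concrete, here is a minimal sketch of the kind of fault throttling being asked for. The names (fault_throttle, record_failover, max_faults, fault_window) are hypothetical and not from clumanager; this is an illustration of the policy, not an implementation for it.

#include <time.h>

/* Hypothetical throttle: if more than max_faults failovers occur
 * within fault_window seconds, stop managing the service (the
 * equivalent of chkconfig'ing clumanager off) instead of relocating
 * it to yet another node. */
struct fault_throttle {
    unsigned max_faults;      /* configurable failover limit */
    time_t   fault_window;    /* configurable window, in seconds */
    unsigned faults;          /* failovers seen in the current window */
    time_t   window_start;    /* when the current window began */
};

/* Returns 1 if the failover should proceed, 0 if the limit was hit
 * and the cluster manager should disable itself instead. */
int record_failover(struct fault_throttle *ft, time_t now)
{
    if (now - ft->window_start > ft->fault_window) {
        ft->window_start = now;   /* window expired: start counting over */
        ft->faults = 0;
    }
    if (++ft->faults > ft->max_faults)
        return 0;                 /* throttle: stop failing over */
    return 1;
}

A time-windowed counter like this distinguishes the ping-pong case (many failovers in minutes) from occasional, legitimate failovers spread over days, which should not disable the cluster manager.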
Please file a ticket with Red Hat Support so this is properly tracked. It sounds like a generally useful feature.
This is possible if we split each of the serviceblock structure's two uint16_t counters (restarts and checks) into two uint8_t fields. We could then add a max_faults field, which would disable the service after that many faults, unlike max_restarts, which merely relocates the service to another node once its count is exceeded.
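A rough sketch of the split, with hypothetical field names; the actual serviceblock layout in clumanager's source may differ:

#include <stdint.h>

/* Before: two 16-bit counters. */
struct serviceblock_old {
    uint16_t sb_restarts;    /* restart count */
    uint16_t sb_checks;      /* status-check count */
};

/* After: each 16-bit counter split into two 8-bit fields, making
 * room for per-service fault accounting without growing the struct. */
struct serviceblock_new {
    uint8_t sb_restarts;     /* restarts on the current node */
    uint8_t sb_max_restarts; /* relocate after this many restarts */
    uint8_t sb_faults;       /* failovers counted across nodes */
    uint8_t sb_max_faults;   /* disable the service after this many */
};

Keeping the struct the same size is what makes the change even conceivable for the shared service-state format.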
This would be difficult to implement while preserving rolling-upgrade support and not breaking the on-disk format of service states.
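One illustration of the compatibility hazard (my own sketch, not clumanager code): a uint16_t counter written to shared state by an old node is reinterpreted as two uint8_t fields by a new node, and the result depends on host byte order, so a mixed-version cluster could misread counters mid-upgrade.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    uint16_t old_counter = 3;   /* as written by an old-format node */
    uint8_t  split[2];

    memcpy(split, &old_counter, sizeof split);

    /* On a little-endian host this prints byte0=3 byte1=0; a new node
     * that treats byte 0 as "restarts" and byte 1 as "max_restarts"
     * would see max_restarts == 0 until the state is reinitialized. */
    printf("byte0=%u byte1=%u\n", split[0], split[1]);
    return 0;
}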