Bug 147528 - RFE: clumanager should monitor local disks for I/O errors
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: clumanager
Version: 3
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Lon Hohberger
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2005-02-08 20:13 UTC by Lon Hohberger
Modified: 2009-04-16 20:16 UTC
CC List: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-01-27 21:07:05 UTC
Embargoed:



Description Lon Hohberger 2005-02-08 20:13:10 UTC
Description of problem:

When running, clumanager will try to continue to operate even if the
local file systems containing /etc and /usr/lib stop working properly,
for example when the disk is pulled out of the machine while it is running.

Services cannot fail over and, in fact, enter the 'failed' state
because the service script (/usr/lib/clumanager/services/service)
cannot be executed when the file system is no longer there.  Once a
service has entered the 'failed' state, it requires manual
intervention to restart.


Version-Release number of selected component (if applicable): 1.2.22-2


How reproducible: 100%


Steps to Reproduce:
1. Start clumanager
2. Pull local disks out of node
  
Actual results:
I/O errors on the console; no action taken.  Processes hang, but
heartbeat continues to operate so the node is never declared 'out' of
the cluster.


Expected results:
Unknown.  Typically, when clumanager enters a state from which it
cannot recover automagically, it reboots in the hope that a clean boot
will save it.
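
(Illustrative aside, not from clumanager: if the local disks are gone,
exec'ing /sbin/reboot would likely fail for the same reason the service
script does, so an automatic-reboot path would probably have to invoke
the reboot(2) syscall directly from within the daemon, e.g.:)

/* Minimal sketch: reboot without exec'ing anything from the (possibly
 * dead) local file systems.  Requires root. */
#include <unistd.h>
#include <sys/reboot.h>

void emergency_reboot(void)
{
        sync();              /* best effort; may hang if local I/O is dead */
        reboot(RB_AUTOBOOT); /* restart immediately, bypassing init scripts */
}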


Additional information:
I have implemented a thread that can be added to one of the daemons to
monitor the kernel's log messages for I/O errors on the root and /usr
partitions.

More work is necessary to make the service manager aware of when the
local disks are misbehaving (for example, while trying to exec the
service script), but this is a step in the right direction.
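
For illustration, a minimal sketch of such a monitoring thread,
assuming it tails /proc/kmsg for the kernel's "I/O error" messages;
this is not the actual patch, and the device names and the
io_error_detected() handler are placeholders.  Note that /proc/kmsg
messages are consumed on read, so a real implementation would have to
coexist with klogd or read the ring buffer non-destructively.

/*
 * Hypothetical sketch, not the actual clumanager patch: a thread that
 * tails /proc/kmsg and flags kernel I/O errors on the local disks.
 */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

static void io_error_detected(const char *line)
{
        /* In clumanager, this is where recovery (e.g. a reboot) would start */
        fprintf(stderr, "local disk I/O error: %s", line);
}

static void *kmsg_monitor(void *arg)
{
        char line[1024];
        FILE *kmsg = fopen("/proc/kmsg", "r");   /* requires root */

        if (!kmsg)
                return NULL;

        while (fgets(line, sizeof(line), kmsg)) {
                /* The kernel logs failed block requests as "I/O error" */
                if (strstr(line, "I/O error") &&
                    (strstr(line, "sda") || strstr(line, "md0")))
                        io_error_detected(line);
        }
        fclose(kmsg);
        return NULL;
}

int main(void)
{
        pthread_t tid;

        pthread_create(&tid, NULL, kmsg_monitor, NULL);
        pthread_join(tid, NULL);
        return 0;
}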

Comment 2 Lon Hohberger 2005-02-08 21:10:06 UTC
(1) Watching the device for I/O errors works in the single-disk case.
 This is fairly obvious.

(2) Pulling one of the disks in a software RAID/LVM set will not
generate I/O errors at the RAID/LVM level (the set continues to
operate), so this case is covered.

(3) Pulling all disks in a software RAID/LVM set has not been tested
yet.  The assumption is that it will generate I/O errors, but this is
not necessarily the case. [TODO]

(4) I/O error checking will work fine with any hardware or host-RAID
controller (without additional software RAID/LVM on top), given that
such controllers appear as regular SCSI disks to the host.


Comment 4 Lon Hohberger 2005-10-04 21:17:41 UTC
A better implementation would be similar to the disk-monitoring
application we crafted for RHEL4 to supplant our previous quorum-disk
monitoring.
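
For illustration only (a guess at the technique, not the RHEL4 code):
such a monitor could periodically issue a direct read against the
block device backing the root and /usr file systems and treat a failed
read as a dead disk, which would also sidestep the open question in
point (3) above about whether md/LVM propagate I/O errors upward.  The
device path and interval below are arbitrary placeholders.

/* Sketch of a periodic direct-read disk check. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define SECTOR 512

int disk_alive(const char *dev)
{
        void *buf;
        int fd, ok = 0;

        /* O_DIRECT bypasses the page cache so the read must hit the disk */
        fd = open(dev, O_RDONLY | O_DIRECT);
        if (fd < 0)
                return 0;

        /* O_DIRECT needs a sector-aligned buffer, size, and offset */
        if (posix_memalign(&buf, SECTOR, SECTOR) == 0) {
                ok = (pread(fd, buf, SECTOR, 0) == SECTOR);
                free(buf);
        }
        close(fd);
        return ok;
}

int main(void)
{
        for (;;) {
                if (!disk_alive("/dev/sda"))   /* placeholder device */
                        fprintf(stderr, "local disk check failed\n");
                sleep(5);                      /* placeholder interval */
        }
}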

