Red Hat Bugzilla – Bug 147528
RFE: clumanager should monitor local disks for I/O errors
Last modified: 2009-04-16 16:16:20 EDT
Description of problem:
When running, clumanager tries to continue operating even if the
local file systems containing /etc and /usr/lib stop working properly,
for example when the disk is pulled out of the machine while it is running.
Services cannot fail over and, in fact, enter the 'failed' state
because the service script (/usr/lib/clumanager/services/service)
cannot be executed when the file system is not there. Once a service has
entered the 'failed' state, it requires manual intervention to restart.
Version-Release number of selected component (if applicable): 1.2.22-2
How reproducible: 100%
Steps to Reproduce:
1. Start clumanager
2. Pull local disks out of node
I/O errors on the console; no action taken. Processes hang, but
heartbeat continues to operate so the node is never declared 'out' of
Unknown. Typically, when clumanager enters a state where it cannot
recover automagically, it reboots in the hopes that a clean boot will
I have implemented a thread which can be added to one of the daemons
in order to monitor the kernel's log messages for I/O errors on the
root and /usr partitions.
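
As a rough illustration of the log-watching idea (not the actual thread added to the daemon, which is not shown in this report), the matching step could look something like the Python sketch below. The message shape "I/O error, dev <name>, sector <n>" is an assumption; the exact wording differs between kernel versions, so the pattern would need adjusting for the target kernel. The names IO_ERROR_RE and find_io_errors are hypothetical.

```python
import re

# Assumed message shape; real kernels word this differently across versions.
IO_ERROR_RE = re.compile(r"I/O error, dev (?P<dev>[^,\s]+), sector (?P<sector>\d+)")


def find_io_errors(lines, watched_devices):
    """Return (device, sector) for each I/O error line naming a watched device."""
    hits = []
    for line in lines:
        match = IO_ERROR_RE.search(line)
        if match and match.group("dev") in watched_devices:
            hits.append((match.group("dev"), int(match.group("sector"))))
    return hits


if __name__ == "__main__":
    sample = [
        "end_request: I/O error, dev sda, sector 12345",  # watched device
        "end_request: I/O error, dev sdb, sector 8",      # not watched
        "eth0: link up",                                  # unrelated message
    ]
    # Only the line for the watched device is reported.
    print(find_io_errors(sample, {"sda"}))
```

A monitoring thread would feed this function lines read from the kernel log and take whatever recovery action the daemon prefers (for example, rebooting) once a hit on the root or /usr device appears.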
More work is necessary to have the service manager aware of when the
local disks are misbehaving (for example, while trying to exec the
service script), but this is a step in the right direction.
(1) Watching the device for I/O errors works in the single-disk case.
This is fairly obvious.
(2) Pulling one of the disks in a software RAID/LVM set will not
generate I/O errors at the RAID/LVM level, so this is covered.
(3) Pulling all disks in a software RAID/LVM set has not been tested
yet. The assumption is that it will generate I/O errors, but this is
not necessarily the case. [TODO]
(4) I/O error checking will work fine in any hardware or host-RAID
controller (without additional software RAID/LVM), given that they
appear as regular SCSI disks to the host.
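
The cases above depend on which block device actually backs / and /usr, since that is the device name to look for in the kernel log (a plain SCSI disk in cases 1 and 4, a RAID/LVM device in cases 2 and 3). A minimal sketch of that lookup, assuming the standard /proc/mounts layout (device, mount point, file system type, options, ...), might be the following. The function name device_for_path is hypothetical, and the device names in the example are made up.

```python
def device_for_path(mounts_text, path):
    """Return the device whose mount point is the longest prefix of path."""
    best_dev, best_len = None, -1
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        dev, mount = fields[0], fields[1]
        covers = (
            mount == "/"
            or path == mount
            or path.startswith(mount.rstrip("/") + "/")
        )
        # ">=" lets a later entry for the same mount point override an earlier one.
        if covers and len(mount) >= best_len:
            best_dev, best_len = dev, len(mount)
    return best_dev


if __name__ == "__main__":
    # Example text in /proc/mounts format; device names are illustrative only.
    example = (
        "rootfs / rootfs rw 0 0\n"
        "/dev/sda1 / ext3 rw 0 0\n"
        "/dev/sda3 /usr ext3 rw 0 0\n"
    )
    print(device_for_path(example, "/usr/lib/clumanager/services/service"))
    print(device_for_path(example, "/etc"))
```

On a live system the text would come from reading /proc/mounts; it is passed in as a string here so the lookup can be tried without touching the host.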
A better implementation of this would be similar to our disk monitoring
application which we've crafted for RHEL4 to supplant the previous quorum-disk
monitoring we did.