Description of problem:
When running, clumanager tries to continue operating even if the local file systems containing /etc and /usr/lib stop working properly (for example, after pulling the disk out of the machine while it is running). Services cannot fail over and, in fact, enter the 'failed' state because the service script (/usr/lib/clumanager/services/service) cannot be executed when the file system is no longer there. Once a service has entered the 'failed' state, it requires manual intervention to restart.

Version-Release number of selected component (if applicable):
1.2.22-2

How reproducible:
100%

Steps to Reproduce:
1. Start clumanager
2. Pull local disks out of node

Actual results:
I/O errors on the console; no action taken. Processes hang, but heartbeat continues to operate, so the node is never declared 'out' of the cluster.

Expected results:
Unknown. Typically, when clumanager enters a state from which it cannot recover automagically, it reboots in the hope that a clean boot will save it.

Additional information:
I have implemented a thread which can be added to one of the daemons to monitor the kernel's log messages for I/O errors on the root and /usr partitions. More work is needed to make the service manager aware of when the local disks are misbehaving (for example, while trying to exec the service script), but this is a step in the right direction.
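A rough sketch of the kind of watcher thread described above. This is not the actual patch: the device names, the polling interval, and the reboot action are placeholders for whatever the daemon's real policy ends up being. It takes a non-destructive snapshot of the kernel ring buffer with klogctl() and scans it for I/O errors on the watched devices.

/* io_error_watch.c - sketch only; compile with: gcc -o io_error_watch io_error_watch.c -lpthread */
#define _GNU_SOURCE              /* for memmem() */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/klog.h>

#define KLOG_READ_ALL  3         /* non-destructive snapshot of the kernel ring buffer */
#define LOG_BUF_SIZE   (64 * 1024)

/* Devices assumed to back the local file systems (placeholder names). */
static const char *watched_devs[] = { "sda", "sdb", NULL };

/* Return nonzero if the log snapshot contains an I/O error on a watched device. */
static int log_has_io_error(const char *buf)
{
    const char *p = buf;

    while ((p = strstr(p, "I/O error")) != NULL) {
        /* Bound the check to the log line containing the error message,
         * e.g. "end_request: I/O error, dev 08:00 (sda), sector 12345". */
        const char *start = p;
        const char *end;
        size_t len;
        int i;

        while (start > buf && start[-1] != '\n')
            start--;
        end = strchr(p, '\n');
        len = end ? (size_t)(end - start) : strlen(start);

        for (i = 0; watched_devs[i] != NULL; i++) {
            if (memmem(start, len, watched_devs[i], strlen(watched_devs[i])) != NULL)
                return 1;
        }
        p += strlen("I/O error");
    }
    return 0;
}

/* Thread body: poll the kernel log and act when a watched device goes bad. */
static void *io_error_watch(void *arg)
{
    char *buf = malloc(LOG_BUF_SIZE + 1);

    (void)arg;
    if (buf == NULL)
        return NULL;

    for (;;) {
        int n = klogctl(KLOG_READ_ALL, buf, LOG_BUF_SIZE);

        if (n > 0) {
            buf[n] = '\0';
            if (log_has_io_error(buf)) {
                /* Placeholder policy: in line with how clumanager handles other
                 * unrecoverable states, the real thread would reboot the node. */
                fprintf(stderr, "I/O errors detected on local disks\n");
                /* system("/sbin/reboot -fn"); */
            }
        }
        sleep(5);   /* polling interval is arbitrary in this sketch */
    }
    return NULL;    /* not reached; the buffer lives for the thread's lifetime */
}

int main(void)
{
    pthread_t tid;

    if (pthread_create(&tid, NULL, io_error_watch, NULL) != 0) {
        perror("pthread_create");
        return 1;
    }
    pthread_join(tid, NULL);
    return 0;
}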
(1) Watching the device for I/O errors works in the single-disk case. This is fairly obvious.
(2) Pulling one of the disks in a software RAID/LVM set will not generate I/O errors at the RAID/LVM level, so this case is covered.
(3) Pulling all disks in a software RAID/LVM set has not been tested yet. The assumption is that it will generate I/O errors, but this is not necessarily the case. [TODO]
(4) I/O error checking will work fine with any hardware or host-RAID controller (without additional software RAID/LVM), given that they appear as regular SCSI disks to the host.
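Whichever check is used, the watcher needs to know which block device actually backs / and /usr; with software RAID/LVM that is the md/LVM device, not a member disk, which is why cases (2) and (3) need separate handling. A small sketch (the helper name is mine, not from the patch) of doing that lookup once at daemon startup, while the local file systems are still healthy, by walking /etc/mtab:

/* mountdev.c - sketch: resolve the block device backing a mount point. */
#include <stdio.h>
#include <string.h>
#include <mntent.h>

/* Copy the device backing 'mountpoint' into 'dev'; return 0 on success. */
static int device_for_mountpoint(const char *mountpoint, char *dev, size_t len)
{
    FILE *mtab = setmntent("/etc/mtab", "r");
    struct mntent *ent;
    int found = -1;

    if (mtab == NULL)
        return -1;

    while ((ent = getmntent(mtab)) != NULL) {
        if (strcmp(ent->mnt_dir, mountpoint) == 0) {
            /* On software RAID/LVM this is e.g. /dev/md0 or an LVM device,
             * not the underlying member disks. */
            strncpy(dev, ent->mnt_fsname, len - 1);
            dev[len - 1] = '\0';
            found = 0;
            break;
        }
    }
    endmntent(mtab);
    return found;
}

int main(void)
{
    char dev[256];

    if (device_for_mountpoint("/", dev, sizeof(dev)) == 0)
        printf("/ is backed by %s\n", dev);
    if (device_for_mountpoint("/usr", dev, sizeof(dev)) == 0)
        printf("/usr is backed by %s\n", dev);
    return 0;
}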
A better implementation of this would be along the lines of the disk monitoring application we crafted for RHEL4 to supplant the quorum-disk monitoring we did previously.