Bug 147528

Summary: RFE: clumanager should monitor local disks for I/O errors
Product: [Retired] Red Hat Cluster Suite
Component: clumanager
Reporter: Lon Hohberger <lhh>
Assignee: Lon Hohberger <lhh>
QA Contact: Cluster QE <mspqa-list>
Status: CLOSED DEFERRED
Severity: medium
Priority: medium
Version: 3
CC: cluster-maint, kanderso, rkenna
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2006-01-27 21:07:05 UTC

Description Lon Hohberger 2005-02-08 20:13:10 UTC
Description of problem:

When running, clumanager will try to continue operating even if the
local file systems containing /etc and /usr/lib stop working properly,
for example when the disk is pulled out of the machine while it is
running.

Services cannot fail over and, in fact, enter the 'failed' state,
because the service script (/usr/lib/clumanager/services/service)
cannot be executed when the file system is not there.  Once a service
has entered the 'failed' state, it requires manual intervention to
restart.

Version-Release number of selected component (if applicable): 1.2.22-2

How reproducible: 100%

Steps to Reproduce:
1. Start clumanager
2. Pull local disks out of node
Actual results:
I/O errors appear on the console, but no action is taken.  Processes
hang, yet the heartbeat continues to operate, so the node is never
declared 'out' of the cluster.

Expected results:
Unknown.  Typically, when clumanager enters a state from which it
cannot recover automagically, it reboots in the hope that a clean boot
will save it.

Additional information:
I have implemented a thread which can be added to one of the daemons
in order to monitor the kernel's log messages for I/O errors on the
root and /usr partitions.

More work is necessary to make the service manager aware that the
local disks are misbehaving (for example, while trying to exec the
service script), but this is a step in the right direction.
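The log-watching thread described above can be sketched roughly as follows. This is an illustrative reconstruction, not clumanager's actual code; the function names and the set of kernel message patterns are assumptions for the sketch.

```python
import re
import threading
import time

# Messages the kernel typically logs on failed block I/O; the exact
# pattern set is an assumption for this sketch.
IO_ERROR_PATTERNS = [
    re.compile(r"I/O error.*dev (\w+)"),
    re.compile(r"Buffer I/O error on device (\w+)"),
]

def scan_for_io_errors(log_lines, watched_devices):
    """Return the subset of watched devices showing I/O errors in the log."""
    failed = set()
    for line in log_lines:
        for pat in IO_ERROR_PATTERNS:
            m = pat.search(line)
            if m and m.group(1) in watched_devices:
                failed.add(m.group(1))
    return failed

def start_monitor_thread(logfile, watched_devices, on_failure):
    """Follow the kernel log; call on_failure(devices) on the first hit."""
    def run():
        with open(logfile) as f:
            f.seek(0, 2)  # start at end of file, like tail -f
            while True:
                line = f.readline()
                if not line:
                    time.sleep(0.5)  # wait for new log output
                    continue
                bad = scan_for_io_errors([line], watched_devices)
                if bad:
                    on_failure(bad)
                    return
    t = threading.Thread(target=run, daemon=True)
    t.start()
    return t
```

In the scenario from this report, on_failure would be wired to the daemon's recovery path (e.g. a reboot), since local exec of the service script is no longer trustworthy once /usr is gone.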

Comment 2 Lon Hohberger 2005-02-08 21:10:06 UTC
(1) Watching the device for I/O errors works in the single-disk case.
 This is fairly obvious.

(2) Pulling one of the disks in a software RAID/LVM set will not
generate I/O errors at the RAID/LVM level, so this is covered.

(3) Pulling all disks in a software RAID/LVM set has not been tested
yet.  The assumption is that it will generate I/O errors, but this is
not necessarily the case. [TODO]

(4) I/O error checking will work fine with any hardware or host-RAID
controller (without additional software RAID/LVM), given that such
controllers appear as regular SCSI disks to the host.

Comment 4 Lon Hohberger 2005-10-04 21:17:41 UTC
A better implementation of this would resemble the disk-monitoring
application we crafted for RHEL4 to supplant our previous quorum-disk
monitoring.
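An active monitor of the kind comment 4 alludes to would probe the disk directly rather than parse kernel logs. A minimal sketch, under the assumption that an EIO on a raw read indicates a failed device; the helper name probe_device is hypothetical and this is not the RHEL4 application's actual code:

```python
import errno
import os

def probe_device(path, length=512):
    """Attempt a raw read from the device; return False on an I/O error.

    EIO is treated as "device failed" (an assumption for this sketch);
    any other error is re-raised so misconfiguration is not mistaken
    for disk failure.
    """
    try:
        fd = os.open(path, os.O_RDONLY)
        try:
            os.read(fd, length)
        finally:
            os.close(fd)
        return True
    except OSError as e:
        if e.errno == errno.EIO:
            return False
        raise
```

A daemon would call this periodically against the devices backing / and /usr and trigger recovery (e.g. reboot) when it returns False, avoiding the log-parsing fragility noted in comments 2 and 3.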