Bug 147528 - RFE: clumanager should monitor local disks for I/O errors
Status: CLOSED DEFERRED
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: clumanager
Version: 3
Hardware: All   OS: Linux
Priority: medium   Severity: medium
Assigned To: Lon Hohberger
QA Contact: Cluster QE
Reported: 2005-02-08 15:13 EST by Lon Hohberger
Modified: 2009-04-16 16:16 EDT

Doc Type: Bug Fix
Last Closed: 2006-01-27 16:07:05 EST
Description Lon Hohberger 2005-02-08 15:13:10 EST
Description of problem:

When running, clumanager will try to continue to operate even if the
local file systems containing /etc and /usr/lib stop working properly,
for example when the disk is pulled out of the machine while it is
running.

Services cannot fail over and, in fact, enter the 'failed' state,
because the service script (/usr/lib/clumanager/services/service)
cannot be executed when the file system is no longer there.  Once a
service has entered the 'failed' state, it requires manual
intervention to restart.


Version-Release number of selected component (if applicable): 1.2.22-2


How reproducible: 100%


Steps to Reproduce:
1. Start clumanager
2. Pull local disks out of node
  
Actual results:
I/O errors on the console; no action taken.  Processes hang, but
heartbeat continues to operate so the node is never declared 'out' of
the cluster.


Expected results:
Unknown.  Typically, when clumanager enters a state from which it
cannot recover automagically, it reboots in the hope that a clean boot
will save it.
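
For reference, that last-resort behavior boils down to something like
the following (a hypothetical helper, not clumanager's actual code):

#include <unistd.h>
#include <sys/reboot.h>

/* Last-resort recovery: flush whatever can still be flushed, then
 * reboot the node so a peer can take over.  Hypothetical sketch,
 * not clumanager's actual implementation; requires root. */
static void emergency_reboot(void)
{
        sync();               /* may hang or be a no-op if the disk is gone */
        reboot(RB_AUTOBOOT);  /* LINUX_REBOOT_CMD_RESTART */
}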


Additional information:
I have implemented a thread which can be added to one of the daemons
in order to monitor the kernel's log messages for I/O errors on the
root and /usr partitions.

More work is necessary to make the service manager aware of when the
local disks are misbehaving (for example, while trying to exec the
service script), but this is a step in the right direction.
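
As a rough illustration of the approach (a sketch only, not the actual
patch; reading /proc/kmsg directly is an assumption here, since a real
daemon would more likely use klogctl(3) so as not to steal messages
from klogd):

#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/reboot.h>

/* Watch the kernel log for block-layer I/O errors.  This sketch
 * reboots on any "I/O error" message; a real implementation would
 * first resolve / and /usr to their backing devices and match only
 * errors on those. */
static void *io_error_watcher(void *arg)
{
        FILE *kmsg = fopen("/proc/kmsg", "r");
        char line[1024];

        if (!kmsg)
                return NULL;

        while (fgets(line, sizeof(line), kmsg)) {
                if (strstr(line, "I/O error") == NULL)
                        continue;
                /* The disk under /etc and /usr/lib is failing;
                 * services can no longer be started or stopped
                 * cleanly, so reboot and let another node take over. */
                sync();
                reboot(RB_AUTOBOOT);
        }
        fclose(kmsg);
        return NULL;
}

Spawning it from one of the daemons is then a single
pthread_create(&tid, NULL, io_error_watcher, NULL) at startup.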
Comment 2 Lon Hohberger 2005-02-08 16:10:06 EST
(1) Watching the device for I/O errors works in the single-disk case.
 This is fairly obvious.

(2) Pulling one of the disks in a software RAID/LVM set will not
generate I/O errors at the RAID/LVM level, so that case is handled by
the RAID/LVM layer itself.

(3) Pulling all disks in a software RAID/LVM set has not been tested
yet.  The assumption is that it will generate I/O errors, but this is
not necessarily the case. [TODO]

(4) I/O error checking will work fine with any hardware or host-RAID
controller (without additional software RAID/LVM on top), since such
controllers appear as regular SCSI disks to the host.
Comment 4 Lon Hohberger 2005-10-04 17:17:41 EDT
A better implementation of this would be similar to the
disk-monitoring application we crafted for RHEL4 to supplant our
previous quorum-disk monitoring.
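
To sketch what such an active probe looks like (the device path,
sector size, and pass/fail interface below are illustrative
assumptions; a production monitor like qdiskd would also bound the
read with a timeout, since a dying disk often hangs rather than
failing fast):

#define _GNU_SOURCE           /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Periodically read one sector straight from the device, bypassing
 * the page cache, so a dead disk shows up as a read error instead of
 * a stale cache hit.  Returns 1 if the disk answered, 0 otherwise. */
static int probe_disk(const char *dev)
{
        void *buf;
        int fd, ok = 0;

        if (posix_memalign(&buf, 512, 512) != 0)
                return 0;

        fd = open(dev, O_RDONLY | O_DIRECT);
        if (fd >= 0) {
                ok = (read(fd, buf, 512) == 512);
                close(fd);
        }
        free(buf);
        return ok;
}

A monitoring loop would call this every few seconds and escalate
(reboot, or mark the node unfit to run services) after several
consecutive failures.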
