Description of problem:

In some situations, qdiskd can hang on I/O to shared storage. Currently, when this happens, the only breadcrumbs are on the other nodes, which report (at the debug log level):

  debug: Node 1 has missed an update 6/10

This is only visible if the administrator has configured qdiskd to log at the DEBUG level, and it is a poor way to indicate errors.

The purpose of this feature request is to allow qdiskd to report I/O hangs on the node where they occur, at the WARNING log level instead of DEBUG:

  warning: qdiskd: write (system call) has hung for 5 seconds
  warning: In 5 more seconds, we will be evicted
  warning: qdisk cycle took more than 1 second to complete (6.020000)

The presence of such a warning indicates that qdiskd is not at fault for a given failure to write, and gives administrators the ability to chase down or tune around I/O performance problems within their SAN environment.

The patch as designed implements a very simple state-checker thread, since that was less invasive/destabilizing than switching qdiskd's syscalls to AIO (the other possible implementation).
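For readers unfamiliar with the approach, here is a minimal sketch of what a state-checker thread of this kind could look like. It is not the code from the attached patch: the struct, the function names (io_watch_begin, io_watch_end, io_watch_check, io_watch_thread), the thresholds, and the message wording are all assumptions made for illustration. The idea is that the I/O thread records which system call it is about to make, and a second thread periodically checks how long that call has been outstanding.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical shared state; these names are illustrative, not qdiskd's. */
struct io_watch {
    pthread_mutex_t lock;
    const char *op;     /* system call in progress, NULL when idle */
    time_t started;     /* when the current call began */
    int warn_after;     /* seconds before a hang is reported (assumed) */
    int evict_after;    /* seconds before peers evict this node (assumed) */
    int running;        /* cleared to stop the checker thread */
};

void io_watch_init(struct io_watch *w, int warn_after, int evict_after)
{
    assert(w != NULL);
    pthread_mutex_init(&w->lock, NULL);
    w->op = NULL;
    w->started = 0;
    w->warn_after = warn_after;
    w->evict_after = evict_after;
    w->running = 1;
}

/* The I/O thread calls this just before a blocking system call... */
void io_watch_begin(struct io_watch *w, const char *op, time_t now)
{
    pthread_mutex_lock(&w->lock);
    w->op = op;
    w->started = now;
    pthread_mutex_unlock(&w->lock);
}

/* ...and this as soon as the call returns. */
void io_watch_end(struct io_watch *w)
{
    pthread_mutex_lock(&w->lock);
    w->op = NULL;
    pthread_mutex_unlock(&w->lock);
}

/*
 * Returns the number of seconds the current call has been outstanding if
 * that is at least warn_after, otherwise 0. On a hang, buf receives a
 * message; otherwise buf is set to the empty string.
 */
int io_watch_check(struct io_watch *w, time_t now, char *buf, size_t len)
{
    int hung = 0;

    assert(w != NULL && buf != NULL && len > 0);
    buf[0] = '\0';

    pthread_mutex_lock(&w->lock);
    if (w->op != NULL) {
        int elapsed = (int)(now - w->started);
        if (elapsed >= w->warn_after) {
            int left = w->evict_after - elapsed;
            if (left < 0)
                left = 0;
            hung = elapsed;
            snprintf(buf, len,
                     "%s (system call) has hung for %d seconds; "
                     "eviction in %d more seconds",
                     w->op, elapsed, left);
        }
    }
    pthread_mutex_unlock(&w->lock);
    return hung;
}

/* Checker thread body: polls once per second until running is cleared. */
void *io_watch_thread(void *arg)
{
    struct io_watch *w = arg;
    char msg[128];

    for (;;) {
        pthread_mutex_lock(&w->lock);
        int keep_going = w->running;
        pthread_mutex_unlock(&w->lock);
        if (!keep_going)
            break;
        if (io_watch_check(w, time(NULL), msg, sizeof(msg)) > 0)
            fprintf(stderr, "warning: %s\n", msg);
        sleep(1);
    }
    return NULL;
}
```

In this sketch the I/O thread would wrap each blocking call as io_watch_begin(&w, "write", time(NULL)); write(...); io_watch_end(&w); and the checker would be started once with pthread_create(&tid, NULL, io_watch_thread, &w). Because the checker only reads shared state, the existing synchronous I/O path stays unchanged, which is the property the description cites as the reason for preferring this over AIO.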
Created attachment 343640 [details] Implementation
http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=83a61282601bff7dd26e8bcf4ebd4b1f38d6e25c
Cause / Consequence: This is a new feature.
Fix: Add I/O hang reporting to qdiskd.
Result: Administrators can see I/O hang messages in logs on systems where they occur rather than on other systems in the cluster.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-1341.html