+++ This bug was initially created as a clone of Bug #500450 +++

Description of problem:

In some situations, qdiskd can hang on I/O to shared storage. Currently, when this happens, the only breadcrumbs are on the other nodes, which report (at the debug log level):

    debug: Node 1 has missed an update 6/10

This is noticeable only if the administrator has configured qdiskd to use the DEBUG log level, and it is a poor way to indicate errors.

The purpose of this feature request is to allow qdiskd to report I/O hangs on the node where they occur, at the WARNING log level instead of DEBUG:

    warning: qdiskd: write (system call) has hung for 5 seconds
    warning: In 5 more seconds, we will be evicted
    warning: qdisk cycle took more than 1 second to complete (6.020000)

The presence of such a warning indicates that qdiskd is not at fault for a given failure to write, and gives administrators the ability to chase down or tune around I/O performance problems within their SAN environment.

The patch as designed implements a very simple state-checker thread, since that was less invasive/destabilizing than switching qdiskd's syscalls to AIO (which is the other possible implementation).

--- Additional comment from lhh on 2009-05-12 14:26:20 EDT ---

Created an attachment (id=343640)
Implementation
Created attachment 343641
Implementation (rhel4)
http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=779a71b145323ad97e5b73a58178e20a357b5a11
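For readers who have not opened the attachments, the state-checker approach described in the original report can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: the names io_state, io_begin, io_end, io_check and io_watchdog are hypothetical, and the eviction estimate (interval * tko minus the time already hung) is an assumption inferred from the sample log lines above. It also uses wall-clock time for brevity, where a monotonic clock would be more robust.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical shared state; names are illustrative, not taken from qdiskd. */
struct io_state {
    pthread_mutex_t lock;
    const char *op;   /* name of the syscall in progress, NULL when idle */
    time_t started;   /* when the current syscall began */
    int interval;     /* seconds per qdisk cycle */
    int tko;          /* missed cycles before eviction (assumed meaning) */
    int stop;         /* set to nonzero to ask the watchdog to exit */
};

/* The main thread calls this just before a blocking read/write. */
void io_begin(struct io_state *s, const char *op)
{
    pthread_mutex_lock(&s->lock);
    s->op = op;
    s->started = time(NULL);
    pthread_mutex_unlock(&s->lock);
}

/* The main thread calls this as soon as the syscall returns. */
void io_end(struct io_state *s)
{
    pthread_mutex_lock(&s->lock);
    s->op = NULL;
    pthread_mutex_unlock(&s->lock);
}

/*
 * Returns the number of seconds the current syscall has been pending if
 * that is at least one interval, otherwise 0. On a nonzero return,
 * *secs_left holds the assumed time remaining before eviction.
 * The caller must hold s->lock (or be the only thread touching s).
 */
int io_check(const struct io_state *s, time_t now, int *secs_left)
{
    assert(s->interval > 0 && s->tko > 0);
    if (s->op == NULL)
        return 0;
    int hung = (int)(now - s->started);
    if (hung < s->interval)
        return 0;
    *secs_left = s->interval * s->tko - hung;
    return hung;
}

/* Watchdog thread body: polls once per second and warns on a hang. */
void *io_watchdog(void *arg)
{
    struct io_state *s = arg;
    for (;;) {
        int left = 0;
        pthread_mutex_lock(&s->lock);
        int stop = s->stop;
        const char *op = s->op;
        int hung = io_check(s, time(NULL), &left);
        pthread_mutex_unlock(&s->lock);
        if (stop)
            break;
        if (hung > 0) {
            fprintf(stderr, "warning: qdiskd: %s (system call) has hung "
                    "for %d seconds\n", op, hung);
            if (left > 0)
                fprintf(stderr, "warning: In %d more seconds, we will be "
                        "evicted\n", left);
        }
        sleep(1);
    }
    return NULL;
}
```

In this sketch the main loop would start io_watchdog with pthread_create and wrap each blocking call in io_begin(&state, "write") and io_end(&state). Because the watchdog only reads a timestamp, it can keep logging while the main thread is stuck in the kernel, which is presumably why this route is less invasive than converting the I/O path to AIO.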
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Previously, Qdiskd on RHEL4 did not check if input/output (I/O) failed for tko interval times, relying only on cman kill to evict a node. With this update, Qdisk logs better when it becomes suspended on Input/Output.
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1 @@
-Previously, Qdiskd on RHEL4 did not check if input/output (I/O) failed for tko interval times,relying only on cman kill to evict a node. With this update, Qdisk logs better when it becomes suspended on Input/Output.
+Previously, Qdiskd on RHEL4 did not check if input/output (I/O) failed for tko interval times,relying only on cman kill to evict a node. With this update, Qdisk logs better when it becomes suspended on I/O.
An advisory has been issued which should help resolve the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0271.html