Description of problem:
If one node in a 3-node cluster suffers a network outage, it gets fenced after a certain period. When the fencing uses a "softer" method (i.e. scsi_fence, not power fencing), the node can request to rejoin as soon as the outage is over. When this happens, the request is rejected. Up to this point everything works as expected. But seconds after the rejection, the remaining two healthy nodes start filling the console with kernel BUGs, and both have a dlm_send process consuming 100% of CPU. The only way out is a machine reset (i.e. a whole cluster restart).

Version-Release number of selected component (if applicable):
cman-2.0.115-6.el5
openais-0.80.6-8.el5
kernel-2.6.18-164.el5
Linux a2 2.6.18-164.el5 #1 SMP Tue Aug 18 15:54:55 EDT 2009 ia64 ia64 ia64 GNU/Linux

How reproducible:
100%

Steps to Reproduce:
1. Form a 3-node cluster configured with scsi_fence.
2. Log in to one of the nodes and disable outgoing traffic (iptables -I OUTPUT -j DROP).
3. Wait until the node gets fenced and cluster operations resume in 2-node mode.
4. Restore network connectivity (iptables -D OUTPUT 1).
5. The affected node is rejected, and all other nodes end up with a hung dlm_send and console messages.

Actual results:
Cluster hang.

Expected results:
The rejected node is kicked; the other two continue as if the rejoin request never happened.

Additional info:
See attachments. The nodes are a1, a2, a3. The affected node was a3.
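For anyone re-running the reproducer, steps 2 to 4 can be sketched as a small script to run on the node being isolated (a3 in this report). This is only a rough sketch: the `DRY_RUN` and `FENCE_WAIT` variables and the 120-second default are assumptions made for illustration, not values taken from the cluster configuration. With `DRY_RUN=1` (the default here) the commands are only printed, not executed.

```shell
#!/bin/sh
# Sketch of reproduction steps 2-4; run as root on the node to isolate.
# DRY_RUN and FENCE_WAIT are hypothetical knobs for this sketch only.
DRY_RUN="${DRY_RUN:-1}"          # 1 = print commands instead of running them
FENCE_WAIT="${FENCE_WAIT:-120}"  # assumed wait; adjust to your fencing timeout

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

reproduce() {
    # Step 2: drop all outgoing traffic from this node.
    run iptables -I OUTPUT -j DROP
    # Step 3: wait for the node to be fenced and the cluster to resume.
    run sleep "$FENCE_WAIT"
    # Step 4: remove the rule inserted above (rule 1 in the OUTPUT chain).
    run iptables -D OUTPUT 1
}

reproduce
```

Set `DRY_RUN=0` to actually apply the rules. Step 3 is better confirmed from a healthy node's /var/log/messages than by a fixed sleep; the wait here is only a placeholder.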
Created attachment 359694 [details] a1 /var/log/messages
Created attachment 359695 [details] a2 /var/log/messages
Created attachment 359696 [details] a3 /var/log/messages
I believe this is fixed by an upstream patch queued for the 2.6.32 merge window:
http://people.redhat.com/~teigland/0001-dlm-fix-connection-close-handling.patch

Bug 508829 was actually a fix for a symptom of this larger bug.
in kernel-2.6.18-179.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5

Please update the appropriate value in the Verified field (cf_verified) to indicate this fix has been successfully verified. Include a comment with verification details.
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~ RHEL 5.5 Beta has been released! There should be a fix present in this release that addresses your request. Please test and report back results here, by March 3rd 2010 (2010-03-03) or sooner. Upon successful verification of this request, post your results and update the Verified field in Bugzilla with the appropriate value. If you encounter any issues while testing, please describe them and set this bug into NEED_INFO. If you encounter new defects or have additional patch(es) to request for inclusion, please clone this bug per each request and escalate through your support representative.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html