Red Hat Bugzilla – Bug 521093
Cluster hangs after node rejoins from simulated network outage
Last modified: 2016-04-26 09:45:02 EDT
Description of problem:
If one node in a 3-node cluster suffers a network outage, it gets fenced after a certain period. When the fencing is a "softer" method (i.e. scsi_fence, not power fencing), the node can request to rejoin as soon as the outage is over. When this happens, it gets rejected. Up to this point everything works as expected.
But seconds after the rejection, the remaining two healthy nodes start filling the console with kernel BUGs, and both have a dlm_send process consuming 100% of CPU. The only way out is a machine reset (i.e. a whole-cluster restart).
Version-Release number of selected component (if applicable):
Linux a2 2.6.18-164.el5 #1 SMP Tue Aug 18 15:54:55 EDT 2009 ia64 ia64 ia64 GNU/Linux
Steps to Reproduce:
1. Form a 3-node cluster configured with scsi_fence
2. Log in to one of the nodes and disable outgoing traffic (iptables -I OUTPUT -j DROP)
3. Wait until the node gets fenced and cluster operations resume in 2-node mode
4. Restore network connectivity (iptables -D OUTPUT 1)
5. The affected node is rejected, and all other nodes end up with a dlm_send hang and console messages
Expected results:
The rejected node is kicked; the other 2 continue as if the rejoin request never happened.
Additional info:
See attachments. The nodes are a1, a2, a3. The affected node was a3.
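The outage simulation from the reproduction steps can be sketched as a small script to run on the node being isolated (a3 in this report). This is only a sketch under assumptions: `DRY_RUN`, `OUTAGE_SECS`, `run`, and `simulate_outage` are hypothetical names introduced here, and the 120-second default wait is a guess that must be longer than the cluster's actual fencing delay. The two iptables commands are the ones given in the steps.

```shell
#!/bin/sh
# Sketch of the outage simulation (steps 2-4). Run as root on the node to isolate.
# DRY_RUN and OUTAGE_SECS are hypothetical knobs, not part of the original report.

DRY_RUN="${DRY_RUN:-1}"            # 1 = only print the commands (safe default)
OUTAGE_SECS="${OUTAGE_SECS:-120}"  # assumed wait; must exceed the fencing delay

run() {
    # In dry-run mode print the command, otherwise execute it.
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

simulate_outage() {
    # Step 2: drop all outgoing traffic by inserting a rule at position 1.
    run iptables -I OUTPUT -j DROP
    # Step 3: wait for the node to be fenced and the cluster to resume.
    run sleep "$OUTAGE_SECS"
    # Step 4: delete rule 1 of OUTPUT, i.e. the DROP rule inserted above
    # (assumes no other rule was inserted at the top in the meantime).
    run iptables -D OUTPUT 1
}
```

With the defaults, calling `simulate_outage` only prints the three commands. Setting `DRY_RUN=0` before the call would actually cut and restore connectivity, so it is best run from the node's local console rather than over the network.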
Created attachment 359694
Created attachment 359695
Created attachment 359696
I believe this is fixed by an upstream patch queued for the 2.6.32 merge window:
The fix for bug 508829 actually addressed a symptom of this larger bug.
You can download this test kernel from http://people.redhat.com/dzickus/el5
Please update the appropriate value in the Verified field
(cf_verified) to indicate this fix has been successfully
verified. Include a comment with verification details.
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~
RHEL 5.5 Beta has been released! There should be a fix present in this
release that addresses your request. Please test and report back results
here, by March 3rd 2010 (2010-03-03) or sooner.
Upon successful verification of this request, post your results and update
the Verified field in Bugzilla with the appropriate value.
If you encounter any issues while testing, please describe them and set
this bug into NEED_INFO. If you encounter new defects or have additional
patch(es) to request for inclusion, please clone this bug per each request
and escalate through your support representative.
An advisory has been issued which should help resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.