
Bug 521093

Summary: Cluster hangs after node rejoins from simulated network outage
Product: Red Hat Enterprise Linux 5
Reporter: Jaroslav Kortus <jkortus>
Component: kernel
Assignee: David Teigland <teigland>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: medium
Priority: medium
Version: 5.4
CC: bcm, cluster-maint, cward, edamato, tao
Target Milestone: rc
Target Release: ---
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2010-03-30 07:39:41 UTC
Bug Blocks: 533192
Attachments:
a1 /var/log/messages
a2 /var/log/messages
a3 /var/log/messages

Description Jaroslav Kortus 2009-09-03 15:27:32 UTC
Description of problem:
If one node in a 3-node cluster suffers a network outage, it gets fenced after a certain period. When the fencing uses a "softer" method (i.e. scsi_fence, not power fencing), the node can request to rejoin as soon as the outage is over. When this happens, the request is rejected. Up to this point everything works as expected.

But seconds after the rejection, the two remaining healthy nodes start filling the console with kernel BUG messages, and on both of them the dlm_send process consumes 100% of CPU. The only way out is a machine reset (i.e. a restart of the whole cluster).

Version-Release number of selected component (if applicable):
cman-2.0.115-6.el5
openais-0.80.6-8.el5
kernel-2.6.18-164.el5
Linux a2 2.6.18-164.el5 #1 SMP Tue Aug 18 15:54:55 EDT 2009 ia64 ia64 ia64 GNU/Linux

How reproducible:
100%

Steps to Reproduce:
1. Form a 3-node cluster configured with scsi_fence.
2. Log in to one of the nodes and disable outgoing traffic (iptables -I OUTPUT -j DROP).
3. Wait until the node gets fenced and cluster operations resume in 2-node mode.
4. Restore network connectivity (iptables -D OUTPUT 1).
5. The affected node is rejected, and all the other nodes end up with dlm_send hanging and console messages.
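
The two iptables commands from steps 2 and 4 can be wrapped in a small helper so the outage is easy to start and stop on the node under test. This is only a sketch: the DRY_RUN variable and the function names run, start_outage and end_outage are illustrative assumptions, not part of the original report. With DRY_RUN=1 (the default here) the commands are printed rather than executed, so nothing changes on the machine until DRY_RUN=0 is set and the script is run as root.

```shell
#!/bin/sh
# Sketch of steps 2 and 4 only; fencing and rejoin (steps 3 and 5) are
# observed by hand. DRY_RUN and all function names are hypothetical.

DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Step 2: insert a DROP rule at the top of the OUTPUT chain,
# cutting all outgoing traffic from this node.
start_outage() {
    run iptables -I OUTPUT -j DROP
}

# Step 4: delete rule number 1 of the OUTPUT chain, which is the
# DROP rule inserted above as long as no other rule was added since.
end_outage() {
    run iptables -D OUTPUT 1
}
```

On the node chosen for the outage, start_outage corresponds to step 2 and end_outage to step 4. Deleting by rule number assumes the OUTPUT chain was not modified in between.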
  
Actual results:
cluster hang

Expected results:
The rejected node is kicked, and the other 2 nodes continue as if the rejoin request never happened.

Additional info:
See attachments. The nodes are a1, a2 and a3; the affected node was a3.

Comment 1 Jaroslav Kortus 2009-09-03 15:28:31 UTC
Created attachment 359694 [details]
a1 /var/log/messages

Comment 2 Jaroslav Kortus 2009-09-03 15:29:02 UTC
Created attachment 359695 [details]
a2 /var/log/messages

Comment 3 Jaroslav Kortus 2009-09-03 15:29:24 UTC
Created attachment 359696 [details]
a3 /var/log/messages

Comment 4 David Teigland 2009-09-03 15:44:48 UTC
I believe this is fixed by upstream patch queued for 2.6.32 merge window:
http://people.redhat.com/~teigland/0001-dlm-fix-connection-close-handling.patch

bug 508829 was actually a fix for a symptom of this larger bug.

Comment 8 Don Zickus 2009-12-11 19:29:47 UTC
in kernel-2.6.18-179.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please update the appropriate value in the Verified field
(cf_verified) to indicate this fix has been successfully
verified. Include a comment with verification details.

Comment 10 Chris Ward 2010-02-11 10:20:02 UTC
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~

RHEL 5.5 Beta has been released! There should be a fix present in this 
release that addresses your request. Please test and report back results 
here, by March 3rd 2010 (2010-03-03) or sooner.

Upon successful verification of this request, post your results and update 
the Verified field in Bugzilla with the appropriate value.

If you encounter any issues while testing, please describe them and set 
this bug into NEED_INFO. If you encounter new defects or have additional 
patch(es) to request for inclusion, please clone this bug per each request
and escalate through your support representative.

Comment 13 errata-xmlrpc 2010-03-30 07:39:41 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html