Bug 521093 - Cluster hangs after node rejoins from simulated network outage
Summary: Cluster hangs after node rejoins from simulated network outage
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.4
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: David Teigland
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks: 533192
Reported: 2009-09-03 15:27 UTC by Jaroslav Kortus
Modified: 2018-10-27 15:40 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-03-30 07:39:41 UTC
Target Upstream Version:
Embargoed:


Attachments
a1 /var/log/messages (16.77 KB, text/plain)
2009-09-03 15:28 UTC, Jaroslav Kortus
a2 /var/log/messages (20.53 KB, text/plain)
2009-09-03 15:29 UTC, Jaroslav Kortus
a3 /var/log/messages (4.06 KB, text/plain)
2009-09-03 15:29 UTC, Jaroslav Kortus


Links
System: Red Hat Product Errata
ID: RHSA-2010:0178
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 5.5 kernel security and bug fix update
Last Updated: 2010-03-29 12:18:21 UTC

Description Jaroslav Kortus 2009-09-03 15:27:32 UTC
Description of problem:
If one node in a 3-node cluster suffers a network outage, it gets fenced after a certain period. When the fencing uses a "softer" method (i.e. scsi_fence, not power fencing), the node can request to rejoin as soon as the outage is over. When this happens, the request is rejected. Up to this point everything works as expected.

But seconds after the rejection, the remaining two healthy nodes start filling the console with kernel BUGs, and on both of them the dlm_send process consumes 100% of the CPU. The only way out is a machine reset (i.e. a restart of the whole cluster).

Version-Release number of selected component (if applicable):
cman-2.0.115-6.el5
openais-0.80.6-8.el5
kernel-2.6.18-164.el5
Linux a2 2.6.18-164.el5 #1 SMP Tue Aug 18 15:54:55 EDT 2009 ia64 ia64 ia64 GNU/Linux

How reproducible:
100%

Steps to Reproduce:
1. Form a 3-node cluster configured with scsi_fence.
2. Log in to one of the nodes and disable outgoing traffic (iptables -I OUTPUT -j DROP).
3. Wait until the node gets fenced and cluster operations resume in 2-node mode.
4. Restore network connectivity (iptables -D OUTPUT 1).
5. The affected node is rejected, and all other nodes end up with a dlm_send hang and console messages.
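
Steps 2 to 4 could be scripted roughly as follows. This is only a sketch: the two iptables commands are the ones given in the steps, while the IPTABLES variable, the function names and the fixed outage duration are assumptions made for illustration. Nothing here detects when fencing has actually completed, so the duration has to be chosen long enough for the node to be fenced.

```shell
#!/bin/sh
# Sketch of steps 2-4 of the reproducer. IPTABLES, the function names and the
# fixed outage duration are illustrative assumptions, not part of the report.
IPTABLES="${IPTABLES:-iptables}"

# Step 2: drop all outgoing traffic. -I without a rule number inserts the
# rule at position 1 of the OUTPUT chain.
block_outgoing() {
    $IPTABLES -I OUTPUT -j DROP
}

# Step 4: delete rule 1 of the OUTPUT chain, which is the DROP rule inserted
# above as long as nothing else was inserted at the top in the meantime.
restore_outgoing() {
    $IPTABLES -D OUTPUT 1
}

# Steps 2-4 in sequence. $1 is how long to keep the outage, in seconds; it
# must be long enough for the other nodes to fence this one (step 3).
simulate_outage() {
    outage_seconds="$1"
    block_outgoing || return 1
    sleep "$outage_seconds"
    restore_outgoing
}
```

Run as root on the node to be isolated, for example `simulate_outage 120`. Setting `IPTABLES="echo iptables"` beforehand prints the commands instead of executing them.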
  
Actual results:
cluster hang

Expected results:
The rejected node is kicked; the other two continue as if the rejoin request never happened.

Additional info:
See attachments. The nodes are a1, a2 and a3; the affected node was a3.

Comment 1 Jaroslav Kortus 2009-09-03 15:28:31 UTC
Created attachment 359694
a1 /var/log/messages

Comment 2 Jaroslav Kortus 2009-09-03 15:29:02 UTC
Created attachment 359695
a2 /var/log/messages

Comment 3 Jaroslav Kortus 2009-09-03 15:29:24 UTC
Created attachment 359696
a3 /var/log/messages

Comment 4 David Teigland 2009-09-03 15:44:48 UTC
I believe this is fixed by an upstream patch queued for the 2.6.32 merge window:
http://people.redhat.com/~teigland/0001-dlm-fix-connection-close-handling.patch

Bug 508829 was actually a fix for a symptom of this larger bug.

Comment 8 Don Zickus 2009-12-11 19:29:47 UTC
in kernel-2.6.18-179.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please update the appropriate value in the Verified field
(cf_verified) to indicate this fix has been successfully
verified. Include a comment with verification details.
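
Before verifying, it may help to confirm that the node under test runs a kernel build at or after 2.6.18-179.el5. The helper below is a minimal sketch; the function name is hypothetical, and it assumes that later 2.6.18 RHEL 5 builds keep the fix and that the release string has the form printed by `uname -r` (for example 2.6.18-164.el5).

```shell
# Returns 0 if the given kernel release is a 2.6.18 build numbered 179 or
# higher, 1 otherwise. kernel_has_fix is a hypothetical helper name.
kernel_has_fix() {
    release="$1"
    case "$release" in
        2.6.18-*) ;;             # only RHEL 5 style 2.6.18 releases qualify
        *) return 1 ;;
    esac
    build="${release#2.6.18-}"   # e.g. "164.el5"
    build="${build%%.*}"         # e.g. "164"
    case "$build" in
        ''|*[!0-9]*) return 1 ;; # reject anything that is not a plain number
    esac
    [ "$build" -ge 179 ]
}
```

Typical use: `kernel_has_fix "$(uname -r)" && echo "test kernel or later" || echo "kernel predates 2.6.18-179.el5"`.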

Comment 10 Chris Ward 2010-02-11 10:20:02 UTC
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~

RHEL 5.5 Beta has been released! There should be a fix present in this 
release that addresses your request. Please test and report back results 
here, by March 3rd 2010 (2010-03-03) or sooner.

Upon successful verification of this request, post your results and update 
the Verified field in Bugzilla with the appropriate value.

If you encounter any issues while testing, please describe them and set 
this bug into NEED_INFO. If you encounter new defects or have additional 
patch(es) to request for inclusion, please clone this bug per each request
and escalate through your support representative.

Comment 13 errata-xmlrpc 2010-03-30 07:39:41 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html

