Bug 521093 - Cluster hangs after node rejoins from simulated network outage
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Hardware: All Linux
Priority: medium Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: David Teigland
QA Contact: Cluster QE
Depends On:
Blocks: 533192
Reported: 2009-09-03 11:27 EDT by Jaroslav Kortus
Modified: 2016-04-26 09:45 EDT

See Also:
Fixed In Version:
Doc Type: Bug Fix
Last Closed: 2010-03-30 03:39:41 EDT

Attachments
a1 /var/log/messages (16.77 KB, text/plain), 2009-09-03 11:28 EDT, Jaroslav Kortus
a2 /var/log/messages (20.53 KB, text/plain), 2009-09-03 11:29 EDT, Jaroslav Kortus
a3 /var/log/messages (4.06 KB, text/plain), 2009-09-03 11:29 EDT, Jaroslav Kortus

Description Jaroslav Kortus 2009-09-03 11:27:32 EDT
Description of problem:
If one node in a 3-node cluster suffers a network outage, it gets fenced after a certain period. When the fencing uses a "softer" method (e.g. scsi_fence rather than power fencing), the node can request to rejoin as soon as the outage is over. When this happens, the request is rejected. Up to this point everything works as expected.

But seconds after the rejection, the remaining two healthy nodes start filling the console with kernel BUG messages, and on both of them the dlm_send process consumes 100% CPU. The only way out is a machine reset (i.e. a whole cluster restart).
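
To confirm the dlm_send symptom on a surviving node, a filter like the following may help. This is only a sketch: `filter_dlm_send` is a hypothetical helper name, and it assumes the thread shows up in `ps` with the command name `dlm_send`. Note that the `pcpu` column of `ps` is CPU time averaged over the lifetime of the process, not an instantaneous reading.

```shell
# Hypothetical helper: reads `ps -eo pid,pcpu,comm` style output on stdin
# and prints "PID %CPU" for every entry whose command name is dlm_send.
filter_dlm_send() {
    awk '$3 == "dlm_send" { print $1, $2 }'
}

# Example invocation on a live node (left commented out in this sketch):
# ps -eo pid,pcpu,comm | filter_dlm_send
```

If the hang described above is present, the reported CPU share for dlm_send should keep climbing between successive runs.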

Version-Release number of selected component (if applicable):
Linux a2 2.6.18-164.el5 #1 SMP Tue Aug 18 15:54:55 EDT 2009 ia64 ia64 ia64 GNU/Linux

How reproducible:

Steps to Reproduce:
1. Form a 3-node cluster configured with scsi_fence
2. Log in to one of the nodes and disable outgoing traffic (iptables -I OUTPUT -j DROP)
3. Wait until the node gets fenced and cluster operations resume in 2-node mode
4. Restore network connectivity (iptables -D OUTPUT 1)
5. The affected node is rejected, and all the others end up with a dlm_send hang and console messages
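
Steps 2 and 4 above can be wrapped in a small script, sketched here. The function names and the `RUN` variable are illustrative assumptions, not part of the original report; setting `RUN=echo` prints the commands instead of executing them. The script must run as root on the node under test, and `iptables -D OUTPUT 1` only removes the right rule if nothing else was inserted at the top of the OUTPUT chain in the meantime.

```shell
# Hypothetical dry-run switch: leave empty to execute, set RUN=echo to print.
RUN=${RUN:-}

start_outage() {
    # Step 2: insert a DROP rule at the top of the OUTPUT chain,
    # which blocks all outgoing traffic from this node.
    $RUN iptables -I OUTPUT -j DROP
}

end_outage() {
    # Step 4: delete rule number 1 from the OUTPUT chain, i.e. the
    # DROP rule inserted by start_outage (assuming it is still first).
    $RUN iptables -D OUTPUT 1
}
```

Call `start_outage`, wait until the other nodes report that this node has been fenced (step 3), then call `end_outage`.
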
Actual results:
cluster hang

Expected results:
The rejected node is kicked; the other two continue as if the rejoin request never happened.

Additional info:
See attachments. The nodes are a1, a2, and a3. The affected node was a3.
Comment 1 Jaroslav Kortus 2009-09-03 11:28:31 EDT
Created attachment 359694
a1 /var/log/messages
Comment 2 Jaroslav Kortus 2009-09-03 11:29:02 EDT
Created attachment 359695
a2 /var/log/messages
Comment 3 Jaroslav Kortus 2009-09-03 11:29:24 EDT
Created attachment 359696
a3 /var/log/messages
Comment 4 David Teigland 2009-09-03 11:44:48 EDT
I believe this is fixed by an upstream patch queued for the 2.6.32 merge window:

bug 508829 was actually a fix for a symptom of this larger bug.
Comment 8 Don Zickus 2009-12-11 14:29:47 EST
in kernel-2.6.18-179.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please update the appropriate value in the Verified field
(cf_verified) to indicate this fix has been successfully
verified. Include a comment with verification details.
Comment 10 Chris Ward 2010-02-11 05:20:02 EST
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~

RHEL 5.5 Beta has been released! There should be a fix present in this 
release that addresses your request. Please test and report back results 
here, by March 3rd 2010 (2010-03-03) or sooner.

Upon successful verification of this request, post your results and update 
the Verified field in Bugzilla with the appropriate value.

If you encounter any issues while testing, please describe them and set 
this bug into NEED_INFO. If you encounter new defects or have additional 
patch(es) to request for inclusion, please clone this bug per each request
and escalate through your support representative.
Comment 13 errata-xmlrpc 2010-03-30 03:39:41 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

