Bug 442898

Summary: QDisk freezes cluster when FC is disconnected
Product: [Retired] Red Hat Cluster Suite Reporter: Lon Hohberger <lhh>
Component: cman Assignee: Lon Hohberger <lhh>
Status: CLOSED ERRATA QA Contact: Cluster QE <mspqa-list>
Severity: high Docs Contact:
Priority: medium    
Version: 4 CC: clasohm, cluster-maint, edamato
Target Milestone: ---   
Target Release: ---   
Hardware: i386   
OS: Linux   
Whiteboard:
Fixed In Version: RHBA-2008-0799 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2008-07-25 19:07:10 UTC Type: ---

Description Lon Hohberger 2008-04-17 14:21:19 UTC
+++ This bug was initially created as a clone of Bug #442541 +++

Description of problem:
In a 2-node cluster using qdisk as a tie-breaker, disconnecting all Fibre
Channel cables from one node (let's call it "node1") has two results:
1) node1's cman gets killed by node2
2) node2 gets stuck and does not take over the service

Version-Release number of selected component (if applicable):
cman-2.0.73-1.el5_1.5-i386

How reproducible:
Always

Steps to Reproduce:
1. Configure a simple 2-node cluster with a quorum disk enabled
2. Unplug the FC cables connecting to the quorum disk from one node (let's say node1)
  
Actual results:
On node1 (from /var/log/messages):
openais[2870]: [CMAN ] cman killed by node 2 because we were killed by cman_tool or other application
On node2:
qdiskd[2973]: <notice> Writing eviction notice for node 2
qdiskd[2973]: <notice> Node 2 evicted
qdiskd[2973]: <crit> Node 2 is undead.
qdiskd[2973]: <alert> Writing eviction notice for node 2
qdiskd[2973]: <crit> Node 2 is undead.
qdiskd[2973]: <alert> Writing eviction notice for node 2
qdiskd[2973]: <crit> Node 2 is undead.
...and here it gets stuck forever
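
The stuck state shows up in the log as the same "Node N is undead" message repeating. A minimal Python sketch for spotting that pattern, assuming syslog lines in the format quoted above. The function names and the default threshold of 3 are hypothetical and chosen for illustration; they are not part of qdiskd or cman.

```python
import re
from collections import Counter

# Matches the qdiskd message quoted in this report, e.g.
# "qdiskd[2973]: <crit> Node 2 is undead."
UNDEAD_RE = re.compile(r"qdiskd\[\d+\]: <crit> Node (\d+) is undead\.")


def count_undead(lines):
    """Return a Counter mapping node ID -> number of 'is undead' messages."""
    counts = Counter()
    for line in lines:
        match = UNDEAD_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


def stuck_nodes(lines, threshold=3):
    """Return sorted node IDs reported undead at least `threshold` times.

    The threshold is an arbitrary cutoff for this sketch, not a qdiskd setting.
    """
    counts = count_undead(lines)
    return sorted(node for node, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    # Assumed log location, taken from the report above.
    with open("/var/log/messages", errors="replace") as log:
        print(stuck_nodes(log))
```

Run against the node2 excerpt above, `stuck_nodes` would flag node 2, since the "is undead" message appears three times.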

Expected results:
node2 should have fenced node1 and should have brought up the service

Additional info:
Reconnecting FC and manually resetting both nodes produces a clean start and a
working cluster. However, unplugging the cables again always reproduces the
problem.

-- Additional comment from lhh on 2008-04-15 10:58 EST --
Created an attachment (id=302467)
Fix.


-- Additional comment from lhh on 2008-04-15 13:51 EST --
Note - Fix is not in 5.2.

-- Additional comment from lhh on 2008-04-17 10:18 EST --
Fix is in RHEL5 branch of git and had already been applied to stable2 and master.




=== Clone for RHEL4 ===
Bug is fixed in RHEL4 (e.g. 4.8) branch.

Comment 2 RHEL Program Management 2008-04-17 14:40:14 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 4 Lon Hohberger 2008-04-18 15:33:10 UTC
Appears to be fixed in rhel47 branch:

http://sources.redhat.com/git/?p=cluster.git;a=commit;h=5eec9c0832cd1c91d00d2f3e4bd42389a5cbc7bb

Comment 7 errata-xmlrpc 2008-07-25 19:07:10 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0799.html