Bug 442898

Summary:	QDisk freezes cluster when FC is disconnected
Product:	[Retired] Red Hat Cluster Suite	Reporter:	Lon Hohberger <lhh>
Component:	cman	Assignee:	Lon Hohberger <lhh>
Status:	CLOSED ERRATA	QA Contact:	Cluster QE <mspqa-list>
Severity:	high	Docs Contact:
Priority:	medium
Version:	4	CC:	clasohm, cluster-maint, edamato
Target Milestone:	---
Target Release:	---
Hardware:	i386
OS:	Linux
Whiteboard:
Fixed In Version:	RHBA-2008-0799	Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2008-07-25 19:07:10 UTC	Type:	---

Description Lon Hohberger 2008-04-17 14:21:19 UTC

+++ This bug was initially created as a clone of Bug #442541 +++

Description of problem:
In a two-node cluster using qdisk as a tie-breaker, disconnecting all Fibre
Channel cables from one node (let's call it "node1") has two results:
1) node1's cman gets killed by node2
2) node2 is stuck, and does not take over the service

Version-Release number of selected component (if applicable):
cman-2.0.73-1.el5_1.5-i386

How reproducible:
Always

Steps to Reproduce:
1. Configure a simple two-node cluster with a quorum disk (qdisk) enabled.
2. Unplug the FC cables connecting one node (say, node1) to the quorum disk.
  
Actual results:
On node1 (from /var/log/messages):
openais[2870]: [CMAN ] cman killed by node 2 because we were killed by cman_tool or other application
On node2:
qdiskd[2973]: <notice> Writing eviction notice for node 2
qdiskd[2973]: <notice> Node 2 evicted
qdiskd[2973]: <crit> Node 2 is undead.
qdiskd[2973]: <alert> Writing eviction notice for node 2
qdiskd[2973]: <crit> Node 2 is undead.
qdiskd[2973]: <alert> Writing eviction notice for node 2
qdiskd[2973]: <crit> Node 2 is undead.
...and here it gets stuck forever
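
As a rough aid for spotting this loop in a log, the sketch below counts repeated "is undead" messages from qdiskd per node. It is only an illustration built on the message format shown above; the function names and the threshold of 3 are assumptions, not part of qdiskd or any Red Hat tool.

```python
import re
from collections import Counter

# Matches qdiskd lines in the format quoted in this report, e.g.:
#   qdiskd[2973]: <crit> Node 2 is undead.
UNDEAD_RE = re.compile(r"qdiskd\[\d+\]: <\w+> Node (\d+) is undead")


def count_undead(lines):
    """Return a Counter mapping node ID to its number of 'is undead' messages."""
    counts = Counter()
    for line in lines:
        match = UNDEAD_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


def stuck_nodes(lines, threshold=3):
    """Return sorted node IDs reported undead at least `threshold` times.

    The threshold is an arbitrary assumption chosen for illustration.
    """
    counts = count_undead(lines)
    return sorted(node for node, seen in counts.items() if seen >= threshold)


if __name__ == "__main__":
    # Sample taken from the log excerpt in this report.
    sample = [
        "qdiskd[2973]: <notice> Writing eviction notice for node 2",
        "qdiskd[2973]: <notice> Node 2 evicted",
        "qdiskd[2973]: <crit> Node 2 is undead.",
        "qdiskd[2973]: <alert> Writing eviction notice for node 2",
        "qdiskd[2973]: <crit> Node 2 is undead.",
        "qdiskd[2973]: <alert> Writing eviction notice for node 2",
        "qdiskd[2973]: <crit> Node 2 is undead.",
    ]
    print(stuck_nodes(sample))
```

To check a real log, the same functions could be fed the lines of /var/log/messages instead of the sample list.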

Expected results:
node2 should have fenced node1 and taken over the service.

Additional info:
Reconnecting the FC cables and manually resetting both nodes produces a clean
start and a working cluster. However, unplugging the cables again always
reproduces the problem.

-- Additional comment from lhh on 2008-04-15 10:58 EST --
Created an attachment (id=302467)
Fix.


-- Additional comment from lhh on 2008-04-15 13:51 EST --
Note - Fix is not in 5.2.

-- Additional comment from lhh on 2008-04-17 10:18 EST --
Fix is in RHEL5 branch of git and had already been applied to stable2 and master.




=== Clone for RHEL4 ===
Bug is fixed in RHEL4 (e.g. 4.8) branch.

Comment 2 RHEL Program Management 2008-04-17 14:40:14 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 4 Lon Hohberger 2008-04-18 15:33:10 UTC

Appears to be fixed in rhel47 branch:

http://sources.redhat.com/git/?p=cluster.git;a=commit;h=5eec9c0832cd1c91d00d2f3e4bd42389a5cbc7bb

Comment 7 errata-xmlrpc 2008-07-25 19:07:10 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0799.html