Bug 442541

Summary:

QDisk freezes cluster when FC is disconnected

Product:

Red Hat Enterprise Linux 5

Reporter:

Mattia Gandolfi <mgandolf>

Component:

cman

Assignee:

Lon Hohberger <lhh>

Status:

CLOSED ERRATA

QA Contact:

GFS Bugs <gfs-bugs>

Severity:

high

Priority:

medium

Version:

5.1

CC:

clasohm, cluster-maint, edamato, tao, tmarshal

Hardware:

i386

OS:

Linux

Doc Type:

Bug Fix

Last Closed:

2009-01-20 21:51:34 UTC

Attachments:

Description	Flags
First node's sosreport	none
Second node's sosreport	none
Fix.	none

Description Mattia Gandolfi 2008-04-15 14:05:34 UTC

Description of problem:
In a two-node cluster using qdisk as a tie-breaker, disconnecting all Fibre
Channel cables from one node (let's call it "node1") has two results:
1) node1's cman gets killed by node2
2) node2 gets stuck and does not take over the service

Version-Release number of selected component (if applicable):
cman-2.0.73-1.el5_1.5-i386

How reproducible:
Always

Steps to Reproduce:
1. Configure a simple two-node cluster with a quorum disk (qdisk) enabled
2. Unplug the FC cables connecting one node (say node1) to the qdisk

Actual results:
On node1 (from /var/log/messages):
openais[2870]: [CMAN ] cman killed by node 2 because we were killed by cman_tool or other application
On node2:
qdiskd[2973]: <notice> Writing eviction notice for node 2
qdiskd[2973]: <notice> Node 2 evicted
qdiskd[2973]: <crit> Node 2 is undead.
qdiskd[2973]: <alert> Writing eviction notice for node 2
qdiskd[2973]: <crit> Node 2 is undead.
qdiskd[2973]: <alert> Writing eviction notice for node 2
qdiskd[2973]: <crit> Node 2 is undead.
...and here it gets stuck forever

Expected results:
node2 should have fenced node1 and should have brought up the service

Additional info:
Reconnecting the FC cables and manually resetting both nodes produces a clean
start and a working cluster. However, unplugging the cables again always
reproduces the problem.
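
For anyone checking whether a node is in the same state, the repeating "is undead" messages quoted under "Actual results" can be counted with a short script. This is only a sketch: the regular expression is derived solely from the three qdiskd lines shown in this report, and the function names and the threshold of 3 are arbitrary choices for illustration, not anything qdiskd defines.

```python
import re
from collections import Counter

# Pattern inferred from the log lines quoted in this report, e.g.
#   qdiskd[2973]: <crit> Node 2 is undead.
# Other qdiskd versions may word the message differently (assumption).
UNDEAD_RE = re.compile(r"qdiskd\[\d+\]: <\w+> Node (\d+) is undead\.")


def count_undead(lines):
    """Return a Counter mapping node ID to the number of 'is undead' messages."""
    counts = Counter()
    for line in lines:
        match = UNDEAD_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


def stuck_nodes(lines, threshold=3):
    """Return sorted node IDs reported undead at least `threshold` times.

    The threshold is a hypothetical cut-off chosen for this example.
    """
    counts = count_undead(lines)
    return sorted(node for node, seen in counts.items() if seen >= threshold)


# Log excerpt copied from the "Actual results" section of this report.
SAMPLE = [
    "qdiskd[2973]: <notice> Writing eviction notice for node 2",
    "qdiskd[2973]: <notice> Node 2 evicted",
    "qdiskd[2973]: <crit> Node 2 is undead.",
    "qdiskd[2973]: <alert> Writing eviction notice for node 2",
    "qdiskd[2973]: <crit> Node 2 is undead.",
    "qdiskd[2973]: <alert> Writing eviction notice for node 2",
    "qdiskd[2973]: <crit> Node 2 is undead.",
]

if __name__ == "__main__":
    print(stuck_nodes(SAMPLE))
```

To run it against a live system, pass an open file object for /var/log/messages to stuck_nodes() instead of SAMPLE; both functions accept any iterable of lines.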

Comment 1 Mattia Gandolfi 2008-04-15 14:05:34 UTC

Created attachment 302455
Cluster.conf

Comment 4 Lon Hohberger 2008-04-15 14:58:45 UTC

Created attachment 302467
Fix.

Comment 5 Lon Hohberger 2008-04-15 17:51:31 UTC

Note - Fix is not in 5.2.

Comment 6 Lon Hohberger 2008-04-17 14:18:30 UTC

The fix is in the RHEL5 branch of git and has already been applied to stable2 and master.

Comment 10 errata-xmlrpc 2009-01-20 21:51:34 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0189.html