Bug 220211 - multiple qdisk master after network outage
Summary: multiple qdisk master after network outage
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: cman
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Lon Hohberger
QA Contact: Cluster QE
URL:
Whiteboard:
Keywords:
Depends On:
Blocks:
 
Reported: 2006-12-19 18:28 UTC by Frederik Ferner
Modified: 2009-04-16 20:21 UTC
CC List: 4 users

Fixed In Version: RHBA-2007-0134
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2007-05-10 21:04:50 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
Email describing the problem in more detail (4.67 KB, text/plain)
2006-12-19 18:28 UTC, Frederik Ferner
Fix (4.64 KB, patch)
2007-03-20 19:45 UTC, Lon Hohberger


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2007:0134 normal SHIPPED_LIVE cman bug fix update 2007-05-10 21:04:00 UTC

Description Frederik Ferner 2006-12-19 18:28:44 UTC
Description of problem:

On three of my clusters, qdiskd reported more than one master after a network
issue that affected only the clusters' external connections. After the network
came back, I found the following in the log files of both nodes:

<snip>
Dec 15 10:31:45 duoserv2 qdiskd[31393]: <notice> Score sufficient for master
operation (6/3; max=6); upgrading
Dec 15 10:31:46 duoserv2 qdiskd[31393]: <info> Assuming master role
Dec 15 10:31:47 duoserv2 kernel: CMAN: quorum regained, resuming activity
Dec 15 10:31:47 duoserv2 clurgmgrd[7950]: <notice> Quorum Achieved
Dec 15 10:31:47 duoserv2 clurgmgrd[7950]: <info> Magma Event: Membership Change
Dec 15 10:31:47 duoserv2 clurgmgrd[7950]: <info> State change: Local UP
Dec 15 10:31:47 duoserv2 clurgmgrd[7950]: <info> State change: duoserv1 UP
Dec 15 10:31:47 duoserv2 clurgmgrd[7950]: <info> Loading Service Data
Dec 15 10:31:47 duoserv2 ccsd[5595]: Cluster is quorate.  Allowing connections.
Dec 15 10:31:50 duoserv2 clurgmgrd: [7950]: <info> /dev/mapper/logs1-logs1 is
not mounted
Dec 15 10:31:51 duoserv2 qdiskd[31393]: <crit> Critical Error: More than one
master found!
Dec 15 10:31:51 duoserv2 qdiskd[31393]: <crit> A master exists, but it's not me?!
Dec 15 10:31:52 duoserv2 qdiskd[31393]: <info> Node 1 is the master
...

At the same time on the second node:
Dec 15 10:31:45 duoserv1 qdiskd[316]: <notice> Score sufficient for master
operation (5/3; max=6); upgrading
Dec 15 10:31:46 duoserv1 qdiskd[316]: <info> Assuming master role
Dec 15 10:31:47 duoserv1 kernel: CMAN: quorum regained, resuming activity
Dec 15 10:31:47 duoserv1 ccsd[5624]: Cluster is quorate.  Allowing connections.
Dec 15 10:31:47 duoserv1 clurgmgrd[3631]: <notice> Quorum Achieved
Dec 15 10:31:51 duoserv1 qdiskd[316]: <crit> Critical Error: More than one
master found!
Dec 15 10:31:52 duoserv1 qdiskd[316]: <info> Node 2 is the master
Dec 15 10:31:52 duoserv1 qdiskd[316]: <crit> Critical Error: More than one
master found!
<snip>

After restarting qdiskd, everything worked fine again.

Version-Release number of selected component (if applicable):
I have the following packages installed on both nodes:
ccs-1.0.7-0
rgmanager-1.9.54-1
lvm2-cluster-2.02.01-1.2.RHEL4
cman-1.0.11-0
cman-kernel-smp-2.6.9-43.8.5
fence-1.32.25-1
cman-kernel-smp-2.6.9-45.8

How reproducible:
I have not yet tried to reproduce it, but it happened on three clusters at the
same time.

Additional info:
See the attached email to linux-cluster@redhat.com.

If I can provide any other information please ask.

Comment 1 Frederik Ferner 2006-12-19 18:28:44 UTC
Created attachment 144025 [details]
Email describing the problem in more detail

Comment 2 Lon Hohberger 2007-01-04 22:49:41 UTC
Hmm, how close in time are the two systems; are they using NTP?



Comment 3 Frederik Ferner 2007-01-05 07:58:13 UTC
Yes, they are using NTP.

Comment 4 Lon Hohberger 2007-01-08 14:41:50 UTC
It looks like qdiskd doesn't wait long enough for votes before declaring
itself master; it's supposed to wait at least one full cycle after it thinks
it's online before even making a bid for master status.
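The timing rule described above can be sketched as a small state check. This is an illustrative C sketch, not qdiskd's actual internals; the struct fields, state names, and `may_bid_for_master` helper are all hypothetical:

```c
/* Hypothetical sketch of the intended bid timing: a node may only bid for
 * master after completing at least one full polling cycle in the ONLINE
 * state, so it has seen every other node's latest quorum-disk writes first.
 * All names here are illustrative, not qdiskd's real code. */
enum node_state { S_INIT, S_ONLINE, S_MASTER };

struct node {
    enum node_state state;
    int cycles_online;   /* full polling cycles completed since going online */
    int score;           /* current heuristic score */
    int min_score;       /* score required for master operation */
};

/* Returns 1 if this node is allowed to bid for master status. */
int may_bid_for_master(const struct node *n)
{
    if (n->state != S_ONLINE)
        return 0;
    if (n->score < n->min_score)
        return 0;
    /* The bug: bidding with cycles_online == 0, i.e. before one full
     * cycle has elapsed, lets two nodes bid before either has read the
     * other's bid from the quorum disk. */
    return n->cycles_online >= 1;
}
```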

Comment 5 Lon Hohberger 2007-01-22 22:52:51 UTC
Fixes in CVS; RHEL4 / RHEL5 / head branches.

Comment 6 Lon Hohberger 2007-01-23 17:35:45 UTC
Let me know if you'd like test packages.

Comment 8 Frederik Ferner 2007-01-24 08:23:58 UTC
Yes, test packages would be great.

Comment 9 Lon Hohberger 2007-01-24 17:45:20 UTC
http://people.redhat.com/lhh/packages.html

You'll need the cman-kernel update (built against 42.0.3 kernel I think).

Comment 10 Lon Hohberger 2007-03-20 14:11:40 UTC
NOT fixed.  I hit it this morning.

Comment 11 Lon Hohberger 2007-03-20 15:04:10 UTC
Part 1:

Ok, here's the scoop - with the current RHEL4 CVS bits, the master_wait time
(time to wait for NACKs / ACKs) was less than the amount of time used to declare
a node up.

This means that multiple nodes could make bids for master - and because they
would ignore each other's disk-writes until the node was declared up, multiple
nodes could declare themselves master.  This is obviously wrong.

By changing the formulas for calculating the tko_up and master_wait variables
to tko/3 and tko/2 respectively (they were the opposite), we eliminate the
problem. Furthermore, we must restrict master_wait to be a value greater than
tko_up.
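The arithmetic of the fix can be sketched as follows. The variable names follow the comment above, but the init function and clamping step are illustrative, not the exact shipped patch:

```c
/* Sketch of the timing fix described above. Before the fix the two
 * formulas were swapped, so master_wait < tko_up and a node could win
 * its own bid before rival nodes were even declared up. */
struct qd_timing {
    int tko;          /* failure threshold, in polling cycles */
    int tko_up;       /* cycles of hits before a node is declared up */
    int master_wait;  /* cycles to wait for ACKs/NACKs on a master bid */
};

void qd_timing_init(struct qd_timing *t, int tko)
{
    t->tko = tko;
    t->tko_up = tko / 3;       /* was tko / 2 before the fix */
    t->master_wait = tko / 2;  /* was tko / 3 before the fix */
    /* master_wait must exceed tko_up, or a bidder can declare itself
     * master before other nodes have been declared up and can NACK. */
    if (t->master_wait <= t->tko_up)
        t->master_wait = t->tko_up + 1;
}
```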

Part 2:

We require a resolution algorithm in the event that this happens beyond our
control - so that we have predictable behavior in multi-master situations.  The
idea here is that we need all masters to rescind master status gracefully -
forcing a re-election.  This isn't hard, and needs to be done.
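The resolution idea in Part 2 can be sketched like this. This is a minimal illustration of "all masters rescind gracefully, forcing a re-election", not the patch attached below; the function names and state enum are hypothetical:

```c
/* Illustrative sketch of the multi-master resolution idea: when more
 * than one master is seen on the quorum disk, every master steps down,
 * forcing a fresh election in which the normal bid/ACK protocol picks
 * a single winner. Names are hypothetical, not qdiskd's real code. */
enum qd_state { Q_ONLINE, Q_MASTER };

/* Count how many nodes currently claim master status. */
int count_masters(const enum qd_state states[], int n)
{
    int i, masters = 0;
    for (i = 0; i < n; i++)
        if (states[i] == Q_MASTER)
            masters++;
    return masters;
}

/* If more than one master exists, all of them rescind to ONLINE.
 * Returns the number of nodes that stepped down. */
int resolve_multi_master(enum qd_state states[], int n)
{
    int i, rescinded = 0;
    if (count_masters(states, n) <= 1)
        return 0;  /* zero or one master: nothing to resolve */
    for (i = 0; i < n; i++) {
        if (states[i] == Q_MASTER) {
            states[i] = Q_ONLINE;  /* rescind; triggers re-election */
            rescinded++;
        }
    }
    return rescinded;
}
```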

Comment 12 Lon Hohberger 2007-03-20 19:45:47 UTC
Created attachment 150525 [details]
Fix

Comment 13 Chris Feist 2007-03-20 20:55:58 UTC
Adding rhel-4.5 tag so we can build.

Comment 17 Red Hat Bugzilla 2007-05-10 21:04:50 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0134.html


