Bug 220211 - multiple qdisk master after network outage
Status: CLOSED ERRATA
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: cman
Version: 4
Platform: All Linux
Priority: medium  Severity: medium
Assigned To: Lon Hohberger
QA Contact: Cluster QE
Reported: 2006-12-19 13:28 EST by Frederik Ferner
Modified: 2009-04-16 16:21 EDT
Fixed In Version: RHBA-2007-0134
Doc Type: Bug Fix
Last Closed: 2007-05-10 17:04:50 EDT

Attachments:
Email describing the problem in more detail (4.67 KB, text/plain), 2006-12-19 13:28 EST, Frederik Ferner
Fix (4.64 KB, patch), 2007-03-20 15:45 EDT, Lon Hohberger

Description Frederik Ferner 2006-12-19 13:28:44 EST
Description of problem:

On three of my clusters, qdiskd reported more than one master after a network
issue that affected only the clusters' connections to the outside world. After
the network came back, I found the following in the log files of both nodes:

<snip>
Dec 15 10:31:45 duoserv2 qdiskd[31393]: <notice> Score sufficient for master
operation (6/3; max=6); upgrading
Dec 15 10:31:46 duoserv2 qdiskd[31393]: <info> Assuming master role
Dec 15 10:31:47 duoserv2 kernel: CMAN: quorum regained, resuming activity
Dec 15 10:31:47 duoserv2 clurgmgrd[7950]: <notice> Quorum Achieved
Dec 15 10:31:47 duoserv2 clurgmgrd[7950]: <info> Magma Event: Membership Change
Dec 15 10:31:47 duoserv2 clurgmgrd[7950]: <info> State change: Local UP
Dec 15 10:31:47 duoserv2 clurgmgrd[7950]: <info> State change: duoserv1 UP
Dec 15 10:31:47 duoserv2 clurgmgrd[7950]: <info> Loading Service Data
Dec 15 10:31:47 duoserv2 ccsd[5595]: Cluster is quorate.  Allowing connections.
Dec 15 10:31:50 duoserv2 clurgmgrd: [7950]: <info> /dev/mapper/logs1-logs1 is
not mounted
Dec 15 10:31:51 duoserv2 qdiskd[31393]: <crit> Critical Error: More than one
master found!
Dec 15 10:31:51 duoserv2 qdiskd[31393]: <crit> A master exists, but it's not me?!
Dec 15 10:31:52 duoserv2 qdiskd[31393]: <info> Node 1 is the master
...

At the same time on the second node:
Dec 15 10:31:45 duoserv1 qdiskd[316]: <notice> Score sufficient for master
operation (5/3; max=6); upgrading
Dec 15 10:31:46 duoserv1 qdiskd[316]: <info> Assuming master role
Dec 15 10:31:47 duoserv1 kernel: CMAN: quorum regained, resuming activity
Dec 15 10:31:47 duoserv1 ccsd[5624]: Cluster is quorate.  Allowing connections.
Dec 15 10:31:47 duoserv1 clurgmgrd[3631]: <notice> Quorum Achieved
Dec 15 10:31:51 duoserv1 qdiskd[316]: <crit> Critical Error: More than one
master found!
Dec 15 10:31:52 duoserv1 qdiskd[316]: <info> Node 2 is the master
Dec 15 10:31:52 duoserv1 qdiskd[316]: <crit> Critical Error: More than one
master found!
<snip>

After a restart of qdiskd, everything worked fine again.

Version-Release number of selected component (if applicable):
I have the following packages installed on both nodes:
ccs-1.0.7-0
rgmanager-1.9.54-1
lvm2-cluster-2.02.01-1.2.RHEL4
cman-1.0.11-0
cman-kernel-smp-2.6.9-43.8.5
fence-1.32.25-1
cman-kernel-smp-2.6.9-45.8

How reproducible:
I have not yet tried to reproduce it but it happened on three clusters at the
same time.

Additional info:
See the attached email to linux-cluster@redhat.com.

If I can provide any other information please ask.
Comment 1 Frederik Ferner 2006-12-19 13:28:44 EST
Created attachment 144025 [details]
Email describing the problem in more detail
Comment 2 Lon Hohberger 2007-01-04 17:49:41 EST
Hmm, how close in time are the two systems; are they using NTP?

Comment 3 Frederik Ferner 2007-01-05 02:58:13 EST
Yes, they are using NTP.
Comment 4 Lon Hohberger 2007-01-08 09:41:50 EST
It looks like qdiskd doesn't wait long enough for votes before declaring
itself a master; it is supposed to wait at least one full cycle after it
believes it is online before even making a bid for master status.
Comment 5 Lon Hohberger 2007-01-22 17:52:51 EST
Fixes in CVS; RHEL4 / RHEL5 / head branches.
Comment 6 Lon Hohberger 2007-01-23 12:35:45 EST
Let me know if you'd like test packages.
Comment 8 Frederik Ferner 2007-01-24 03:23:58 EST
Yes, test packages would be great.
Comment 9 Lon Hohberger 2007-01-24 12:45:20 EST
http://people.redhat.com/lhh/packages.html

You'll need the cman-kernel update (built against 42.0.3 kernel I think).
Comment 10 Lon Hohberger 2007-03-20 10:11:40 EDT
NOT fixed.  I hit it this morning.
Comment 11 Lon Hohberger 2007-03-20 11:04:10 EDT
Part 1:

Ok, here's the scoop: with the current RHEL4 CVS bits, the master_wait time
(the time to wait for ACKs / NACKs) was less than the amount of time used to
declare a node up.

This means that multiple nodes could make bids for master, and because each
node ignored the others' disk writes until those nodes were declared up,
multiple nodes could declare themselves master. This is obviously wrong.

Changing the formulas for calculating the tko_up and master_wait variables to
tko/3 and tko/2 respectively (they were the opposite) eliminates the problem.
Furthermore, master_wait must be restricted to a value greater than tko_up.
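The timing relationship described above can be sketched roughly as follows.
This is an illustrative sketch, not the actual qdiskd source; the function
names and the clamping behavior are assumptions, but the formulas (tko/3 and
tko/2) and the constraint master_wait > tko_up come from the comment above.

```c
#include <assert.h>

/* Sketch of the corrected qdiskd timing formulas (hypothetical helper
 * names).  tko is the number of missed cycles before a node is
 * declared dead. */

/* Cycles a node must be seen writing before peers declare it up. */
static int calc_tko_up(int tko)
{
    int tko_up = tko / 3;
    if (tko_up < 1)
        tko_up = 1;
    return tko_up;
}

/* Cycles to wait for ACKs/NACKs after bidding for master.  This must
 * exceed tko_up, so a bidding node will see the disk writes of any
 * peer that has just come online and could NACK the bid. */
static int calc_master_wait(int tko)
{
    int master_wait = tko / 2;
    int tko_up = calc_tko_up(tko);

    if (master_wait <= tko_up)
        master_wait = tko_up + 1;
    return master_wait;
}
```

With the old (swapped) formulas, master_wait was tko/3 and tko_up was tko/2,
so a bid could complete before competing nodes were declared up, which is how
both nodes ended up as master.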

Part 2:

We need a resolution algorithm for cases where this happens beyond our
control, so that we have predictable behavior in multi-master situations. The
idea is that all masters must rescind master status gracefully, forcing a
re-election. This isn't hard, and needs to be done.
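The rescind-and-re-elect idea might look like the following sketch. The types
and function name are hypothetical, not qdiskd's actual data structures; it
only illustrates the rule stated above: when a master observes another master,
it steps down so a single re-election can pick one winner.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch of multi-master resolution (not qdiskd source). */

enum node_state { S_MEMBER, S_MASTER };

struct node {
    int id;
    enum node_state state;
};

/* Called when a node reads another node's status block from the quorum
 * disk.  If both this node and the other claim master, this node
 * rescinds its master status gracefully; the caller should then
 * trigger a fresh election.  Returns true when a re-election is
 * needed. */
static bool resolve_master_conflict(struct node *me, const struct node *other)
{
    if (me->state == S_MASTER && other->state == S_MASTER) {
        me->state = S_MEMBER;   /* step down; force re-election */
        return true;
    }
    return false;
}
```

Since every master applies the same rule, all conflicting masters step down
and the normal election then produces exactly one master.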
Comment 12 Lon Hohberger 2007-03-20 15:45:47 EDT
Created attachment 150525 [details]
Fix
Comment 13 Chris Feist 2007-03-20 16:55:58 EDT
Adding rhel-4.5 tag so we can build.
Comment 17 Red Hat Bugzilla 2007-05-10 17:04:50 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0134.html
