Red Hat Bugzilla – Bug 220211
multiple qdisk master after network outage
Last modified: 2009-04-16 16:21:38 EDT
Description of problem:
On three of my clusters qdiskd reported more than one master after a network
issue affecting only the connection to the outside of the cluster. After the
network came back I found the following in the logfiles of both nodes:
Dec 15 10:31:45 duoserv2 qdiskd: <notice> Score sufficient for master operation (6/3; max=6); upgrading
Dec 15 10:31:46 duoserv2 qdiskd: <info> Assuming master role
Dec 15 10:31:47 duoserv2 kernel: CMAN: quorum regained, resuming activity
Dec 15 10:31:47 duoserv2 clurgmgrd: <notice> Quorum Achieved
Dec 15 10:31:47 duoserv2 clurgmgrd: <info> Magma Event: Membership Change
Dec 15 10:31:47 duoserv2 clurgmgrd: <info> State change: Local UP
Dec 15 10:31:47 duoserv2 clurgmgrd: <info> State change: duoserv1 UP
Dec 15 10:31:47 duoserv2 clurgmgrd: <info> Loading Service Data
Dec 15 10:31:47 duoserv2 ccsd: Cluster is quorate. Allowing connections.
Dec 15 10:31:50 duoserv2 clurgmgrd: : <info> /dev/mapper/logs1-logs1 is
Dec 15 10:31:51 duoserv2 qdiskd: <crit> Critical Error: More than one
Dec 15 10:31:51 duoserv2 qdiskd: <crit> A master exists, but it's not me?!
Dec 15 10:31:52 duoserv2 qdiskd: <info> Node 1 is the master
At the same time on the second node:
Dec 15 10:31:45 duoserv1 qdiskd: <notice> Score sufficient for master operation (5/3; max=6); upgrading
Dec 15 10:31:46 duoserv1 qdiskd: <info> Assuming master role
Dec 15 10:31:47 duoserv1 kernel: CMAN: quorum regained, resuming activity
Dec 15 10:31:47 duoserv1 ccsd: Cluster is quorate. Allowing connections.
Dec 15 10:31:47 duoserv1 clurgmgrd: <notice> Quorum Achieved
Dec 15 10:31:51 duoserv1 qdiskd: <crit> Critical Error: More than one
Dec 15 10:31:52 duoserv1 qdiskd: <info> Node 2 is the master
Dec 15 10:31:52 duoserv1 qdiskd: <crit> Critical Error: More than one
After a restart of qdiskd everything is working fine again.
Version-Release number of selected component (if applicable):
I have the following packages installed on both nodes
I have not yet tried to reproduce it but it happened on three clusters at the
see attached Email to firstname.lastname@example.org
If I can provide any other information please ask.
Created attachment 144025
Email describing the problem in more detail
Hmm, how close in time are the two systems? Are they using NTP?
Yes, they are using NTP.
It looks like qdiskd doesn't wait long enough for votes before declaring
itself a master; it's supposed to wait at least one full cycle after deciding
it's online before making any bid for master status.
Fixes in CVS; RHEL4 / RHEL5 / head branches.
Let me know if you'd like test packages.
Yes, test packages would be great.
You'll need the cman-kernel update (built against 42.0.3 kernel I think).
NOT fixed. I hit it this morning.
Ok, here's the scoop - with the current RHEL4 CVS bits, the master_wait time
(time to wait for NACKs / ACKs) was less than the amount of time used to declare
a node up.
This means that multiple nodes could make bids for master - and because they
would ignore each other's disk-writes until the node was declared up, multiple
nodes could declare themselves master. This is obviously wrong.
By changing the formulas so that tko_up is tko/3 and master_wait is tko/2
(they were previously the other way around), we eliminate the problem.
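The timing relationship described above can be sketched in a few lines of Python. This is an illustration only, not the qdiskd source: the function names are hypothetical, integer division is assumed, and both values are treated as counts of qdiskd cycles. The "safe" check (master_wait strictly greater than tko_up) is likewise an assumption drawn from the explanation above.

```python
from typing import Tuple


def old_timings(tko: int) -> Tuple[int, int]:
    """Return (tko_up, master_wait) using the swapped formulas that caused the bug."""
    return tko // 2, tko // 3


def fixed_timings(tko: int) -> Tuple[int, int]:
    """Return (tko_up, master_wait) using the corrected formulas."""
    return tko // 3, tko // 2


def bid_window_is_safe(tko_up: int, master_wait: int) -> bool:
    """A bidder must wait longer than it takes to declare a peer up.

    Otherwise two nodes can finish bidding while still ignoring each
    other's disk writes, and both assume the master role.
    """
    return master_wait > tko_up


if __name__ == "__main__":
    tko = 10  # hypothetical value, chosen only for illustration
    for name, fn in (("old", old_timings), ("fixed", fixed_timings)):
        tko_up, master_wait = fn(tko)
        print(name, tko_up, master_wait, bid_window_is_safe(tko_up, master_wait))
```

With a tko of 10, the old formulas give tko_up=5 and master_wait=3 (unsafe), while the corrected ones give tko_up=3 and master_wait=5 (safe). Under these assumptions, very small tko values can make the two results equal (for example, tko=3 yields 1 and 1), so the check is worth running on the configured value.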
Furthermore, we must restrict the master_wait to be a value greater than the
We require a resolution algorithm in the event that this happens beyond our
control - so that we have predictable behavior in multi-master situations. The
idea here is that we need all masters to rescind master status gracefully -
forcing a re-election. This isn't hard, and needs to be done.
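As a rough sketch of the resolution idea above (not the actual patch): if more than one node claims master, every claimant rescinds and a re-election is flagged. The function name and the dictionary-based state are hypothetical, and the election itself is deliberately left out.

```python
from typing import Dict, Tuple


def resolve_multi_master(is_master: Dict[int, bool]) -> Tuple[Dict[int, bool], bool]:
    """Return (new_state, reelection_needed).

    is_master maps a node ID to whether that node currently claims the
    master role. With zero or one master the state is returned unchanged.
    With two or more, all nodes drop the role so that a fresh election
    can pick exactly one master.
    """
    masters = [node for node, claims in is_master.items() if claims]
    if len(masters) <= 1:
        return dict(is_master), False
    # Every master rescinds gracefully; no node keeps the role.
    return {node: False for node in is_master}, True


if __name__ == "__main__":
    # Both nodes claim master, as in the logs from duoserv1 and duoserv2.
    print(resolve_multi_master({1: True, 2: True}))
```

Having all masters step down, rather than picking a winner among them, keeps the behavior predictable: the normal election path is the only way a node becomes master.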
Created attachment 150525
Adding rhel-4.5 tag so we can build.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.