Bug 220211 - multiple qdisk master after network outage
Summary: multiple qdisk master after network outage
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: cman
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Lon Hohberger
QA Contact: Cluster QE
URL:
Whiteboard:
Keywords:
Depends On:
Blocks:
 
Reported: 2006-12-19 18:28 UTC by Frederik Ferner
Modified: 2009-04-16 20:21 UTC
CC List: 4 users

Fixed In Version: RHBA-2007-0134
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2007-05-10 21:04:50 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
Email describing the problem in more detail (4.67 KB, text/plain)
2006-12-19 18:28 UTC, Frederik Ferner
Fix (4.64 KB, patch)
2007-03-20 19:45 UTC, Lon Hohberger


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2007:0134 normal SHIPPED_LIVE cman bug fix update 2007-05-10 21:04:00 UTC

Description Frederik Ferner 2006-12-19 18:28:44 UTC
Description of problem:

On three of my clusters, qdiskd reported more than one master after a network
issue that affected only the clusters' external connections. After the network
came back, I found the following in the log files of both nodes:

<snip>
Dec 15 10:31:45 duoserv2 qdiskd[31393]: <notice> Score sufficient for master
operation (6/3; max=6); upgrading
Dec 15 10:31:46 duoserv2 qdiskd[31393]: <info> Assuming master role
Dec 15 10:31:47 duoserv2 kernel: CMAN: quorum regained, resuming activity
Dec 15 10:31:47 duoserv2 clurgmgrd[7950]: <notice> Quorum Achieved
Dec 15 10:31:47 duoserv2 clurgmgrd[7950]: <info> Magma Event: Membership Change
Dec 15 10:31:47 duoserv2 clurgmgrd[7950]: <info> State change: Local UP
Dec 15 10:31:47 duoserv2 clurgmgrd[7950]: <info> State change: duoserv1 UP
Dec 15 10:31:47 duoserv2 clurgmgrd[7950]: <info> Loading Service Data
Dec 15 10:31:47 duoserv2 ccsd[5595]: Cluster is quorate.  Allowing connections.
Dec 15 10:31:50 duoserv2 clurgmgrd: [7950]: <info> /dev/mapper/logs1-logs1 is
not mounted
Dec 15 10:31:51 duoserv2 qdiskd[31393]: <crit> Critical Error: More than one
master found!
Dec 15 10:31:51 duoserv2 qdiskd[31393]: <crit> A master exists, but it's not me?!
Dec 15 10:31:52 duoserv2 qdiskd[31393]: <info> Node 1 is the master
...

At the same time on the second node:
Dec 15 10:31:45 duoserv1 qdiskd[316]: <notice> Score sufficient for master
operation (5/3; max=6); upgrading
Dec 15 10:31:46 duoserv1 qdiskd[316]: <info> Assuming master role
Dec 15 10:31:47 duoserv1 kernel: CMAN: quorum regained, resuming activity
Dec 15 10:31:47 duoserv1 ccsd[5624]: Cluster is quorate.  Allowing connections.
Dec 15 10:31:47 duoserv1 clurgmgrd[3631]: <notice> Quorum Achieved
Dec 15 10:31:51 duoserv1 qdiskd[316]: <crit> Critical Error: More than one
master found!
Dec 15 10:31:52 duoserv1 qdiskd[316]: <info> Node 2 is the master
Dec 15 10:31:52 duoserv1 qdiskd[316]: <crit> Critical Error: More than one
master found!
<snip>

After restarting qdiskd, everything worked fine again.

Version-Release number of selected component (if applicable):
I have the following packages installed on both nodes:
ccs-1.0.7-0
rgmanager-1.9.54-1
lvm2-cluster-2.02.01-1.2.RHEL4
cman-1.0.11-0
cman-kernel-smp-2.6.9-43.8.5
fence-1.32.25-1
cman-kernel-smp-2.6.9-45.8

How reproducible:
I have not yet tried to reproduce it, but it happened on three clusters at the
same time.

Additional info:
See the attached email to linux-cluster@redhat.com.

If I can provide any other information please ask.

Comment 1 Frederik Ferner 2006-12-19 18:28:44 UTC
Created attachment 144025 [details]
Email describing the problem in more detail

Comment 2 Lon Hohberger 2007-01-04 22:49:41 UTC
Hmm, how close in time are the two systems; are they using NTP?



Comment 3 Frederik Ferner 2007-01-05 07:58:13 UTC
Yes, they are using NTP.

Comment 4 Lon Hohberger 2007-01-08 14:41:50 UTC
It looks like qdiskd doesn't wait long enough for votes before declaring
itself master; it's supposed to wait at least one full cycle after it thinks
it's online before even making a bid for master status.
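The timing rule described above can be sketched as a small state check. This is an illustrative C sketch, not qdiskd's actual internals; the struct fields, state names, and `may_bid_for_master` helper are all hypothetical:

```c
/* Hypothetical sketch of the intended bid timing: a node may only bid for
 * master after completing at least one full polling cycle in the ONLINE
 * state, so it has seen every other node's latest quorum-disk writes first.
 * All names here are illustrative, not qdiskd's real code. */
enum node_state { S_INIT, S_ONLINE, S_MASTER };

struct node {
    enum node_state state;
    int cycles_online;   /* full polling cycles completed since going online */
    int score;           /* current heuristic score */
    int min_score;       /* score required for master operation */
};

/* Returns 1 if this node is allowed to bid for master status. */
int may_bid_for_master(const struct node *n)
{
    if (n->state != S_ONLINE)
        return 0;
    if (n->score < n->min_score)
        return 0;
    /* The bug: bidding with cycles_online == 0, i.e. before one full
     * cycle has elapsed, lets two nodes bid before either has read the
     * other's bid from the quorum disk. */
    return n->cycles_online >= 1;
}
```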

Comment 5 Lon Hohberger 2007-01-22 22:52:51 UTC
Fixes in CVS; RHEL4 / RHEL5 / head branches.

Comment 6 Lon Hohberger 2007-01-23 17:35:45 UTC
Let me know if you'd like test packages.

Comment 8 Frederik Ferner 2007-01-24 08:23:58 UTC
Yes, test packages would be great.

Comment 9 Lon Hohberger 2007-01-24 17:45:20 UTC
http://people.redhat.com/lhh/packages.html

You'll need the cman-kernel update (built against 42.0.3 kernel I think).

Comment 10 Lon Hohberger 2007-03-20 14:11:40 UTC
NOT fixed.  I hit it this morning.

Comment 11 Lon Hohberger 2007-03-20 15:04:10 UTC
Part 1:

Ok, here's the scoop - with the current RHEL4 CVS bits, the master_wait time
(time to wait for NACKs / ACKs) was less than the amount of time used to declare
a node up.

This means that multiple nodes could make bids for master - and because they
would ignore each other's disk-writes until the node was declared up, multiple
nodes could declare themselves master.  This is obviously wrong.

By changing the formulas for calculating the tko_up and master_wait variables
to tko/3 and tko/2 respectively (they were the opposite), we eliminate the
problem. Furthermore, we must restrict master_wait to be a value greater than
tko_up.
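The arithmetic of the fix can be sketched as follows. The variable names follow the comment above, but the init function and clamping step are illustrative, not the exact shipped patch:

```c
/* Sketch of the timing fix described above. Before the fix the two
 * formulas were swapped, so master_wait < tko_up and a node could win
 * its own bid before rival nodes were even declared up. */
struct qd_timing {
    int tko;          /* failure threshold, in polling cycles */
    int tko_up;       /* cycles of hits before a node is declared up */
    int master_wait;  /* cycles to wait for ACKs/NACKs on a master bid */
};

void qd_timing_init(struct qd_timing *t, int tko)
{
    t->tko = tko;
    t->tko_up = tko / 3;       /* was tko / 2 before the fix */
    t->master_wait = tko / 2;  /* was tko / 3 before the fix */
    /* master_wait must exceed tko_up, or a bidder can declare itself
     * master before other nodes have been declared up and can NACK. */
    if (t->master_wait <= t->tko_up)
        t->master_wait = t->tko_up + 1;
}
```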

Part 2:

We require a resolution algorithm in the event that this happens beyond our
control - so that we have predictable behavior in multi-master situations.  The
idea here is that we need all masters to rescind master status gracefully -
forcing a re-election.  This isn't hard, and needs to be done.
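The resolution idea in Part 2 can be sketched like this. This is a minimal illustration of "all masters rescind gracefully, forcing a re-election", not the patch attached below; the function names and state enum are hypothetical:

```c
/* Illustrative sketch of the multi-master resolution idea: when more
 * than one master is seen on the quorum disk, every master steps down,
 * forcing a fresh election in which the normal bid/ACK protocol picks
 * a single winner. Names are hypothetical, not qdiskd's real code. */
enum qd_state { Q_ONLINE, Q_MASTER };

/* Count how many nodes currently claim master status. */
int count_masters(const enum qd_state states[], int n)
{
    int i, masters = 0;
    for (i = 0; i < n; i++)
        if (states[i] == Q_MASTER)
            masters++;
    return masters;
}

/* If more than one master exists, all of them rescind to ONLINE.
 * Returns the number of nodes that stepped down. */
int resolve_multi_master(enum qd_state states[], int n)
{
    int i, rescinded = 0;
    if (count_masters(states, n) <= 1)
        return 0;  /* zero or one master: nothing to resolve */
    for (i = 0; i < n; i++) {
        if (states[i] == Q_MASTER) {
            states[i] = Q_ONLINE;  /* rescind; triggers re-election */
            rescinded++;
        }
    }
    return rescinded;
}
```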

Comment 12 Lon Hohberger 2007-03-20 19:45:47 UTC
Created attachment 150525 [details]
Fix

Comment 13 Chris Feist 2007-03-20 20:55:58 UTC
Adding rhel-4.5 tag so we can build.

Comment 17 Red Hat Bugzilla 2007-05-10 21:04:50 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0134.html


