Bug 472782 - Master in qdisk does not win and both nodes are fenced off in race condition
Status: CLOSED WONTFIX
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: cman
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Assigned To: Lon Hohberger
QA Contact: Cluster QE
Reported: 2008-11-24 11:06 EST by Shane Bradley
Modified: 2016-04-26 09:57 EDT
CC: 6 users

Doc Type: Bug Fix
Last Closed: 2010-05-11 13:05:21 EDT

Attachments
sosreport for node1 (1.14 MB, application/x-bzip2), attached 2008-11-24 11:08 EST by Shane Bradley
sosreport for node2 (566.97 KB, application/x-bzip2), attached 2008-11-24 11:09 EST by Shane Bradley

Description Shane Bradley 2008-11-24 11:06:38 EST
User-Agent:       Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.4) Gecko/2008111217 Fedora/3.0.4-1.fc9 Firefox/3.0.4

A cluster environment set up with a qdisk heuristic goes into a fence
race if the heartbeat link goes down (e.g., the cable is unplugged).

There is clearly only one master in this configuration. However, the
master does not win, and both nodes fence each other off.
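
For context, a minimal sketch of the kind of configuration involved is
below (a cluster.conf excerpt). The device label, heuristic command, and
interval/tko values are assumptions for illustration, not taken from the
attached sosreports; the single one-point heuristic with min_score=1
matches the "Score: 1/1 (Minimum required = 1)" lines in the logs.

<!-- Illustrative qdisk configuration for a two-node cluster; the label,
     heuristic command, and timing values are assumptions, not values
     from this cluster's sosreports. -->
<quorumd interval="1" tko="10" votes="1" label="qdisk" min_score="1">
    <!-- One heuristic worth one point, hence "Score: 1/1" in the logs -->
    <heuristic program="ping -c1 -w1 192.168.0.254" score="1" interval="2"/>
</quorumd>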

--------------------------------------------------------------------------------

Nov 14 11:44:08 pe1950-3 qdiskd[6193]: <info> Assuming master role
Nov 14 12:00:57 pe1950-3 qdiskd[6193]: <notice> Writing eviction notice for node 2

Nov 14 11:44:03 pe1950-4 qdiskd[5857]: <notice> Score sufficient for master operation (1/1; required=1); upgrading
Nov 14 11:44:09 pe1950-4 qdiskd[5857]: <info> Node 1 is the master
Nov 14 12:08:45 pe1950-4 qdiskd[5605]: <info> Quorum Daemon Initializing

----------------------------------------------------------------------------------

Nov 14 11:43:54 pe1950-3 fenced: startup succeeded
Nov 14 12:00:42 pe1950-3 fenced[6203]: pe1950-4-hb not a cluster member after 100 sec post_fail_delay
Nov 14 12:00:42 pe1950-3 fenced[6203]: fencing node "pe1950-4-hb"
Nov 14 12:00:51 pe1950-3 fenced[6203]: fence "pe1950-4-hb" success

Nov 14 11:43:54 pe1950-4 fenced: startup succeeded
Nov 14 12:00:42 pe1950-4 fenced[5867]: pe1950-3-hb not a cluster member after 100 sec post_fail_delay
Nov 14 12:00:42 pe1950-4 fenced[5867]: fencing node "pe1950-3-hb"
Nov 14 12:09:06 pe1950-4 fenced[5615]: pe1950-3-hb not a cluster member after 3 sec post_join_delay
Nov 14 12:09:06 pe1950-4 fenced[5615]: fencing node "pe1950-3-hb"
Nov 14 12:09:17 pe1950-4 fenced[5615]: fence "pe1950-3-hb" success

--------------------------------------------------------------------------------
Nov 14 12:01:25 pe1950-3 qdiskd[6193]: <crit> Node 2 is undead.
Nov 14 12:01:25 pe1950-3 qdiskd[6193]: <alert> Writing eviction notice for node 2
Nov 14 12:01:26 pe1950-3 root: Time Stamp: Fri Nov 14 12:01:25 2008 Node ID: 1 Score: 1/1 (Minimum required = 1) Current state: Master Initializing Set: { } Visible Set: {1 } Master Node ID: 1 Quorate Set: { 1 }
Nov 14 12:01:26 pe1950-3 qdiskd[6193]: <crit> Node 2 is undead.
Nov 14 12:01:26 pe1950-3 qdiskd[6193]: <alert> Writing eviction notice for node 2

Nov 14 12:00:41 pe1950-4 root: Time Stamp: Fri Nov 14 12:00:40 2008 Node ID: 2 Score: 1/1 (Minimum required = 1) Current state: Running Initializing Set: { } Visible Set: { 1 2 } Master Node ID: 1 Quorate Set: { 1 }
Nov 14 12:00:42 pe1950-4 fenced[5867]: pe1950-3-hb not a cluster member after 100 sec post_fail_delay
Nov 14 12:00:42 pe1950-4 fenced[5867]: fencing node "pe1950-3-hb"
Nov 14 12:00:42 pe1950-4 ccsd[5792]: Cluster is not quorate.  Refusing connection.
Nov 14 12:00:42 pe1950-4 ccsd[5792]: Error while processing connect: Connection refused


Reproducible: Always

Steps to Reproduce:
1. Set up a cluster with qdisk.
2. Shut down the heartbeat network on both nodes at the same time.
Actual Results:  
Both nodes try to fence each other off.

Expected Results:  
Both nodes should see that the network is down.
The master in qdisk should fence the other node off, preventing a fence race.

This issue looks identical to the following RHEL 5 bug:
https://bugzilla.redhat.com/show_bug.cgi?id=372901
Comment 1 Shane Bradley 2008-11-24 11:08:29 EST
Created attachment 324493 [details]
sosreport for node1
Comment 2 Shane Bradley 2008-11-24 11:09:05 EST
Created attachment 324494 [details]
sosreport for node2
Comment 4 Lon Hohberger 2009-05-11 15:04:17 EDT
First of all, this is a feature request.  While I believe master-wins is a
reasonable course of action, there is currently no master-wins behavior in
qdiskd's feature set when no heuristics are present.

The only way to do this cleanly is to interrupt the fencing operation on the
non-master node.

Since CMAN decides on a new membership view prior to the fencing operation
taking place, the only way to ensure this works is to notify qdiskd that CMAN
has decided to fence, and to have qdiskd act (as sketched below) based on:

- whether or not a master exists,
- whether or not the other node exists, and
- if a master exists, which node is the master.
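
A minimal sketch of that decision logic in C follows. The hook, type, and
function names here are hypothetical (qdiskd exposes no such notification
interface today); it only illustrates the tie-break the list above
describes, not an actual implementation.

/*
 * Hypothetical sketch, not the actual qdiskd API: assume CMAN could
 * notify qdiskd that it intends to fence `victim`, and qdiskd answers
 * whether the local node may proceed or must hold off.
 */
#include <stdbool.h>

enum fence_action {
    FENCE_PROCEED,   /* let fenced carry out the operation       */
    FENCE_HOLD       /* suppress/delay fencing on this node      */
};

/* View qdiskd already maintains from its quorum-disk heartbeats. */
struct qd_view {
    int  local_node_id;
    int  master_node_id;    /* 0 if no master currently exists   */
    bool victim_on_disk;    /* victim still updating its block   */
};

static enum fence_action
qd_fence_decision(const struct qd_view *v, int victim)
{
    if (v->master_node_id == 0)
        return FENCE_PROCEED;   /* no master, so no tie-break possible    */

    if (v->master_node_id == v->local_node_id)
        return FENCE_PROCEED;   /* the master always wins the race        */

    if (v->master_node_id == victim && v->victim_on_disk)
        return FENCE_HOLD;      /* non-master yields to a live master     */

    return FENCE_PROCEED;       /* master dead on disk; fence normally    */
}

The point of the tie-break is that at most one node can hold the master
role on the quorum disk at a time, so at most one node proceeds with
fencing during the partition.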

Some possible solutions as well as a workaround are here:

https://bugzilla.redhat.com/show_bug.cgi?id=372901#c7

Since administrators cannot control which node is the qdiskd master (nor will this ever be an option), a workaround that causes a node to hang will provide predictable behavior in a network partition, more so than an implementation of master-wins.
Comment 7 Lon Hohberger 2009-07-13 09:39:45 EDT
https://bugzilla.redhat.com/show_bug.cgi?id=372901#c9

^^ simple design
