User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.4) Gecko/2008111217 Fedora/3.0.4-1.fc9 Firefox/3.0.4

A cluster environment set up with a qdisk heuristic goes into a fence race if the heartbeat link goes down (unplug cable). There is clearly only 1 master in this configuration. However, the master does not win and both nodes fence each other off.

--------------------------------------------------------------------------------
Nov 14 11:44:08 pe1950-3 qdiskd[6193]: <info> Assuming master role
Nov 14 12:00:57 pe1950-3 qdiskd[6193]: <notice> Writing eviction notice for node 2

Nov 14 11:44:03 pe1950-4 qdiskd[5857]: <notice> Score sufficient for master operation (1/1; required=1); upgrading
Nov 14 11:44:09 pe1950-4 qdiskd[5857]: <info> Node 1 is the master
Nov 14 12:08:45 pe1950-4 qdiskd[5605]: <info> Quorum Daemon Initializing
--------------------------------------------------------------------------------
Nov 14 11:43:54 pe1950-3 fenced: startup succeeded
Nov 14 12:00:42 pe1950-3 fenced[6203]: pe1950-4-hb not a cluster member after 100 sec post_fail_delay
Nov 14 12:00:42 pe1950-3 fenced[6203]: fencing node "pe1950-4-hb"
Nov 14 12:00:51 pe1950-3 fenced[6203]: fence "pe1950-4-hb" success

Nov 14 11:43:54 pe1950-4 fenced: startup succeeded
Nov 14 12:00:42 pe1950-4 fenced[5867]: pe1950-3-hb not a cluster member after 100 sec post_fail_delay
Nov 14 12:00:42 pe1950-4 fenced[5867]: fencing node "pe1950-3-hb"
Nov 14 12:09:06 pe1950-4 fenced[5615]: pe1950-3-hb not a cluster member after 3 sec post_join_delay
Nov 14 12:09:06 pe1950-4 fenced[5615]: fencing node "pe1950-3-hb"
Nov 14 12:09:17 pe1950-4 fenced[5615]: fence "pe1950-3-hb" success
--------------------------------------------------------------------------------
Nov 14 12:01:25 pe1950-3 qdiskd[6193]: <crit> Node 2 is undead.
Nov 14 12:01:25 pe1950-3 qdiskd[6193]: <alert> Writing eviction notice for node 2
Nov 14 12:01:26 pe1950-3 root: Time Stamp: Fri Nov 14 12:01:25 2008
Node ID: 1
Score: 1/1 (Minimum required = 1)
Current state: Master
Initializing Set: { }
Visible Set: {1 }
Master Node ID: 1
Quorate Set: { 1 }
Nov 14 12:01:26 pe1950-3 qdiskd[6193]: <crit> Node 2 is undead.
Nov 14 12:01:26 pe1950-3 qdiskd[6193]: <alert> Writing eviction notice for node 2

Nov 14 12:00:41 pe1950-4 root: Time Stamp: Fri Nov 14 12:00:40 2008
Node ID: 2
Score: 1/1 (Minimum required = 1)
Current state: Running
Initializing Set: { }
Visible Set: { 1 2 }
Master Node ID: 1
Quorate Set: { 1 }
Nov 14 12:00:42 pe1950-4 fenced[5867]: pe1950-3-hb not a cluster member after 100 sec post_fail_delay
Nov 14 12:00:42 pe1950-4 fenced[5867]: fencing node "pe1950-3-hb"
Nov 14 12:00:42 pe1950-4 ccsd[5792]: Cluster is not quorate. Refusing connection.
Nov 14 12:00:42 pe1950-4 ccsd[5792]: Error while processing connect: Connection refused

Reproducible: Always

Steps to Reproduce:
1. Set up a cluster with qdisk
2. Shut down the heartbeat network on both nodes at the same time

Actual Results:
Both nodes try to fence each other off.

Expected Results:
Both nodes should see that the network is down. The qdisk master should fence the other node off to prevent a race condition.

This issue looks identical to the bz for rhel5:
https://bugzilla.redhat.com/show_bug.cgi?id=372901
Created attachment 324493 [details] sosreport for node1
Created attachment 324494 [details] sosreport for node2
First of all, this is a feature request. While I believe this is a reasonable course of action, there is no current master-wins behavior in the feature set of qdiskd if no heuristics are present.

The only way to do this cleanly is to interrupt the fencing operation in the non-master node. Since CMAN decides on a new membership view before the fencing operation takes place, the only method to ensure this works is to notify qdiskd that CMAN has decided to fence and to have qdiskd do something based on:

- whether or not a master exists
- whether or not the other node exists, and
- if a master exists, which node is master

Some possible solutions as well as a workaround are here:
https://bugzilla.redhat.com/show_bug.cgi?id=372901#c7

Since administrators cannot control which node is the qdiskd master (nor will this be an option), a workaround causing a node to hang will provide predictable behavior in a network partition, more so than an implementation of master-wins.
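
To make the three inputs listed above concrete, here is a rough sketch of how a master-wins decision could look. This is only an illustration under assumptions: qdiskd has no such hook today, and the names `Action`, `on_fence_decision`, `my_id`, `master_id` and `peer_visible` are all hypothetical. It also assumes a two-node cluster, and that "do something" means holding back fencing on the non-master node.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    FENCE = "fence"  # let the fencing operation go ahead
    HOLD = "hold"    # interrupt fencing on this node; the master wins


def on_fence_decision(my_id: int, master_id: Optional[int],
                      peer_visible: bool) -> Action:
    """Hypothetical callback run when CMAN has decided to fence the peer."""
    # No master exists: nothing to arbitrate with, keep current behavior.
    if master_id is None:
        return Action.FENCE
    # Peer is not seen on the quorum disk: it is really gone, fence it.
    if not peer_visible:
        return Action.FENCE
    # Both nodes are alive on the quorum disk: only the master may fence.
    if master_id == my_id:
        return Action.FENCE
    return Action.HOLD
```

With the state from the logs in this report (master is node 1, both nodes still writing to the quorum disk), `on_fence_decision(1, 1, True)` gives `Action.FENCE` on pe1950-3 and `on_fence_decision(2, 1, True)` gives `Action.HOLD` on pe1950-4, so only one fence operation would be issued.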
https://bugzilla.redhat.com/show_bug.cgi?id=372901#c9 ^^ simple design