Description of problem:
When setting up a 2-node cluster I need a way to avoid fencing-race conditions. Since 5.1, qdisk should work with a "master-wins" logic when no heuristic is configured at all, so that, when the heartbeat channel (in my case a crossover cable) fails, only the slave node gets fenced.

Version-Release number of selected component (if applicable):
5.1 - i386 - all errata applied as of 11/9/2007

How reproducible:
always

Steps to Reproduce:
1. Set up a 2-node cluster, using the attached cluster.conf as a reference (the configuration is very simple: 2 nodes, 2 HP iLO fencing devices, 1 quorum disk).
2. Don't use the same network for service and heartbeat traffic (simply use a crossover cable on a secondary interface as the heartbeat channel).
3. Unplug the crossover cable.

Actual results:
Both nodes try to fence each other; as a result both get rebooted at the same time.

Expected results:
Only one node (the master) should try to fence the second one.

Additional info:
Created attachment 252891 [details] Cluster.conf
Pushed for 5.3 http://sources.redhat.com/git/?p=cluster.git;a=commit;h=c1b276c491b0d6e625035b5063532abc3ce23ca4
Comment #3 is wrong; this bug was not fixed; it was pushed to the wrong bugzilla.
Fixing bug state.
Qdiskd normally times out before CMAN, so qdiskd can't change its votes as a function of CMAN transitions, because by then fencing will have already been started.

The simplest thing we can do is make a pseudo fence_qdiskd which hangs forever if we are not the master, and exits successfully if we are. A more complex (to implement) but easier-to-configure solution would be (as Eduardo suggested) to allow fenced and qdiskd to communicate. Unfortunately, there is no API or command for talking with qdiskd, so either of the possible solutions would require designing a qdiskd API.

A workaround exists (though sub-optimal): create a fencing agent which does nothing but sleep a few seconds and add it to *one* of the two cluster nodes. This node will always lose in a network partition (say, a cable pull between the two nodes). E.g.:

#!/bin/sh
# /sbin/fence_sleep - sleep for 5 seconds so we lose
sleep 5
exit 0

---
  <clusternodes>
    <clusternode name="node1">
      <fence>
        <method name="1">
          <device name="delay"/>
          <device name="ilo-node1" .../>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node2">
      <fence>
        <method name="1">
          <device name="ilo-node2" .../>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="delay" agent="/sbin/fence_sleep"/>
    <fencedevice name="ilo-node1" ... />
    <fencedevice name="ilo-node2" ... />
  </fencedevices>
---

Another workaround is to use heuristics (a sketch follows at the end of this comment).

The only use case where the above workaround is really absolutely required to ensure winning the fence race is a network partition in a 2-node cluster using a crossover cable.

Because workarounds exist, I am moving this off to 5.5 for now, with a conditional NAK due to the fact that it will require quite a bit of design work.
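As a sketch of the heuristics workaround mentioned above (the gateway IP and the score/interval/tko values are illustrative placeholders, not taken from the attached cluster.conf): a ping heuristic inside <quorumd> makes the node that loses its uplink drop its own quorum-disk vote, so the partitioned node loses quorum instead of racing to fence.

---
<quorumd device="/dev/sdb1" label="myquorum" interval="2" tko="10" votes="1">
    <!-- each node keeps its score (and thus its qdisk vote) only while
         it can still reach the default gateway -->
    <heuristic program="ping -c1 -w1 192.168.0.254" score="1" interval="2" tko="3"/>
</quorumd>
---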
Why not make it possible to treat the production network as a backup intra-cluster one? Or to configure more than one intra-cluster network?

Is there any way to configure a heuristic where we take the node that is the qdisk master at that moment and let it survive over the other one? Are there any commands for this, apart from parsing log files for lines like:

qdiskd[6238]: <info> Node 2 is the master

(btw I don't know if this appears only with logging enabled or in all cases...)

Thanks
Gianluca
Ok - Eduardo and I have a design for 2-node cluster master-wins which is provably correct; it simply involves the slave not advertising the qdiskd votes to the local instance of CMAN. With correct qdiskd configuration, this nets the following behaviors on a loss of communication:

(1) network outage: the slave has not been advertising its qdiskd votes to CMAN, and therefore loses quorum and is fenced by the master
(2) master died: the slave qdiskd becomes master before CMAN notices the node loss
Created attachment 363004 [details] Pass 1
Any expected release date for these changes or any timing for expected QA tests? Thanks, Gianluca
I used the patch against cman-2.0.115-1.el5_4.3 and it's working fine: in the 2-node cluster with no heuristics defined, the master wins. I'm going to install it on production servers too. I'm looking forward to seeing this in an update for 5.4.
http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=2d6c88823e2f2e663d4499769152cc0d21644d34 Reassigning to component owner for build.
Ok - to apply and test this on my updated RHEL 5.4 test cluster, where do I get the patches for qdisk.5, qdisk.h and qdisk.c? From git yesterday or from the patch file from the end of September? Are they the same, or did other changes come in the meantime?
Sadly I had no way to test what happens if you suddenly power off the master. The expected result is that the remaining node gets the qdisk votes and becomes master without losing quorum (not even for a short time). Can anyone test this?
Federico, yes -- but: if your CMAN timings are wrong and the qdiskd master failover does not occur before CMAN notices that the rebooted master died, you will lose quorum.
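As a rough illustration of the timing relationship (assuming the usual qdisk(5) guidance that CMAN's token timeout should comfortably exceed qdiskd's failure-detection time of interval*tko seconds; the numbers below are only an example, roughly matching the values discussed later in this bug, not a prescription):

---
<!-- qdiskd declares a node dead after interval*tko = 5*16 = 80 seconds -->
<quorumd device="/dev/sdb1" label="myquorum" interval="5" tko="16" votes="1"/>

<!-- give CMAN noticeably longer (values in milliseconds), so the surviving
     qdiskd instance can take over the master role before CMAN expires the node -->
<totem token="162000"/>
<cman quorum_dev_poll="80000" expected_votes="3"/>
---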
Example expected behaviors:

Qdiskd master node:

[root@molly ~]# cman_tool status
Version: 6.2.0
Config Version: 2781
Cluster Name: lolcats
Cluster Id: 13719
Cluster Member: Yes
Cluster Generation: 1284
Membership state: Cluster-Member
Nodes: 2
Expected votes: 3
Quorum device votes: 1
Total votes: 3
Quorum: 2
Active subsystems: 8
Flags: Dirty
Ports Bound: 0
Node name: molly
Node ID: 1
Multicast addresses: 225.0.0.13
Node addresses: 192.168.122.4

Qdiskd non-master:

[root@frederick ~]# cman_tool status
Version: 6.2.0
Config Version: 2781
Cluster Name: lolcats
Cluster Id: 13719
Cluster Member: Yes
Cluster Generation: 1284
Membership state: Cluster-Member
Nodes: 2
Expected votes: 3
Total votes: 2
Quorum: 2
Active subsystems: 8
Flags: Dirty
Ports Bound: 0
Node name: frederick
Node ID: 2
Multicast addresses: 225.0.0.13
Node addresses: 192.168.122.5

Notice the total votes is NOT the same. This is >>CORRECT<< for master-wins.

Failure test #1: hard-poweroff of 'molly', qdiskd master:

Nov 12 14:00:29 frederick qdiskd[1916]: <info> Assuming master role
Nov 12 14:00:30 frederick qdiskd[1916]: <notice> Writing eviction notice for node 1
Nov 12 14:00:30 frederick kernel: dlm: closing connection to node 1
Nov 12 14:00:31 frederick qdiskd[1916]: <notice> Node 1 evicted
Nov 12 14:00:35 frederick openais[2522]: [TOTEM] The token was lost in the OPERATIONAL state.

... this means 'frederick' took over the qdiskd master role before CMAN noticed the master node was dead. This is what we want.
Failure test #2: Kill network between hosts (make sure fencing device is still accessible):

Non-master:

Nov 12 15:02:32 molly openais[2664]: [TOTEM] The token was lost in the OPERATIONAL state.
Nov 12 15:02:32 molly openais[2664]: [TOTEM] Receive multicast socket recv buffer size (258048 bytes).
Nov 12 15:02:32 molly openais[2664]: [TOTEM] Transmit multicast socket send buffer size (258048 bytes).
Nov 12 15:02:32 molly openais[2664]: [TOTEM] entering GATHER state from 2.
Nov 12 15:02:36 molly openais[2664]: [TOTEM] entering GATHER state from 0.
Nov 12 15:03:42 molly syslogd 1.4.1: restart.

Qdiskd master:

Nov 12 15:02:31 frederick openais[2522]: [TOTEM] The token was lost in the OPERATIONAL state.
Nov 12 15:02:31 frederick openais[2522]: [TOTEM] Receive multicast socket recv buffer size (258048 bytes).
Nov 12 15:02:31 frederick openais[2522]: [TOTEM] Transmit multicast socket send buffer size (258048 bytes).
Nov 12 15:02:31 frederick openais[2522]: [TOTEM] entering GATHER state from 2.
Nov 12 15:02:36 frederick openais[2522]: [TOTEM] entering GATHER state from 0.
Nov 12 15:02:36 frederick openais[2522]: [TOTEM] Creating commit token because I am the rep.
Nov 12 15:02:36 frederick openais[2522]: [TOTEM] Saving state aru 23 high seq received 23
Nov 12 15:02:36 frederick openais[2522]: [TOTEM] Storing new sequence id for ring 51c
Nov 12 15:02:36 frederick openais[2522]: [TOTEM] entering COMMIT state.
Nov 12 15:02:36 frederick openais[2522]: [TOTEM] entering RECOVERY state.
Nov 12 15:02:36 frederick openais[2522]: [TOTEM] position [0] member 192.168.122.5:
Nov 12 15:02:36 frederick openais[2522]: [TOTEM] previous ring seq 1304 rep 192.168.122.4
Nov 12 15:02:36 frederick openais[2522]: [TOTEM] aru 23 high delivered 23 received flag 1
Nov 12 15:02:36 frederick openais[2522]: [TOTEM] Did not need to originate any messages in recovery.
Nov 12 15:02:36 frederick openais[2522]: [TOTEM] Sending initial ORF token
Nov 12 15:02:36 frederick fenced[2549]: molly not a cluster member after 0 sec post_fail_delay
Nov 12 15:02:36 frederick kernel: dlm: closing connection to node 1
Nov 12 15:02:36 frederick fenced[2549]: fencing node "molly"
Nov 12 15:02:36 frederick openais[2522]: [CLM ] CLM CONFIGURATION CHANGE
Nov 12 15:02:36 frederick openais[2522]: [CLM ] New Configuration:
Nov 12 15:02:36 frederick openais[2522]: [CLM ] r(0) ip(192.168.122.5)
Nov 12 15:02:36 frederick openais[2522]: [CLM ] Members Left:
Nov 12 15:02:36 frederick openais[2522]: [CLM ] r(0) ip(192.168.122.4)
Nov 12 15:02:36 frederick openais[2522]: [CLM ] Members Joined:
Nov 12 15:02:36 frederick openais[2522]: [CLM ] CLM CONFIGURATION CHANGE
Nov 12 15:02:36 frederick openais[2522]: [CLM ] New Configuration:
Nov 12 15:02:36 frederick openais[2522]: [CLM ] r(0) ip(192.168.122.5)
Nov 12 15:02:36 frederick openais[2522]: [CLM ] Members Left:
Nov 12 15:02:36 frederick openais[2522]: [CLM ] Members Joined:
Nov 12 15:02:36 frederick openais[2522]: [SYNC ] This node is within the primary component and will provide service.
Nov 12 15:02:36 frederick openais[2522]: [TOTEM] entering OPERATIONAL state.
Nov 12 15:02:36 frederick openais[2522]: [CLM ] got nodejoin message 192.168.122.5
Nov 12 15:02:36 frederick openais[2522]: [CPG ] got joinlist message from node 2
Nov 12 15:02:38 frederick fenced[2549]: fence "molly" success
Nov 12 15:02:48 frederick qdiskd[1916]: <notice> Writing eviction notice for node 1
Nov 12 15:02:49 frederick qdiskd[1916]: <notice> Node 1 evicted

Note that qdiskd notices the death -after- CMAN in this case, which is expected behavior. Since qdiskd was still operating on both nodes, it did not notice the other instance of qdiskd going away until after CMAN had expired the node.
(In reply to comment #19)
> Example expected behaviors:
>
> Qdiskd master node:

Clustat output will show the quorum disk as 'Online' only on the qdiskd master node:

[root@molly ~]# clustat
Cluster Status for lolcats @ Thu Nov 12 15:07:43 2009
Member Status: Quorate

 Member Name                  ID   Status
 ------ ----                  ---- ------
 molly                           1 Online, Local
 frederick                       2 Online
 /dev/hdb1                       0 Offline, Quorum Disk

[root@frederick ~]# clustat
Cluster Status for lolcats @ Thu Nov 12 15:05:12 2009
Member Status: Quorate

 Member Name                  ID   Status
 ------ ----                  ---- ------
 molly                           1 Online
 frederick                       2 Online, Local
 /dev/hdb1                       0 Online, Quorum Disk

This is expected behavior per design. Only if the master node fails will the quorum disk become 'Online' on the other cluster member.
Other notes about master_wins mode:

 * two-node clusters only
 * configuration of a heuristic will disable master_wins mode
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~

RHEL 5.5 Beta has been released! There should be a fix present in this release that addresses your request. Please test and report back results here, by March 3rd 2010 (2010-03-03) or sooner.

Upon successful verification of this request, post your results and update the Verified field in Bugzilla with the appropriate value.

If you encounter any issues while testing, please describe them and set this bug into NEED_INFO. If you encounter new defects or have additional patch(es) to request for inclusion, please clone this bug per each request and escalate through your support representative.
Event posted on 02-17-2010 04:04pm JST by tumeya:

HP verified the 5.5 beta.

This event sent from IssueTracker by tumeya, issue 379395
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0266.html
With these rpm versions (the latest ones as of the 5.5 branch):

cman-2.0.115-34.el5_5.1
openais-0.80.6-16.el5_5.2
rgmanager-2.0.52-6.el5

has the situation been reversed again? In fact, with a cluster.conf like this:

<cluster alias="oradwhstud" config_version="6" name="oradwhstud">
  <totem token="162000"/>
  <cman quorum_dev_poll="80000" expected_votes="3" two_node="0"/>
  <fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="20"/>
  ..
  <quorumd device="/dev/mapper/mpath0" interval="5" label="dwhstudquorum" log_facility="local4" log_level="7" tko="16" votes="1">
  </quorumd>
  ...

so without any heuristic, I again get

Quorum device votes: 1

for both nodes.

On the node that is master for the quorum:

Jul 26 18:35:18 oratest1 openais[7313]: [CLM ] got nodejoin message 192.168.16.22
Jul 26 18:35:18 oratest1 openais[7313]: [CLM ] got nodejoin message 192.168.16.21
Jul 26 18:35:18 oratest1 openais[7313]: [CPG ] got joinlist message from node 1
Jul 26 18:35:48 oratest1 qdiskd[7343]: <debug> Node 2 is UP

On the second node started:

Jul 26 18:35:44 oratest2 qdiskd[6644]: <debug> Node 1 is UP
Jul 26 18:35:49 oratest2 qdiskd[6644]: <info> Node 1 is the master
Jul 26 18:35:55 oratest2 openais[6614]: [TOTEM] Retransmit List: 24
Jul 26 18:35:55 oratest2 openais[6614]: [TOTEM] Retransmit List: 27
Jul 26 18:36:00 oratest2 openais[6614]: [TOTEM] Retransmit List: 28
Jul 26 18:36:00 oratest2 openais[6614]: [TOTEM] Retransmit List: 29
Jul 26 18:36:05 oratest2 openais[6614]: [TOTEM] Retransmit List: 2b
Jul 26 18:36:05 oratest2 openais[6614]: [TOTEM] Retransmit List: 2d
Jul 26 18:36:05 oratest2 openais[6614]: [TOTEM] Retransmit List: 30
Jul 26 18:36:05 oratest2 openais[6614]: [TOTEM] Retransmit List: 32
Jul 26 18:36:40 oratest2 qdiskd[6644]: <info> Initial score 1/1
Jul 26 18:36:40 oratest2 qdiskd[6644]: <info> Initialization complete
Jul 26 18:36:40 oratest2 openais[6614]: [CMAN ] quorum device registered
Jul 26 18:36:40 oratest2 qdiskd[6644]: <notice> Score sufficient for master operation (1/1; required=1); upgrading

If I interrupt the intra-cluster network I again get mutual fencing...

Thanks,
Gianluca
Gianluca, you forgot to set master_wins="1" in the <quorumd> tag.
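Based on the <quorumd> line in the previous comment, the fix would presumably just be adding that attribute (everything else unchanged):

---
<quorumd device="/dev/mapper/mpath0" interval="5" label="dwhstudquorum"
         log_facility="local4" log_level="7" tko="16" votes="1" master_wins="1"/>
---

Also keep in mind the note above: configuring a heuristic disables master_wins mode.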
OK, I missed that detail, as it was inside the man page attachment of comment #10... Now I see it is not set by default... Thanks