Bug 372901 - Qdisk without heuristic in a 2-nodes cluster: "master-wins" mode not working properly
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cman
5.4
i386 Linux
medium Severity medium
Assigned To: Christine Caulfield
Cluster QE
Depends On:
Blocks:
Reported: 2007-11-09 10:31 EST by Mattia Gandolfi
Modified: 2016-04-26 11:17 EDT (History)
10 users

See Also:
Fixed In Version: cman-2.0.115-15.el5.src.rpm
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2010-03-30 04:40:23 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
Cluster.conf (1.75 KB, text/plain)
2007-11-09 10:31 EST, Mattia Gandolfi
Pass 1 (2.77 KB, patch)
2009-09-29 09:36 EDT, Lon Hohberger

Description Mattia Gandolfi 2007-11-09 10:31:58 EST
Description of problem:
When setting up a two-node cluster I need a way to avoid fencing-race conditions.
Since 5.1, qdisk is supposed to use "master-wins" logic when no heuristic is
configured at all, so that when the heartbeat channel (in my case a crossover
cable) fails, only the slave node gets fenced.

Version-Release number of selected component (if applicable):
5.1 - i386 - all errata applied as of 11/9/2007

How reproducible:
always

Steps to Reproduce:
1. set up a two-node cluster, using the attached cluster.conf as a reference
(the configuration is very simple: 2 nodes, 2 HP iLO fencing devices, 1 quorum disk)
2. don't use the same network for service and heartbeat traffic (simply use a
crossover cable on a secondary interface as the heartbeat channel)
3. unplug the crossover cable
  
Actual results:
Both nodes try to fence each other; as a result, both get rebooted at the same time

Expected results:
Only one node (the Master) should try to fence the second one

Additional info:
Comment 1 Mattia Gandolfi 2007-11-09 10:31:58 EST
Created attachment 252891 [details]
Cluster.conf
Comment 5 Lon Hohberger 2008-10-22 16:23:06 EDT
Comment #3 is wrong; this bug was not fixed; it was pushed to the wrong bugzilla.
Comment 6 Lon Hohberger 2008-10-22 16:26:40 EDT
Fixing bug state.
Comment 7 Lon Hohberger 2009-04-01 15:34:03 EDT
Qdiskd normally times out before CMAN, so qdiskd can't change its votes as a function of CMAN transitions, because by then, fencing will have already been started.

The simplest thing we can do is make a pseudo fence_qdiskd agent which hangs forever if we are not the master, and exits successfully if we are.

A more complex (to implement) but easier-to-configure solution would be (as Eduardo suggested) to allow fenced and qdiskd to communicate.

Unfortunately, there is no API or command for talking with qdiskd, so both of the possible solutions would have to have a qdiskd API designed.

A workaround exists (though sub-optimal):

Create a fencing agent which does nothing but sleep a few seconds, and add it to *one* of the two cluster nodes.  This node will always lose in a network partition (e.g. a cable pull between the two nodes).

e.g.

  #!/bin/sh  
  # /sbin/fence_sleep - sleep for 5 seconds so we lose
  sleep 5
  exit 0

---
  <clusternodes>
    <clusternode name="node1">
      <fence>
        <method name="1">
          <device name="delay"/>
          <device name="ilo-node1" .../>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node2">
      <fence>
        <method name="1">
          <device name="ilo-node2" .../>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="delay" agent="/sbin/fence_sleep"/>
    <fencedevice name="ilo-node1" ... />
    <fencedevice name="ilo-node2" ... />
  </fencedevices>
---

Another workaround is to use heuristics.
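For illustration, a heuristic of the kind qdisk(5) describes typically pings a network point both nodes should be able to reach. A minimal sketch (the label, the score/interval/tko values, and the gateway address 192.168.0.254 are placeholders, not taken from this bug):

```xml
<quorumd interval="1" tko="10" votes="1" label="myqdisk">
    <!-- Node keeps its heuristic score only while it can reach the gateway -->
    <heuristic program="ping -c1 -w1 192.168.0.254" score="1" interval="2" tko="4"/>
</quorumd>
```

A node that loses its heuristic score is evicted, which breaks the fence-race symmetry.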

The only case where the fence_sleep workaround is absolutely required to guarantee winning the fence race is a network partition in a two-node cluster using a crossover cable.

Because workarounds exist, I am moving this off to 5.5 for now, with a conditional NAK, because it will require quite a bit of design work.
Comment 8 Gianluca Cecchi 2009-05-19 05:30:10 EDT
Why not allow the production network to act as a backup intra-cluster network? Or allow configuring more than one intra-cluster network?
Is there any way to configure a heuristic that finds which node is the qdisk master at that moment and lets it survive over the other one? Is there any command for this, apart from parsing log files for lines like
qdiskd[6238]: <info> Node 2 is the master 

(btw, I don't know if this appears only with logging enabled or in all cases...)
Thanks
Gianluca
Comment 9 Lon Hohberger 2009-07-09 14:49:03 EDT
Ok - Eduardo and I have a design for two-node cluster master-wins which is provably correct; it simply involves the slave not advertising the qdiskd votes to its local instance of CMAN.

With correct qdiskd configuration, this nets the following behaviors in a loss of communication:

(1) network outage: slave has not been advertising its qdiskd votes to cman, and therefore loses quorum and is fenced by the master

(2) master died: slave qdiskd becomes master before CMAN notices the node loss
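The vote arithmetic behind case (1) can be sketched as follows. This is a simplified model of the design, not actual cman code; it assumes the common configuration of one vote per node, a one-vote quorum device, and a quorum threshold of 2:

```shell
#!/bin/sh
# Simplified model of master-wins vote accounting (illustrative, not cman code).
QUORUM=2      # votes needed to stay quorate
NODE_VOTE=1   # each node's own vote
QDISK_VOTE=1  # quorum device vote

# After a network partition, each node counts only the votes it holds:
# the master has been advertising the qdisk vote to CMAN, the slave never did.
master_total=$((NODE_VOTE + QDISK_VOTE))
slave_total=$NODE_VOTE

if [ "$master_total" -ge "$QUORUM" ]; then
    echo "master: quorate, fences the slave"
fi
if [ "$slave_total" -lt "$QUORUM" ]; then
    echo "slave: inquorate, cannot start fencing"
fi
```

Only the master stays at or above the quorum threshold, so only one side of the partition can fence, which is exactly the asymmetry the design needs.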
Comment 10 Lon Hohberger 2009-09-29 09:36:21 EDT
Created attachment 363004 [details]
Pass 1
Comment 11 Gianluca Cecchi 2009-10-13 11:50:27 EDT
Any expected release date for these changes, or any timing for the expected QA tests?
Thanks,
Gianluca
Comment 13 Federico Simoncelli 2009-11-03 10:44:20 EST
I used the patch against cman-2.0.115-1.el5_4.3 and it's working fine: in the two-node cluster with no heuristics defined, the master wins. I'm going to install it on production servers too.
I'm looking forward to seeing this in an update for 5.4.
Comment 15 Lon Hohberger 2009-11-03 15:53:41 EST
http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=2d6c88823e2f2e663d4499769152cc0d21644d34

Reassigning to component owner for build.
Comment 16 Gianluca Cecchi 2009-11-04 01:46:16 EST
Ok - to apply and test this on my updated RHEL 5.4 test cluster, where do I get the patches for qdisk.5, qdisk.h and qdisk.c?
From yesterday's git, or from the patch file from the end of September?
Are they the same, or have other changes come in in the meantime?
Comment 17 Federico Simoncelli 2009-11-04 04:11:07 EST
Sadly I had no way to test what happens if you suddenly power off the master. The expected result is that the remaining node gets the qdisk's votes and becomes master without losing quorum (not even for a short time). Can anyone test this?
Comment 18 Lon Hohberger 2009-11-12 13:16:34 EST
Federico, yes -- but:

If your cman timings are wrong, and the qdiskd master failover does not occur before the surviving node notices that the rebooted master died, you will lose quorum.
Comment 19 Lon Hohberger 2009-11-12 14:01:50 EST
Example expected behaviors:

Qdiskd master node:

[root@molly ~]# cman_tool status
Version: 6.2.0
Config Version: 2781
Cluster Name: lolcats
Cluster Id: 13719
Cluster Member: Yes
Cluster Generation: 1284
Membership state: Cluster-Member
Nodes: 2
Expected votes: 3
Quorum device votes: 1
Total votes: 3
Quorum: 2  
Active subsystems: 8
Flags: Dirty 
Ports Bound: 0  
Node name: molly
Node ID: 1
Multicast addresses: 225.0.0.13 
Node addresses: 192.168.122.4 

Qdiskd non-master:

[root@frederick ~]# cman_tool status
Version: 6.2.0
Config Version: 2781
Cluster Name: lolcats
Cluster Id: 13719
Cluster Member: Yes
Cluster Generation: 1284
Membership state: Cluster-Member
Nodes: 2
Expected votes: 3
Total votes: 2
Quorum: 2  
Active subsystems: 8
Flags: Dirty 
Ports Bound: 0  
Node name: frederick
Node ID: 2
Multicast addresses: 225.0.0.13 
Node addresses: 192.168.122.5 

Notice that the Total votes values are NOT the same.  This is >>CORRECT<< for master-wins.

Failure test #1: hard-poweroff of 'molly', qdiskd master:

Nov 12 14:00:29 frederick qdiskd[1916]: <info> Assuming master role 
Nov 12 14:00:30 frederick qdiskd[1916]: <notice> Writing eviction notice for node 1 
Nov 12 14:00:30 frederick kernel: dlm: closing connection to node 1
Nov 12 14:00:31 frederick qdiskd[1916]: <notice> Node 1 evicted 
Nov 12 14:00:35 frederick openais[2522]: [TOTEM] The token was lost in the OPERATIONAL state. 

... this means 'frederick' took over qdiskd master role before CMAN noticed the master node was dead.  This is what we want.
Comment 20 Lon Hohberger 2009-11-12 15:06:54 EST
Failure test #2: Kill network between hosts (make sure fencing device is still accessible):

Non-master:

Nov 12 15:02:32 molly openais[2664]: [TOTEM] The token was lost in the OPERATIONAL state.
Nov 12 15:02:32 molly openais[2664]: [TOTEM] Receive multicast socket recv buffer size (258048 bytes).
Nov 12 15:02:32 molly openais[2664]: [TOTEM] Transmit multicast socket send buffer size (258048 bytes).
Nov 12 15:02:32 molly openais[2664]: [TOTEM] entering GATHER state from 2.
Nov 12 15:02:36 molly openais[2664]: [TOTEM] entering GATHER state from 0.
Nov 12 15:03:42 molly syslogd 1.4.1: restart.


Qdiskd master:

Nov 12 15:02:31 frederick openais[2522]: [TOTEM] The token was lost in the OPERATIONAL state.
Nov 12 15:02:31 frederick openais[2522]: [TOTEM] Receive multicast socket recv buffer size (258048 bytes).
Nov 12 15:02:31 frederick openais[2522]: [TOTEM] Transmit multicast socket send buffer size (258048 bytes).
Nov 12 15:02:31 frederick openais[2522]: [TOTEM] entering GATHER state from 2.
Nov 12 15:02:36 frederick openais[2522]: [TOTEM] entering GATHER state from 0. 
Nov 12 15:02:36 frederick openais[2522]: [TOTEM] Creating commit token because I am the rep. 
Nov 12 15:02:36 frederick openais[2522]: [TOTEM] Saving state aru 23 high seq received 23 
Nov 12 15:02:36 frederick openais[2522]: [TOTEM] Storing new sequence id for ring 51c 
Nov 12 15:02:36 frederick openais[2522]: [TOTEM] entering COMMIT state. 
Nov 12 15:02:36 frederick openais[2522]: [TOTEM] entering RECOVERY state. 
Nov 12 15:02:36 frederick openais[2522]: [TOTEM] position [0] member 192.168.122.5: 
Nov 12 15:02:36 frederick openais[2522]: [TOTEM] previous ring seq 1304 rep 192.168.122.4 
Nov 12 15:02:36 frederick openais[2522]: [TOTEM] aru 23 high delivered 23 received flag 1 
Nov 12 15:02:36 frederick openais[2522]: [TOTEM] Did not need to originate any messages in recovery. 
Nov 12 15:02:36 frederick openais[2522]: [TOTEM] Sending initial ORF token 
Nov 12 15:02:36 frederick fenced[2549]: molly not a cluster member after 0 sec post_fail_delay
Nov 12 15:02:36 frederick kernel: dlm: closing connection to node 1
Nov 12 15:02:36 frederick fenced[2549]: fencing node "molly"
Nov 12 15:02:36 frederick openais[2522]: [CLM  ] CLM CONFIGURATION CHANGE 
Nov 12 15:02:36 frederick openais[2522]: [CLM  ] New Configuration: 
Nov 12 15:02:36 frederick openais[2522]: [CLM  ]        r(0) ip(192.168.122.5)  
Nov 12 15:02:36 frederick openais[2522]: [CLM  ] Members Left: 
Nov 12 15:02:36 frederick openais[2522]: [CLM  ]        r(0) ip(192.168.122.4)  
Nov 12 15:02:36 frederick openais[2522]: [CLM  ] Members Joined: 
Nov 12 15:02:36 frederick openais[2522]: [CLM  ] CLM CONFIGURATION CHANGE 
Nov 12 15:02:36 frederick openais[2522]: [CLM  ] New Configuration: 
Nov 12 15:02:36 frederick openais[2522]: [CLM  ]        r(0) ip(192.168.122.5)  
Nov 12 15:02:36 frederick openais[2522]: [CLM  ] Members Left: 
Nov 12 15:02:36 frederick openais[2522]: [CLM  ] Members Joined: 
Nov 12 15:02:36 frederick openais[2522]: [SYNC ] This node is within the primary component and will provide service. 
Nov 12 15:02:36 frederick openais[2522]: [TOTEM] entering OPERATIONAL state. 
Nov 12 15:02:36 frederick openais[2522]: [CLM  ] got nodejoin message 192.168.122.5 
Nov 12 15:02:36 frederick openais[2522]: [CPG  ] got joinlist message from node 2 
Nov 12 15:02:38 frederick fenced[2549]: fence "molly" success
Nov 12 15:02:48 frederick qdiskd[1916]: <notice> Writing eviction notice for node 1 
Nov 12 15:02:49 frederick qdiskd[1916]: <notice> Node 1 evicted 



Note that qdiskd notices the death -after- CMAN in this case, which is expected behavior.  Since qdiskd was still operating on both nodes, it did not notice the other instance of qdiskd going away until after CMAN had expired the node.
Comment 21 Lon Hohberger 2009-11-12 15:08:42 EST
(In reply to comment #19)
> Example expected behaviors:
> 
> Qdiskd master node:
> 

Clustat output will show the quorum disk as 'online' only on the qdiskd master node:

[root@molly ~]# clustat
Cluster Status for lolcats @ Thu Nov 12 15:07:43 2009
Member Status: Quorate

 Member Name                                               ID   Status
 ------ ----                                               ---- ------
 molly                                                         1 Online, Local
 frederick                                                     2 Online
 /dev/hdb1                                                     0 Offline, Quorum Disk

[root@frederick ~]# clustat
Cluster Status for lolcats @ Thu Nov 12 15:05:12 2009
Member Status: Quorate

 Member Name                                               ID   Status
 ------ ----                                               ---- ------
 molly                                                         1 Online
 frederick                                                     2 Online, Local
 /dev/hdb1                                                     0 Online, Quorum Disk


This is expected behavior per design.  Only if the master node fails will the quorum disk become 'online' on the other cluster member.
Comment 22 Lon Hohberger 2009-11-12 15:09:31 EST
Other notes about master_wins mode:

* two node clusters only
* configuration of a heuristic will disable master_wins mode
Comment 24 Chris Ward 2010-02-11 05:03:03 EST
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~

RHEL 5.5 Beta has been released! There should be a fix present in this 
release that addresses your request. Please test and report back results 
here, by March 3rd 2010 (2010-03-03) or sooner.

Upon successful verification of this request, post your results and update 
the Verified field in Bugzilla with the appropriate value.

If you encounter any issues while testing, please describe them and set 
this bug into NEED_INFO. If you encounter new defects or have additional 
patch(es) to request for inclusion, please clone this bug per each request
and escalate through your support representative.
Comment 26 Issue Tracker 2010-02-17 02:04:08 EST
Event posted on 02-17-2010 04:04pm JST by tumeya

HP verified the 5.5 beta. 


This event sent from IssueTracker by tumeya 
 issue 379395
Comment 28 errata-xmlrpc 2010-03-30 04:40:23 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0266.html
Comment 30 Gianluca Cecchi 2010-07-27 04:00:05 EDT
With these rpm versions (the latest in the 5.5 branch):
cman-2.0.115-34.el5_5.1
openais-0.80.6-16.el5_5.2
rgmanager-2.0.52-6.el5

has the situation been reversed again?
In fact with a cluster.conf like this:
<cluster alias="oradwhstud" config_version="6" name="oradwhstud">
        <totem token="162000"/>
        <cman quorum_dev_poll="80000" expected_votes="3" two_node="0"/>
        <fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="20"/>
..
        <quorumd device="/dev/mapper/mpath0" interval="5" label="dwhstudquorum" log_facility="local4" log_level="7" tko="16" votes="1">
        </quorumd>
...
so with no heuristic configured,

I again get
Quorum device votes: 1
on both nodes

On node master for quorum:
Jul 26 18:35:18 oratest1 openais[7313]: [CLM  ] got nodejoin message 192.168.16.22 
Jul 26 18:35:18 oratest1 openais[7313]: [CLM  ] got nodejoin message 192.168.16.21 
Jul 26 18:35:18 oratest1 openais[7313]: [CPG  ] got joinlist message from node 1 
Jul 26 18:35:48 oratest1 qdiskd[7343]: <debug> Node 2 is UP 

On the second started one:
Jul 26 18:35:44 oratest2 qdiskd[6644]: <debug> Node 1 is UP 
Jul 26 18:35:49 oratest2 qdiskd[6644]: <info> Node 1 is the master 
Jul 26 18:35:55 oratest2 openais[6614]: [TOTEM] Retransmit List: 24  
Jul 26 18:35:55 oratest2 openais[6614]: [TOTEM] Retransmit List: 27  
Jul 26 18:36:00 oratest2 openais[6614]: [TOTEM] Retransmit List: 28  
Jul 26 18:36:00 oratest2 openais[6614]: [TOTEM] Retransmit List: 29  
Jul 26 18:36:05 oratest2 openais[6614]: [TOTEM] Retransmit List: 2b  
Jul 26 18:36:05 oratest2 openais[6614]: [TOTEM] Retransmit List: 2d  
Jul 26 18:36:05 oratest2 openais[6614]: [TOTEM] Retransmit List: 30  
Jul 26 18:36:05 oratest2 openais[6614]: [TOTEM] Retransmit List: 32  
Jul 26 18:36:40 oratest2 qdiskd[6644]: <info> Initial score 1/1 
Jul 26 18:36:40 oratest2 qdiskd[6644]: <info> Initialization complete 
Jul 26 18:36:40 oratest2 openais[6614]: [CMAN ] quorum device registered 
Jul 26 18:36:40 oratest2 qdiskd[6644]: <notice> Score sufficient for master operation (1/1; required=1); upgrading

If I interrupt the intra-cluster network I again get mutual fencing...
Thanks,
Gianluca
Comment 31 Lon Hohberger 2010-11-02 10:58:18 EDT
Gianluca, you forgot to set master_wins="1" in the <quorumd> tag.
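For reference, the <quorumd> line from comment 30 with the missing attribute added (a sketch: the master_wins attribute name comes from this comment, all other attribute values are copied unchanged from comment 30):

```xml
<quorumd device="/dev/mapper/mpath0" interval="5" label="dwhstudquorum"
         log_facility="local4" log_level="7" master_wins="1" tko="16"
         votes="1"/>
```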
Comment 32 Gianluca Cecchi 2010-11-02 11:58:14 EDT
OK,
I missed that detail, as it was in the man page attachment of comment #10 ...
Now I see it is not set by default...
Thanks
