Bug 705356

Summary: Nodes fence each other in a two-node cluster without two_node=1
Product: Red Hat Enterprise Linux 6
Reporter: Jaroslav Kortus <jkortus>
Component: cluster
Assignee: Fabio Massimo Di Nitto <fdinitto>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: high
Priority: high
Version: 6.0
CC: ccaulfie, cluster-maint, djansa, fdinitto, lhh, rpeterso, teigland
Target Milestone: rc
Hardware: Unspecified
OS: Unspecified
Fixed In Version: corosync-1.2.3-36.el6
Doc Type: Bug Fix
Doc Text: Do not document.
Last Closed: 2011-12-06 14:52:00 UTC

Description Jaroslav Kortus 2011-05-17 13:07:40 UTC
Description of problem:
Nodes fence each other even though they should not have the quorum to support that action.

Version-Release number of selected component (if applicable):
corosync-1.2.3-21.el6_0.2.x86_64
cman-3.0.12-23.el6_0.7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Start a 2-node cluster without any extra options, each node with votes="1"
2. pkill -9 corosync on node1
3. service cman stop; service cman start on node1
4. Watch node2 gain quorum and fence node1
  
Actual results:
node fenced

Expected results:
no fencing

Additional info:

<?xml version="1.0"?>
<cluster name="Z_Cluster" config_version="1">
  <cman>
  </cman>
  <fence_daemon post_join_delay="20" clean_start="0"/>
  <clusternodes>
    <clusternode name="z2" votes="1" nodeid="2">
      <fence>
        <method name="APC">
          <device name="apc" port="2"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="z4" votes="1" nodeid="4">
      <fence>
        <method name="WTI">
          <device name="wti" port="B1"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="apc" agent="fence_apc" ipaddr="x" login="x" passwd="x"/>
    <fencedevice name="wti" agent="fence_wti" ipaddr="y" login="y" passwd="y"/>
  </fencedevices>
</cluster>


Snap from node2 (pkill -9 corosync on node1):
May 17 07:49:49 z4 corosync[12390]:   [TOTEM ] A processor failed, forming new configuration.
May 17 07:49:51 z4 corosync[12390]:   [CMAN  ] quorum lost, blocking activity
May 17 07:49:51 z4 corosync[12390]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
May 17 07:49:51 z4 corosync[12390]:   [QUORUM] Members[1]: 4
May 17 07:49:51 z4 corosync[12390]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 17 07:49:51 z4 corosync[12390]:   [CPG   ] downlist received left_list: 1
May 17 07:49:51 z4 corosync[12390]:   [CPG   ] chosen downlist from node r(0) ip(10.15.89.17) 
May 17 07:49:51 z4 corosync[12390]:   [MAIN  ] Completed service synchronization, ready to provide service.
May 17 07:49:51 z4 kernel: dlm: closing connection to node 2





node2 (service cman start on node1):
May 17 07:50:29 z4 corosync[12390]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 17 07:50:29 z4 corosync[12390]:   [CMAN  ] quorum regained, resuming activity
May 17 07:50:29 z4 corosync[12390]:   [QUORUM] This node is within the primary component and will provide service.
May 17 07:50:29 z4 corosync[12390]:   [QUORUM] Members[2]: 2 4
May 17 07:50:29 z4 corosync[12390]:   [QUORUM] Members[2]: 2 4
May 17 07:50:30 z4 corosync[12390]:   [CPG   ] downlist received left_list: 0
May 17 07:50:30 z4 corosync[12390]:   [CPG   ] downlist received left_list: 0
May 17 07:50:30 z4 corosync[12390]:   [CPG   ] chosen downlist from node r(0) ip(10.15.89.15) 
May 17 07:50:30 z4 corosync[12390]:   [MAIN  ] Completed service synchronization, ready to provide service.
May 17 07:50:30 z4 fenced[12446]: fencing node z2
May 17 07:50:35 z4 fenced[12446]: fence z2 success
May 17 07:50:41 z4 corosync[12390]:   [TOTEM ] A processor failed, forming new configuration.
May 17 07:50:43 z4 corosync[12390]:   [CMAN  ] quorum lost, blocking activity

Comment 2 Fabio Massimo Di Nitto 2011-05-18 08:23:13 UTC
Hi,

I have tested this setup and I can reproduce the behavior described.

This is effectively not a bug. The configuration is unsupported and untested.

In a two-node cluster, the configuration must contain either the
<cman two_node="1" expected_votes="1"/>
config entry or a quorum disk configuration.
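
As an illustration, applied to the cluster.conf from the original report the special-case entry would sit at the top level roughly like this (a sketch only; config_version is bumped as usual when editing):

```xml
<cluster name="Z_Cluster" config_version="2">
  <cman two_node="1" expected_votes="1"/>
  <!-- clusternodes / fencedevices sections unchanged from the
       configuration posted in the original report -->
</cluster>
```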

I agree with Jan that in theory we should prevent this situation from happening, and the only options we have are:

1) refuse to start cman if the node count is 2 and there is no two_node/expected_votes or quorum disk

2) automatically enable two_node/expected_votes if the node count is 2 and qdisk is missing.

Lon, we can either close this bug as NOTABUG/WONTFIX or implement option 1 or 2 as suggested above. Any preference?

Comment 3 David Teigland 2011-05-18 14:56:07 UTC
Fabio, a two-node cluster with normal quorum is perfectly legitimate and will work just fine.  If that doesn't work properly, nothing else will either.

The original report does not look correct, though.  It's not consistent with what we expect or what I see when I do it.  I get the expected result, which is:

node1,node2: service cman start
node1: pkill -9 corosync
node2: loses quorum, you should see this:

[root@bull-02 ~]# group_tool -n
fence domain
member count  1   
victim count  1   
victim now    0   
master nodeid 1   
wait state    quorum
members       1 2 
all nodes
nodeid 1 member 0 victim 1 last fence master 0 how none
nodeid 2 member 1 victim 0 last fence master 0 how none

node1: service cman start
node2: gets quorum, sees node1 has rejoined cleanly, so skips running the fence agent action against it (it no longer needs fencing since it just rebooted cleanly), you should see this:

[root@bull-02 ~]# group_tool -n
fence domain
member count  2
victim count  0
victim now    0
master nodeid 2
wait state    none
members       1 2
all nodes
nodeid 1 member 1 victim 0 last fence master 2 how member
nodeid 2 member 1 victim 0 last fence master 0 how none


In the original report, it appears that node2 actually went ahead and carried out the fencing action against node1 instead of skipping it.  "fence_tool dump" from node2 would probably show us why node1 did not escape fencing when we expect it would due to rejoining cleanly.

Keep in mind that if the dlm is being used, then that changes things -- you would need to insert a reboot of node1 after killing corosync and before restarting cman to get this same result.

(If you don't reboot, then restarting the cman service should fail toward the end when it sees the last cluster was never cleaned up.  Restarting the cman service may get far enough, however, for node2 to briefly get quorum and execute the fence agent against node1.)

Comment 4 Fabio Massimo Di Nitto 2011-05-18 15:23:44 UTC
(In reply to comment #3)
> Fabio, a two-node cluster with normal quorum is perfectly legitimate and will
> work just fine.  If that doesn't work properly, nothing else will either.

Do you mind explaining what you mean here? I am not sure I am on the same page.

> 
> The original report does not look correct, though.  It's not consistent with
> what we expect or what I see when I do it.  I get the expected result, which
> is:
> 
> node1,node2: service cman start
> node1: pkill -9 corosync
> node2: loses quorum, you should see this:
> 
> [root@bull-02 ~]# group_tool -n
> fence domain
> member count  1   
> victim count  1   
> victim now    0   
> master nodeid 1   
> wait state    quorum
> members       1 2 
> all nodes
> nodeid 1 member 0 victim 1 last fence master 0 how none
> nodeid 2 member 1 victim 0 last fence master 0 how none
> 
> node1: service cman start
> node2: gets quorum, sees node1 has rejoined cleanly, so skips running the fence
> agent action against it (it no longer needs fencing since it just rebooted
> cleanly), you should see this:
> 
> [root@bull-02 ~]# group_tool -n
> fence domain
> member count  2
> victim count  0
> victim now    0
> master nodeid 2
> wait state    none
> members       1 2
> all nodes
> nodeid 1 member 1 victim 0 last fence master 2 how member
> nodeid 2 member 1 victim 0 last fence master 0 how none
> 
> 
> In the original report, it appears that node2 actually went ahead and carried
> out the fencing action against node1 instead of skipping it.  "fence_tool dump"
> from node2 would probably show us why node1 did not escape fencing when we
> expect it would due to rejoining cleanly.

If node1 left the cluster, why is it still part of "fence_tool ls" from node2?

I noticed that node1's fenced did not exit/die after killing corosync. All other daemons were gone. In one test I left it there; in another run I killed it first.

No node reboots happened in between.

Starting cman did not even complete (I had got as far as starting dlm_controld when the fence kicked in), but that could just be a timing matter.

> 
> Keep in mind that if the dlm is being used, then that changes things -- you
> would need to insert a reboot of node1 after killing corosync and before
> restarting cman to get this same result.

I didn't use any service on top; it was a clean cman start operation with no extra config. Though there were "dlm connection" entries in the log (gfs_controld? dlm_controld?).

So are you suggesting there is a bug in fenced that does indeed execute a fence action when it's not supposed to?

I can grab fence_tool dump and logs tomorrow.

Comment 5 David Teigland 2011-05-18 15:57:12 UTC
> Do you mind explaining what you mean here? I am not sure I am on the same page.

Sure, a cluster with 2 nodes should really behave no differently from one with 4, 6, 8, or 16 nodes; the quorum algorithm is the same when half the nodes are gone.  The two_node and qdisk options are special-case hacks that subvert standard quorum behavior and bring various complicated side effects instead of the "clean" behavior of normal quorum.
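
A minimal sketch of that standard quorum arithmetic (the usual floor(expected_votes/2)+1 formula in illustrative shell; not taken from the cman source):

```shell
# Standard quorum: integer division, so quorum = expected_votes / 2 + 1.
expected_votes=2   # two nodes, one vote each, no two_node special case
quorum=$(( expected_votes / 2 + 1 ))
surviving_votes=1  # one node left after the other is killed
if [ "$surviving_votes" -ge "$quorum" ]; then
    echo quorate
else
    echo inquorate   # this branch is taken: 1 vote < quorum of 2
fi
```

With two votes total the quorum is 2, so the single survivor is inquorate and must block, which is exactly the "clean" behavior described above.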

> If node1 left the cluster, why is it still part of "fence_tool ls" from node2?

node1 left the cluster prior to the first fence_tool ls, which shows it is not a member.  node1 rejoined the cluster prior to the second fence_tool ls, which shows it is a member.

> I noticed that node1 fenced did not exit/die after killing corosync. All other
> daemons were gone.

Ah, good catch.  fenced should definitely exit; if it's not exiting, that's something we should fix, and I'd start by stracing it.  If fenced does not exit and you then run service cman start a second time, the results are going to be undefined and not something we want to deal with.  I'm guessing that explains what happened here: fenced did not exit from the first cluster start, which means the result (node1 got fenced) is the correct one.

I've seen bugs like this in the past (daemon not exiting when corosync dies), and they have most often been stuck in a corosync library.
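
A quick way to spot the lingering-daemon condition before attempting a second cman start could look like this (a sketch; the daemon names are the ones appearing in this report's logs):

```shell
# Sketch: list any cluster daemons that survived a corosync kill.
# A leftover fenced makes a second "service cman start" undefined.
leftover=""
for d in fenced dlm_controld gfs_controld; do
    if pgrep -x "$d" >/dev/null 2>&1; then
        leftover="$leftover $d"
    fi
done
if [ -n "$leftover" ]; then
    echo "still running:$leftover"
else
    echo "all cluster daemons have exited"
fi
```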

Comment 6 Fabio Massimo Di Nitto 2011-05-18 17:05:07 UTC
(In reply to comment #5)
> > Do you mind explaining what you mean here? I am not sure I am on the same page.
> 
> Sure, a cluster with 2 nodes should really behave no differently from one with
> 4,6,8 or 16 nodes; the quorum algorithm is the same when half the nodes are
> gone.  The two_node and qdisk options are the special case hacks that subvert
> standard quorum behavior and bring us various complicated side effects instead
> of the "clean" behavior of normal quorum.

OK, I understand what you are saying. It is just a slightly different view from mine: I see a two-node cluster as something that should never block when one node goes away (from a production/operational point of view).

> 
> > If node1 left the cluster, why is it still part of "fence_tool ls" from node2?
> 
> node1 left the cluster prior to the first fence_tool ls, which shows it is not
> a member.  node1 rejoined the cluster prior to the second fence_tool ls, which
> shows it is a member.
> 

Right, the node count is correct, but node1 is still shown as a member in both. Is that expected behaviour?

> > I noticed that node1 fenced did not exit/die after killing corosync. All other
> > daemons were gone.
> 
> Ah, good catch.  fenced should definitely exit, if it's not, then that's
> something we should fix, I'd start by stracing it.  If fenced does not exit,
> and then you run service cman start a second time, the results are going to be
> undefined and not something we want to deal with.  I'm guessing that explains
> what happened here -- fenced did not exit from the first cluster start, which
> means the result (node1 got fenced) is the correct one.
> 
> I've seen bugs like this in the past (daemon not exiting when corosync dies),
> and they have most often been stuck in a corosync library.

Only partially, because I tested both conditions.

In the first run fenced was still hanging around and I left it there (knowing the results are unpredictable).

In another run fenced was still there, but I killed it explicitly and then executed cman start, with the same result of node2 fencing node1.

As for handling the exit paths, we will need strace/gdb to find where we hang. Can you reproduce it locally, or do I need to provide the info?

Comment 7 Fabio Massimo Di Nitto 2011-05-27 07:49:03 UTC
(In reply to comment #5)

> > I noticed that node1 fenced did not exit/die after killing corosync. All other
> > daemons were gone.
> 
> Ah, good catch.  fenced should definitely exit, if it's not, then that's
> something we should fix, I'd start by stracing it.  If fenced does not exit,
> and then you run service cman start a second time, the results are going to be
> undefined and not something we want to deal with.  I'm guessing that explains
> what happened here -- fenced did not exit from the first cluster start, which
> means the result (node1 got fenced) is the correct one.
> 
> I've seen bugs like this in the past (daemon not exiting when corosync dies),
> and they have most often been stuck in a corosync library.

We will need to use another approach to debug this exit issue. When stracing the daemon, it always exits after pkill -9 corosync. Clearly some kind of race condition somewhere.

I suspect, but I can't be 100% sure, that the fencing action is taken because fenced does not exit correctly. Since I started killing the daemon and waiting a bit before restarting cman, I have never experienced the fence action described in the original report.

Comment 8 David Teigland 2011-05-27 14:03:09 UTC
I think you need to start strace only after the daemon gets stuck (unless that's what you did).

Comment 10 Fabio Massimo Di Nitto 2011-07-01 08:58:28 UTC
(In reply to comment #3)

> Keep in mind that if the dlm is being used, then that changes things -- you
> would need to insert a reboot of node1 after killing corosync and before
> restarting cman to get this same result.

I checked again and indeed dlm is in use somehow.

So: pkill -9 corosync -> reboot node -> cman start -> no fencing.

Running on RHEL 6.2, fenced does exit. I suspect that the corosync IPC might have changed (one fix posted to the openais mailing list describes this exact behavior).

[root@rhel6-node1 ~]# rpm -q -f /usr/sbin/corosync
corosync-1.2.3-36.el6

Comment 11 David Teigland 2011-07-01 14:38:46 UTC
OK it sounds like everything is fine now and we can close this.

Comment 12 Fabio Massimo Di Nitto 2011-07-01 14:47:37 UTC
(In reply to comment #11)
> OK it sounds like everything is fine now and we can close this.

I suggest we keep it open a bit longer and ask QE to see if they see similar issues in their test environments. Assuming it is a race condition, my environment alone is not authoritative enough.

Comment 16 Jaroslav Kortus 2011-10-24 16:03:40 UTC
I can't reproduce this with the latest 6.2 any more:

Oct 24 10:59:17 marathon-02 corosync[16767]:   [CMAN  ] quorum regained, resuming activity
Oct 24 10:59:17 marathon-02 corosync[16767]:   [QUORUM] This node is within the primary component and will provide service.
Oct 24 10:59:17 marathon-02 corosync[16767]:   [QUORUM] Members[2]: 1 2
Oct 24 10:59:17 marathon-02 corosync[16767]:   [QUORUM] Members[2]: 1 2
Oct 24 10:59:17 marathon-02 corosync[16767]:   [CPG   ] chosen downlist: sender r(0) ip(10.15.89.72) ; members(old:1 left:0)
Oct 24 10:59:17 marathon-02 corosync[16767]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 24 10:59:21 marathon-02 fenced[16825]: fenced 3.0.12.1 started
Oct 24 10:59:21 marathon-02 dlm_controld[16846]: dlm_controld 3.0.12.1 started
Oct 24 10:59:22 marathon-02 gfs_controld[16896]: gfs_controld 3.0.12.1 started
Oct 24 10:59:22 marathon-02 fence_node[16905]: unfence marathon-02 success
Oct 24 11:00:05 marathon-02 corosync[16767]:   [TOTEM ] A processor failed, forming new configuration.
Oct 24 11:00:07 marathon-02 corosync[16767]:   [CMAN  ] quorum lost, blocking activity
Oct 24 11:00:07 marathon-02 corosync[16767]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Oct 24 11:00:07 marathon-02 corosync[16767]:   [QUORUM] Members[1]: 2
Oct 24 11:00:07 marathon-02 corosync[16767]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Oct 24 11:00:07 marathon-02 corosync[16767]:   [CPG   ] chosen downlist: sender r(0) ip(10.15.89.72) ; members(old:2 left:1)
Oct 24 11:00:07 marathon-02 corosync[16767]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 24 11:00:07 marathon-02 kernel: dlm: closing connection to node 1
Oct 24 11:00:17 marathon-02 corosync[16767]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Oct 24 11:00:17 marathon-02 corosync[16767]:   [CMAN  ] quorum regained, resuming activity
Oct 24 11:00:17 marathon-02 corosync[16767]:   [QUORUM] This node is within the primary component and will provide service.
Oct 24 11:00:17 marathon-02 corosync[16767]:   [QUORUM] Members[2]: 1 2
Oct 24 11:00:17 marathon-02 corosync[16767]:   [QUORUM] Members[2]: 1 2
Oct 24 11:00:17 marathon-02 corosync[16767]:   [CPG   ] chosen downlist: sender r(0) ip(10.15.89.72) ; members(old:1 left:0)
Oct 24 11:00:17 marathon-02 corosync[16767]:   [MAIN  ] Completed service synchronization, ready to provide service.


This is expected behaviour. Marking this as verified with:
cman-3.0.12.1-23.el6.x86_64
corosync-1.4.1-4.el6.x86_64

Comment 17 Fabio Massimo Di Nitto 2011-10-27 08:14:47 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Do not document.

Comment 18 errata-xmlrpc 2011-12-06 14:52:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1516.html