Bug 684838

Summary: After fencing of a failed node 'fenced' daemon leaves 'fenced:default' cpg on all remaining nodes
Product: Red Hat Enterprise Linux 6
Reporter: Andrew Beekhof <abeekhof>
Component: pacemaker
Assignee: Andrew Beekhof <abeekhof>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: urgent
Priority: low
Version: 6.1
CC: abeekhof, agk, andrew, bubble, cluster-maint, djansa, fdinitto, lhh, sdake, swhiteho, teigland
Target Milestone: rc
Keywords: TechPreview
Hardware: x86_64
OS: Linux
Fixed In Version: pacemaker-1.1.5-3.el6
Doc Type: Technology Preview
Doc Text:
Previously, when pacemaker fenced a node, it would restart the fence domain while attempting to notify other nodes of the fencing event. This would put the fence domain into an incorrect state, blocking any further recovery. With this update, an upstream patch has been applied to address this issue. As a result, node recovery completes successfully.
Clone Of: 664958
Last Closed: 2011-05-19 13:49:42 UTC
Bug Depends On: 664958

Description Andrew Beekhof 2011-03-14 15:58:35 UTC
+++ This bug was initially created as a clone of Bug #664958 +++

Created attachment 470166 [details]
cluster.conf

Description of problem:
I'm evaluating a 4-node cman+pacemaker cluster. Since I'm interested in fail-over continuing to work even if some nodes die, I'm testing how the cluster behaves when I reboot or ipmi_poweroff one of the nodes.
I noticed that shortly after cman realizes that the powered-off node is absent, all dlm-related operations hang forever on the other nodes. After the failed node boots again, the cluster begins a fencing war that can only be stopped by a synchronous power off/power on of all cluster nodes.
Further digging shows that right after the failed node is correctly fenced, fenced tries to join the 'fenced:default' cpg, fails with a 'join error: domain default exists' message, and then does 'cpg_leave fenced:default ...' and 'confchg for our leave'.

Version-Release number of selected component (if applicable):
kernel-2.6.34.7-63.fc13.x86_64
corosync-1.3.0-1.fc13
openais-1.1.4-1.fc13
cman-3.1.0-6.fc13
lvm2-cluster-2.02.78-1.fc13 (built from fc15 srpm, results are the same with stock fc13 rpm)
gfs2-cluster-3.1.0-3.fc13
pacemaker-1.1.4-4.fc13

How reproducible:
100%

Steps to Reproduce:
1. Setup and power on 4-node cman cluster (dlm+gfs2+clvm).
2. Power one node off via IPMI.
  
Actual results:
Shortly after the powered-off node is fenced, 'fence_tool ls' shows nothing. dlm-related resources hang. The cluster begins a fencing war after the fenced node reboots.

Expected results:
The fenced node rejoins the cluster after reboot.
Until then, cluster resources keep running on the remaining nodes.

Additional info:
I run clvmd and one gfs2 filesystem over dlm.
dlm communicates via sctp.
The gfs2 filesystem is mounted by pacemaker.
corosync uses the "udpu" transport.
I can't try 3.0.17 because of multicast problems with cisco switches (no support for udpu in 3.0.17).

--- Additional comment from bubble on 2010-12-22 04:45:12 EST ---

Created attachment 470168 [details]
fence_tool dump from first of remaining nodes

--- Additional comment from bubble on 2010-12-22 04:46:32 EST ---

Created attachment 470169 [details]
fence_tool dump from second of remaining nodes

--- Additional comment from bubble on 2010-12-22 04:47:32 EST ---

Created attachment 470170 [details]
fence_tool dump from the last of remaining nodes

--- Additional comment from bubble on 2010-12-22 05:04:45 EST ---

Forgot to mention that if I issue 'fence_tool join' (on all remaining nodes) shortly after 'fence_tool ls' becomes empty, everything works as it should.

--- Additional comment from teigland on 2011-01-03 11:33:17 EST ---

You need to come at this from the opposite direction: assume nothing works and add things until it breaks.  i.e. get rid of udpu and sctp, just start cman, and see how things work.  If they work perfectly, then add or change one small thing at a time and test, and see when something breaks.

(don't add udpu or sctp until the very end when everything else works perfectly, since those are highly experimental features.)

--- Additional comment from sdake on 2011-01-03 11:44:16 EST ---

Fabio,

does cman 3.1 have the udpu patches in it?

--- Additional comment from lhh on 2011-01-03 11:56:23 EST ---

CMAN's support of enabling udpu in corosync is in 3.1.0, yes:

http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=c399f4c0f0d7cc4467a68c41715c404b2afb9425

--- Additional comment from bubble on 2011-01-04 03:10:55 EST ---

(In reply to comment #5)
Will try today. 

> (don't add udpu or sctp until the very end when everything else works
> perfectly, since those are highly experimental features.)

Unfortunately, I hit http://marc.info/?l=openais&m=129050993731907&w=2 when I try multicast, so udpu is the only solution for me until I get rid of that assert (which is followed by corosync stopping on all nodes). I thought the c3750-x should be stable in all layer-2 use cases, but that seems not to be the case (unless Dejan was right in http://marc.info/?l=openais&m=129311034019165&w=2 and this is a corosync bug). I will try to simply remove that assert and try with mcast again.

And I'd say that I was absolutely happy with udpu in corosync until the .pcmk variants of dlm_controld and gfs_controld were removed in 3.1.0 and I was forced to switch to cman to stay in sync with upstream.

As a side note, there are no visible problems with sctp (compared to tcp): I see all the expected associations, connections are established, and everything just works.

Dave, could you please look at the dumps again? I suspect that the code path I hit should not occur - fenced tries to join an already joined cpg. What could be the root cause of this? DLM clients (clvmd, gfs2)? dlm_controld? fenced itself? At the same moment all other parts of the cluster software work fine, just waiting for the fence domain, and if I immediately join the nodes to the fence domain by hand, everything remains stable (dlm, clvm, gfs2, pacemaker).

Anyway, I will try the following (with forced power off of node at each step after cluster stabilizes):
1) Try cman without pacemaker and udpu, mount iscsi-exported GFS2 partition without clvm.
2) Rebuild corosync with that assert removed and try again if corosync fails.
3) Switch back to udpu if corosync still fails.
4) Remove gfs2, add clvm.
5) Add gfs2 on clvm lv.
6) Switch fencing handling back to pacemaker and run it.

This all will require some time (I need to rebuild PXE live image for each step), so please stay tuned.

--- Additional comment from teigland on 2011-01-04 10:42:41 EST ---

> Unfortunately, I hit http://marc.info/?l=openais&m=129050993731907&w=2 when I
> try multicast, so udpu is the only solution for me until I get rid of that
> assert (followed by corosync stop on all nodes)

That's due to rrp/sctp which you also need to disable.

> Dave, could you please look again at dumps, I suspect that code path I hit
> should not occur - fenced tries to join already joined cpg. What can be a root
> of this?

If turning off udpu/rrp/sctp doesn't fix things, then it sounds like a problem with the way pacemaker/cman/fencing are being integrated rather than a specific bug.

--- Additional comment from lhh on 2011-01-04 11:05:49 EST ---

Apart from udpu, you could use broadcast.  It's much better tested.  However, there is a bug in cman-preconfig (which I fixed yesterday) with respect to using transport="udpb"; so you will have to enable it the old way:

   <cman broadcast="yes" />
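
(For illustration only: a minimal, hypothetical cluster.conf fragment showing where that element sits. The cluster and node names below are placeholders, and a real configuration would also need fencing devices.)

   <?xml version="1.0"?>
   <cluster name="example" config_version="1">
     <!-- old-style broadcast switch mentioned above -->
     <cman broadcast="yes"/>
     <clusternodes>
       <clusternode name="node1" nodeid="1"/>
       <clusternode name="node2" nodeid="2"/>
       <clusternode name="node3" nodeid="3"/>
       <clusternode name="node4" nodeid="4"/>
     </clusternodes>
   </cluster>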

--- Additional comment from bubble on 2011-01-10 07:17:19 EST ---

Dave, you are right, this seems to be pacemaker-specific.

I tried almost every combination, and I was able to make the cluster stable in every case except when pacemaker runs.
I was able to mount a GFS2 partition on an iSCSI resource, then I tried a clustered LV on another iSCSI resource and mounted GFS2 from it, and at every step everything ran smoothly. No matter whether multicast, udpu or broadcast, dlm over tcp or sctp, everything just works.

pacemaker is 1.1.4-0bc69fcea1d7c69f332e60d9da84538c08ac4f3c

Andrew, could you please take a look? I can provide any additional information on request. We can also move to mailing list if you prefer.

One more strange thing: after I run pacemaker and the fencing war begins, crm still shows:
-------------
Last updated: Mon Jan 10 12:00:01 2011
Stack: cman
Current DC: v02-b - partition with quorum
Version: 1.1.4-0bc69fcea1d7c69f332e60d9da84538c08ac4f3c
4 Nodes configured, 4 expected votes
25 Resources configured.
============

Online: [ v02-b v02-c ]
OFFLINE: [ v02-a v02-d ]

 stonith-v02-a	(stonith:fence_ipmilan):	Started v02-b
 stonith-v02-c	(stonith:fence_ipmilan):	Started v02-b
 stonith-v02-d	(stonith:fence_ipmilan):	Started v02-b
-------------

Note "partition with quorum" with only half of nodes running.

--- Additional comment from fdinitto on 2011-01-10 07:22:31 EST ---

reassigning to pacemaker for Andrew to look at.

--- Additional comment from andrew on 2011-01-11 02:26:21 EST ---

There is a big difference between being exposed by pacemaker and actually being a pacemaker bug.

It's hard to imagine how running "fence_tool join" would work around a bug in pacemaker.

--- Additional comment from bubble on 2011-01-11 03:46:50 EST ---

Andrew, I didn't say it is a pacemaker bug, but rather that it looks like a "pacemaker-specific" issue. And I simply need this all to work ;)

I should correct myself a bit: "fence_tool join" does not seem to help every time. I definitely saw correct output from "cman_tool services" earlier, where all remaining nodes rejoined the fencing domain correctly. Now I see the following (I added one more node on an external VM because I need to go further with this project):
On the fencing master node (id 5, which didn't leave the fencing domain for some reason - probably because the result of the fencing operation didn't reach that node):
fence domain
member count  4
victim count  1
victim now    6
master nodeid 5
wait state    fencing
members       5 6 7 8 17 

On other nodes (7, 8, 17, after "fence_tool join"):
fence domain
member count  4
victim count  1
victim now    0
master nodeid 0
wait state    messages
members       

And this is everywhere:
name          clvmd
id            0x4104eefa
flags         0x00000004 kern_stop
change        member 5 joined 1 remove 0 failed 0 seq 5,5
members       5 6 7 8 17 
new change    member 4 joined 0 remove 1 failed 1 seq 6,6
new status    wait_messages 0 wait_condition 1 fencing
new members   5 7 8 17 

This behaviour could be related to whether the fencing master and the pacemaker stonith resource for the failed node reside on the same node or not - this time the stonith resource ran elsewhere.

OK, so what do I need to do next to have this fixed?

--- Additional comment from andrew on 2011-01-14 05:21:46 EST ---

(In reply to comment #14)
> Andrew, I didn't say it is a pacemaker bug, but rather that it looks like a
> "pacemaker-specific" issue. And I simply need this all to work ;)

Not everyone appreciates the difference :-)

> This behaviour could be related to whether fencing master and pacemaker stonith
> resource for the failed node reside on the same node or not - this time stonith
> resource was run elsewhere.

I was wondering if this might be the case.

David, could you comment on this?
Is it possible to notify the "wrong" instance of fenced? If so, is this the sort of result you'd expect?

--- Additional comment from andrew on 2011-01-14 05:44:25 EST ---

I had a closer look at the Pacemaker code; it seems that every surviving node calls

	    rc = fenced_external(target_copy);

Is this bad?
If so, should we only be making that call on the fence leader or just any one peer?

--- Additional comment from teigland on 2011-01-14 11:43:33 EST ---

fenced_external() is meant to be called from just the node that did the fencing, although as I look at the code I don't see why calling it from everywhere would cause a problem.

Looking back at the comments, I get the impression that some attempt is being made to correct or clean up the condition of the cluster after the problems have begun.  If you want any hope of sorting this out, *don't do anything* after the node is killed, except for collecting the logs/status.  Otherwise, we have no idea which data is related to the original problem, and which is related to the attempt to resolve things.

The specific data I'd like to see is, after pacemaker/stonith have finished fencing the node, and have notified fenced (called fenced_external), without doing anything else, get the output of "fence_tool ls" and "fence_tool dump" from each of the nodes.
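
(Editorial sketch, not the actual upstream patch: one way to restrict the fenced_external() call to the node that carried out the fencing, as discussed in the two comments above. Only fenced_external() itself comes from libfenced; the surrounding function and variable names are hypothetical.)

   #include <string.h>
   #include <libfenced.h>

   /* Hypothetical helper: notify fenced that a node was fenced externally,
    * but only from the node that actually executed the fencing. */
   static void notify_fenced(const char *target, const char *executioner,
                             const char *local_node)
   {
       char target_copy[256];

       /* Skip the notification on every peer except the executioner;
        * calling fenced_external() from all survivors is what was
        * suspected of confusing the fence domain. */
       if (executioner == NULL || strcmp(executioner, local_node) != 0) {
           return;
       }

       strncpy(target_copy, target, sizeof(target_copy) - 1);
       target_copy[sizeof(target_copy) - 1] = '\0';

       if (fenced_external(target_copy) < 0) {
           /* a real implementation would log the failure */
       }
   }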

--- Additional comment from andrew on 2011-02-22 07:25:43 EST ---

Vladislav - any chance you could reproduce and give Dave the data he requested in comment #17?

--- Additional comment from bubble on 2011-02-22 07:44:19 EST ---

Ahm, if I knew how to do it without altering the pacemaker code to satisfy the "without doing anything else" requirement, I'd try it on my new testing cluster.

Or am I missing something, and nothing needs to be touched in the code?

If so, then the dumps already attached should be close enough to what Dave requests.
I'm almost (99%) sure that I didn't do any manual intervention while gathering those dumps. I also have 'dlm_tool dump' and 'gfs_control dump' output from all three surviving nodes if they are needed.

The cluster where I originally saw the problem is already near production state (with the .pcmk controld's rebuilt from 3.0.17), and I cannot switch stacks there anymore.

Damn, I already rebuilt pacemaker without support for cman and removed all cman bits from the live image I use to boot the cluster nodes (testing ones too) via PXE, so it would require some effort from me to put it all back.

--- Additional comment from andrew on 2011-02-22 09:16:29 EST ---

(In reply to comment #19)
> Ahm, If I knew how to do it without altering pacemaker code to satisfy "without
> doing anything else" statement, I'd try on my new testing cluster.
> 
> Or I miss something and nothing is needed to be touched in code?

Correct. See the end of comment #17

[quote]
The specific data I'd like to see is, after pacemaker/stonith have finished
fencing the node, and have notified fenced (called fenced_external), without
doing anything else, get the output of "fence_tool ls" and "fence_tool dump"
from each of the nodes.
[/quote]

> If yes, then dumps already attached should be close enough to what Dave
> requests.

Close enough isn't always good enough, I'm afraid.

--- Additional comment from bubble on 2011-02-22 09:32:46 EST ---

(In reply to comment #20)
> Close enough isn't always good enough I'm afraid.

I understand. Sorry, can't provide anything more valuable right now, see end of comment #19.

Comment 1 Andrew Beekhof 2011-03-14 16:00:25 UTC
We have a fix from upstream that David has verified to be correct.
It is highly important that fencing works.

Comment 4 Jaromir Hradilek 2011-04-26 14:47:32 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Previously, rebooting a node in a CMAN managed cluster could cause the fencing daemon to keep the "fenced:default" CPG group on the remaining nodes, leaving the cluster in an inconsistent state. With this update, an upstream patch has been applied to address this issue. As a result, when a node is rebooted and leaves a cluster, the cluster resources correctly run on remaining nodes.

Comment 5 David Teigland 2011-04-26 15:48:40 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1 @@
-Previously, rebooting a node in a CMAN managed cluster could cause the fencing daemon to keep the "fenced:default" CPG group on the remaining nodes, leaving the cluster in an inconsistent state. With this update, an upstream patch has been applied to address this issue. As a result, when a node is rebooted and leaves a cluster, the cluster resources correctly run on remaining nodes.+Previously, when pacemaker fenced a node, it would restart the fence domain while attempting to notify other nodes of the fencing event.  This would put the fence domain into an incorrect state, blocking any further recovery.  With this update, an upstream patch has been applied to address this issue. As a result, node recovery completes successfully.

Comment 6 errata-xmlrpc 2011-05-19 13:49:42 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0642.html