Created attachment 470166 [details]
Description of problem:
I'm evaluating a 4-node cman+pacemaker cluster. Because I need fail-over to keep working even when some nodes die, I'm testing how the cluster behaves when I reboot or IPMI-power-off one of the nodes.
I noticed that shortly after cman realizes the powered-off node is absent, all dlm-related operations get stuck on the other nodes forever. After the failed node boots again, the cluster begins a fencing war that can be stopped only by powering all cluster nodes off and back on simultaneously.
Further digging shows that right after the failed node is correctly fenced, fenced tries to join the 'fenced:default' cpg, fails with a 'join error: domain default exists' message, and then does 'cpg_leave fenced:default ...' and 'confchg for our leave'.
Version-Release number of selected component (if applicable):
lvm2-cluster-2.02.78-1.fc13 (built from fc15 srpm, results are the same with stock fc13 rpm)
Steps to Reproduce:
1. Setup and power on 4-node cman cluster (dlm+gfs2+clvm).
2. Power one node off via IPMI.
Actual results:
Shortly after the powered-off node is fenced, 'fence_tool ls' shows nothing and dlm-related resources are stuck. The cluster begins a fencing war after the fenced node reboots.
Expected results:
The fenced node rejoins the cluster after reboot; until then, cluster resources run on the remaining nodes.
I run clvmd and one gfs2 filesystem over dlm.
dlm communicates via sctp.
gfs2 filesystem is mounted by pacemaker.
corosync uses "udpu" transport.
I can't try 3.0.17 because of multicast problems with Cisco switches (3.0.17 has no udpu support).
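For reference, the relevant parts of this setup can be sketched roughly like this in cluster.conf (a minimal sketch: the cluster name and fence configuration are placeholders; only transport="udpu" and the dlm sctp protocol are the points of interest here):

```xml
<cluster name="testcluster" config_version="1">
  <!-- udpu transport for corosync, enabled via cman (cman >= 3.1.0) -->
  <cman transport="udpu"/>
  <!-- dlm_controld: use sctp instead of the default tcp -->
  <dlm protocol="sctp"/>
  <clusternodes>
    <clusternode name="v02-a" nodeid="1"/>
    <clusternode name="v02-b" nodeid="2"/>
    <clusternode name="v02-c" nodeid="3"/>
    <clusternode name="v02-d" nodeid="4"/>
  </clusternodes>
</cluster>
```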
Created attachment 470168 [details]
fence_tool dump from first of remaining nodes
Created attachment 470169 [details]
fence_tool dump from second of remaining nodes
Created attachment 470170 [details]
fence_tool dump from the last of remaining nodes
Forgot to mention: if I issue 'fence_tool join' on all remaining nodes shortly after 'fence_tool ls' becomes empty, everything works as it should.
You need to come at this from the opposite direction: assume nothing works and add things until something breaks. I.e., get rid of udpu and sctp, just start cman, and see how things work. If they work perfectly, then add or change one small thing at a time, test, and see when something breaks.
(don't add udpu or sctp until the very end when everything else works perfectly, since those are highly experimental features.)
does cman 3.1 have the udpu patches in it?
CMAN's support of enabling udpu in corosync is in 3.1.0, yes:
(In reply to comment #5)
Will try today.
> (don't add udpu or sctp until the very end when everything else works
> perfectly, since those are highly experimental features.)
Unfortunately, I hit http://marc.info/?l=openais&m=129050993731907&w=2 when I try multicast, so udpu is the only solution for me until I get rid of that assert (which is followed by a corosync stop on all nodes). I thought a C3750-X should be stable in all layer-2 use cases, but that seems not to be the case (unless Dejan was right in http://marc.info/?l=openais&m=129311034019165&w=2 and this is a corosync bug). I will try simply removing that assert and testing with mcast again.
And I'd say I was absolutely happy with udpu in corosync, until the .pcmk variants of dlm_controld and gfs_controld were removed in 3.1.0 and I was forced to switch to cman to stay in sync with upstream.
As a side note, there are no visible problems with sctp (compared to tcp): I see all the expected associations, connections are established, and everything just works.
Dave, could you please look at the dumps again? I suspect the code path I hit should not occur: fenced tries to join a cpg it has already joined. What could be the root cause here? The DLM clients (clvmd, gfs2)? dlm_controld? fenced itself? Meanwhile, all other parts of the cluster software work fine, just waiting for the fence domain, and if I immediately join the nodes to the fence domain by hand, everything remains stable (dlm, clvm, gfs2, pacemaker).
Anyway, I will try the following (with a forced power-off of a node at each step, after the cluster stabilizes):
1) Try cman without pacemaker and udpu; mount an iSCSI-exported GFS2 partition without clvm.
2) Rebuild corosync with that assert removed and try again if corosync fails.
3) Switch back to udpu if corosync still fails.
4) Remove gfs2, add clvm.
5) Add gfs2 on clvm lv.
6) Switch fencing handling back to pacemaker and run it.
All this will require some time (I need to rebuild the PXE live image for each step), so please stay tuned.
> Unfortunately, I hit http://marc.info/?l=openais&m=129050993731907&w=2 when I
> try multicast, so udpu is the only solution for me until I get rid of that
> assert (followed by corosync stop on all nodes)
That's due to rrp/sctp, which you also need to disable.
> Dave, could you please look again at dumps, I suspect that code path I hit
> should not occur - fenced tries to join already joined cpg. What can be a root
> of this?
If turning off udpu/rrp/sctp doesn't fix things, then it sounds like a problem with the way pacemaker/cman/fencing are being integrated rather than a specific bug.
Apart from udpu, you could use broadcast; it's much better tested. However, there is a bug in cman-preconfig (which I fixed yesterday) with respect to using transport="udpb", so you will have to enable it the old way:
<cman broadcast="yes" />
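In context, that attribute sits on the existing cman element in cluster.conf; roughly (a sketch, with placeholder name and version):

```xml
<cluster name="testcluster" config_version="2">
  <!-- old-style broadcast enable, avoiding the transport="udpb" bug
       in cman-preconfig mentioned above -->
  <cman broadcast="yes"/>
  <!-- clusternodes etc. unchanged -->
</cluster>
```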
Dave, you are right; this seems to be pacemaker-specific.
I tried almost every combination, and I was able to make the cluster stable in every case except when pacemaker runs.
I was able to mount a GFS2 partition on an iSCSI resource, then I tried a clustered LV on another iSCSI resource and mounted GFS2 from it, and at every step everything ran smoothly. No matter whether multicast, udpu, or broadcast, dlm over tcp or sctp, everything just works.
pacemaker is 1.1.4-0bc69fcea1d7c69f332e60d9da84538c08ac4f3c
Andrew, could you please take a look? I can provide any additional information on request. We can also move to mailing list if you prefer.
One more strange thing: after I start pacemaker and the fencing war begins, crm still shows:
Last updated: Mon Jan 10 12:00:01 2011
Current DC: v02-b - partition with quorum
4 Nodes configured, 4 expected votes
25 Resources configured.
Online: [ v02-b v02-c ]
OFFLINE: [ v02-a v02-d ]
stonith-v02-a (stonith:fence_ipmilan): Started v02-b
stonith-v02-c (stonith:fence_ipmilan): Started v02-b
stonith-v02-d (stonith:fence_ipmilan): Started v02-b
Note "partition with quorum" with only half of nodes running.
reassigning to pacemaker for Andrew to look at.
There is a big difference between being exposed by pacemaker and actually being a pacemaker bug.
It's hard to imagine how running "fence_tool join" would work around a bug in pacemaker.
Andrew, I didn't say it is a pacemaker bug, rather that it looks like a "pacemaker-specific" issue. And I simply need all of this to work ;)
I should correct myself a bit: "fence_tool join" does not seem to help every time. I definitely saw correct output from "cman_tool services" earlier, where all remaining nodes rejoined the fencing domain correctly. Now I see the following (I added one more node on an external VM because I need to go further with this project):
On the fencing master node (id 5, which didn't leave the fencing domain for some reason, probably because the result of the fencing operation didn't reach that node):
member count 4
victim count 1
victim now 6
master nodeid 5
wait state fencing
members 5 6 7 8 17
On the other nodes (7, 8, 17, after "fence_tool join"):
member count 4
victim count 1
victim now 0
master nodeid 0
wait state messages
And this is everywhere:
flags 0x00000004 kern_stop
change member 5 joined 1 remove 0 failed 0 seq 5,5
members 5 6 7 8 17
new change member 4 joined 0 remove 1 failed 1 seq 6,6
new status wait_messages 0 wait_condition 1 fencing
new members 5 7 8 17
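The split state above (the master stuck in "wait state fencing", the rejoined nodes in "wait state messages" with master nodeid 0) is easy to spot mechanically. A small sketch, assuming only the fence_tool ls output format shown above; the here-docs stand in for output collected from the real nodes:

```shell
#!/bin/sh
# Pull the "master nodeid" and "wait state" fields out of fence_tool ls
# output and flag disagreement between nodes.
parse() { awk '/^wait state/ {s=$3} /^master nodeid/ {m=$3} END {print m, s}'; }

node5=$(parse <<'EOF'
master nodeid 5
wait state fencing
EOF
)
node7=$(parse <<'EOF'
master nodeid 0
wait state messages
EOF
)

echo "node5: $node5"
echo "node7: $node7"
[ "$node5" = "$node7" ] || echo "fence domain state diverged"
```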
This behaviour could be related to whether the fencing master and the pacemaker stonith resource for the failed node reside on the same node or not; this time the stonith resource was running elsewhere.
OK, so what do I need to do next to get this fixed?
(In reply to comment #14)
> Andrew, I didn't say it is a pacemaker bug, but rather that it looks like a
> "pacemaker-specific" issue. And I simply need this all to work ;)
Not everyone appreciates the difference :-)
> This behaviour could be related to whether fencing master and pacemaker stonith
> resource for the failed node reside on the same node or not - this time stonith
> resource was run elsewhere.
I was wondering if this might be the case.
David, could you comment on this?
Is it possible to notify the "wrong" instance of fenced? If so, is this the sort of result you'd expect?
I had a closer look at the Pacemaker code; it seems that every surviving node calls
rc = fenced_external(target_copy);
Is this bad?
If so, should we only be making that call on the fence leader or just any one peer?
fenced_external() is meant to be called only from the node that did the fencing, although looking at the code I don't see why calling it from everywhere would cause a problem.
Looking back at the comments, I get the impression that some attempt is being made to correct or clean up the state of the cluster after the problems have begun. If you want any hope of sorting this out, *don't do anything* after the node is killed, except collecting the logs/status. Otherwise, we have no idea which data is related to the original problem and which is related to the attempt to resolve things.
The specific data I'd like to see is, after pacemaker/stonith have finished fencing the node, and have notified fenced (called fenced_external), without doing anything else, get the output of "fence_tool ls" and "fence_tool dump" from each of the nodes.
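For whoever reproduces this, the collection step can be sketched as below. This is an assumption-laden sketch, not part of the bug report: it assumes passwordless ssh and the node names used above, and DRY_RUN=1 (the default) only prints the commands instead of executing them:

```shell
#!/bin/sh
# Sketch: collect "fence_tool ls" and "fence_tool dump" from every node
# right after fencing completes and before any manual intervention.
# Node names are placeholders; set DRY_RUN=0 to actually run the commands.
NODES="v02-a v02-b v02-c v02-d"
DRY_RUN=${DRY_RUN:-1}

collect() {
    for n in $NODES; do
        for cmd in "fence_tool ls" "fence_tool dump"; do
            if [ "$DRY_RUN" = 1 ]; then
                echo "would run: ssh $n '$cmd' > $n-${cmd##* }.txt"
            else
                ssh "$n" "$cmd" > "$n-${cmd##* }.txt"
            fi
        done
    done
}

collect
```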
Vladislav - any chance you could reproduce and give Dave the data he requested in comment #17?
Ahm, if I knew how to do this without altering pacemaker code to satisfy the "without doing anything else" requirement, I'd try it on my new testing cluster.
Or am I missing something, and nothing needs to be touched in the code?
If so, then the dumps already attached should be close enough to what Dave requests.
I'm almost (99%) sure that I didn't do any manual intervention while gathering those dumps. I also have dlm_tool dump and gfs_control dump from all three surviving nodes, if they are needed.
The cluster where I originally saw the problem is already near production state (with the .pcmk controlds rebuilt from 3.0.17), and I cannot switch stacks there anymore.
Damn, I already rebuilt pacemaker without cman support and removed all cman bits from the live image I use to boot the cluster nodes (the testing one too) via PXE, so it would take some effort on my part to put it all back.
(In reply to comment #19)
> Ahm, If I knew how to do it without altering pacemaker code to satisfy "without
> doing anything else" statement, I'd try on my new testing cluster.
> Or I miss something and nothing is needed to be touched in code?
Correct. See the end of comment #17
The specific data I'd like to see is, after pacemaker/stonith have finished
fencing the node, and have notified fenced (called fenced_external), without
doing anything else, get the output of "fence_tool ls" and "fence_tool dump"
from each of the nodes.
> If yes, then dumps already attached should be close enough to what Dave
Close enough isn't always good enough, I'm afraid.
(In reply to comment #20)
> Close enough isn't always good enough I'm afraid.
I understand. Sorry, I can't provide anything more valuable right now; see the end of comment #19.
Someone else from the community has debugged this and provided a patch.
The fix will be included in the next update.
That patch solves this particular issue. Tested on a 16-node cluster.
pacemaker-1.1.5-1.fc14 has been submitted as an update for Fedora 14.
pacemaker-1.1.5-1.fc15 has been submitted as an update for Fedora 15.
The pacemaker-1.1.5-1.fc14 update:
* should fix your issue,
* was pushed to the Fedora 14 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing pacemaker-1.1.5-1.fc14'
as soon as you are able to.
Please go to the following url:
then log in and leave karma (feedback).
pacemaker-1.1.5-1.fc15 has been pushed to the Fedora 15 stable repository. If problems still persist, please make note of it in this bug report.
pacemaker-1.1.5-1.fc14 has been pushed to the Fedora 14 stable repository. If problems still persist, please make note of it in this bug report.