Created attachment 470166 [details]
Description of problem:
I'm evaluating a 4-node cman+pacemaker cluster. Because I need fail-over to keep working even when some nodes die, I'm testing how the cluster behaves when I reboot or IPMI-power-off one of the nodes.
I noticed that shortly after cman realizes the powered-off node is absent, all dlm-related operations get stuck on the other nodes forever. After the failed node boots again, the cluster begins a fencing war that can be stopped only by powering all cluster nodes off and back on simultaneously.
Further digging shows that right after the failed node is correctly fenced, fenced tries to join the 'fenced:default' cpg, fails with a 'join error: domain default exists' message, and then does 'cpg_leave fenced:default ...' and 'confchg for our leave'.
Version-Release number of selected component (if applicable):
lvm2-cluster-2.02.78-1.fc13 (built from fc15 srpm, results are the same with stock fc13 rpm)
Steps to Reproduce:
1. Setup and power on 4-node cman cluster (dlm+gfs2+clvm).
2. Power one node off via IPMI.
Actual results:
Shortly after the powered-off node is fenced, 'fence_tool ls' shows nothing and dlm-related resources are stuck. The cluster begins a fencing war after the fenced node reboots.
Expected results:
The fenced node rejoins the cluster after reboot; until then, cluster resources run on the remaining nodes.
I run clvmd and one gfs2 filesystem over dlm.
dlm communicates via sctp.
gfs2 filesystem is mounted by pacemaker.
corosync uses "udpu" transport.
I can't try 3.0.17 because of multicast problems with Cisco switches (3.0.17 has no udpu support).
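For reference, the relevant parts of this setup can be sketched roughly like this in cluster.conf (a minimal sketch: the cluster name and fence configuration are placeholders; only transport="udpu" and the dlm sctp protocol are the points of interest here):

```xml
<cluster name="testcluster" config_version="1">
  <!-- udpu transport for corosync, enabled via cman (cman >= 3.1.0) -->
  <cman transport="udpu"/>
  <!-- dlm_controld: use sctp instead of the default tcp -->
  <dlm protocol="sctp"/>
  <clusternodes>
    <clusternode name="v02-a" nodeid="1"/>
    <clusternode name="v02-b" nodeid="2"/>
    <clusternode name="v02-c" nodeid="3"/>
    <clusternode name="v02-d" nodeid="4"/>
  </clusternodes>
</cluster>
```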
Created attachment 470168 [details]
fence_tool dump from first of remaining nodes
Created attachment 470169 [details]
fence_tool dump from second of remaining nodes
Created attachment 470170 [details]
fence_tool dump from the last of remaining nodes
Forgot to mention: if I issue 'fence_tool join' on all remaining nodes shortly after 'fence_tool ls' becomes empty, everything works as it should.
You need to come at this from the opposite direction: assume nothing works and add things until something breaks. I.e., get rid of udpu and sctp, just start cman, and see how things work. If they work perfectly, then add or change one small thing at a time, test, and see when something breaks.
(don't add udpu or sctp until the very end when everything else works perfectly, since those are highly experimental features.)
does cman 3.1 have the udpu patches in it?
CMAN's support of enabling udpu in corosync is in 3.1.0, yes:
(In reply to comment #5)
Will try today.
> (don't add udpu or sctp until the very end when everything else works
> perfectly, since those are highly experimental features.)
Unfortunately, I hit http://marc.info/?l=openais&m=129050993731907&w=2 when I try multicast, so udpu is the only solution for me until I get rid of that assert (which is followed by a corosync stop on all nodes). I thought a C3750-X should be stable in all layer-2 use cases, but that seems not to be the case (unless Dejan was right in http://marc.info/?l=openais&m=129311034019165&w=2 and this is a corosync bug). I will try simply removing that assert and testing with mcast again.
And I'd say I was absolutely happy with udpu in corosync, until the .pcmk variants of dlm_controld and gfs_controld were removed in 3.1.0 and I was forced to switch to cman to stay in sync with upstream.
As a side note, there are no visible problems with sctp (compared to tcp): I see all the expected associations, connections are established, and everything just works.
Dave, could you please look at the dumps again? I suspect the code path I hit should not occur: fenced tries to join a cpg it has already joined. What could be the root cause here? The DLM clients (clvmd, gfs2)? dlm_controld? fenced itself? Meanwhile, all other parts of the cluster software work fine, just waiting for the fence domain, and if I immediately join the nodes to the fence domain by hand, everything remains stable (dlm, clvm, gfs2, pacemaker).
Anyway, I will try the following (with a forced power-off of a node at each step, after the cluster stabilizes):
1) Try cman without pacemaker and udpu; mount an iSCSI-exported GFS2 partition without clvm.
2) Rebuild corosync with that assert removed and try again if corosync fails.
3) Switch back to udpu if corosync still fails.
4) Remove gfs2, add clvm.
5) Add gfs2 on clvm lv.
6) Switch fencing handling back to pacemaker and run it.
All this will require some time (I need to rebuild the PXE live image for each step), so please stay tuned.
> Unfortunately, I hit http://marc.info/?l=openais&m=129050993731907&w=2 when I
> try multicast, so udpu is the only solution for me until I get rid of that
> assert (followed by corosync stop on all nodes)
That's due to rrp/sctp, which you also need to disable.
> Dave, could you please look again at dumps, I suspect that code path I hit
> should not occur - fenced tries to join already joined cpg. What can be a root
> of this?
If turning off udpu/rrp/sctp doesn't fix things, then it sounds like a problem with the way pacemaker/cman/fencing are being integrated rather than a specific bug.
Apart from udpu, you could use broadcast; it's much better tested. However, there is a bug in cman-preconfig (which I fixed yesterday) with respect to using transport="udpb", so you will have to enable it the old way:
<cman broadcast="yes" />
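In context, that attribute sits on the existing cman element in cluster.conf; roughly (a sketch, with placeholder name and version):

```xml
<cluster name="testcluster" config_version="2">
  <!-- old-style broadcast enable, avoiding the transport="udpb" bug
       in cman-preconfig mentioned above -->
  <cman broadcast="yes"/>
  <!-- clusternodes etc. unchanged -->
</cluster>
```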
Dave, you are right; this seems to be pacemaker-specific.
I tried almost every combination, and I was able to make the cluster stable in every case except when pacemaker runs.
I was able to mount a GFS2 partition on an iSCSI resource, then I tried a clustered LV on another iSCSI resource and mounted GFS2 from it, and at every step everything ran smoothly. No matter whether multicast, udpu, or broadcast, dlm over tcp or sctp, everything just works.
pacemaker is 1.1.4-0bc69fcea1d7c69f332e60d9da84538c08ac4f3c
Andrew, could you please take a look? I can provide any additional information on request. We can also move to mailing list if you prefer.
One more strange thing: after I start pacemaker and the fencing war begins, crm still shows:
Last updated: Mon Jan 10 12:00:01 2011
Current DC: v02-b - partition with quorum
4 Nodes configured, 4 expected votes
25 Resources configured.
Online: [ v02-b v02-c ]
OFFLINE: [ v02-a v02-d ]
stonith-v02-a (stonith:fence_ipmilan): Started v02-b
stonith-v02-c (stonith:fence_ipmilan): Started v02-b
stonith-v02-d (stonith:fence_ipmilan): Started v02-b
Note "partition with quorum" with only half of nodes running.
reassigning to pacemaker for Andrew to look at.
There is a big difference between being exposed by pacemaker and actually being a pacemaker bug.
It's hard to imagine how running "fence_tool join" would work around a bug in pacemaker.
Andrew, I didn't say it is a pacemaker bug, rather that it looks like a "pacemaker-specific" issue. And I simply need all of this to work ;)
I should correct myself a bit: "fence_tool join" does not seem to help every time. I definitely saw correct output from "cman_tool services" earlier, where all remaining nodes rejoined the fencing domain correctly. Now I see the following (I added one more node on an external VM because I need to go further with this project):
On the fencing master node (id 5, which didn't leave the fencing domain for some reason, probably because the result of the fencing operation didn't reach that node):
member count 4
victim count 1
victim now 6
master nodeid 5
wait state fencing
members 5 6 7 8 17
On the other nodes (7, 8, 17, after "fence_tool join"):
member count 4
victim count 1
victim now 0
master nodeid 0
wait state messages
And this is everywhere:
flags 0x00000004 kern_stop
change member 5 joined 1 remove 0 failed 0 seq 5,5
members 5 6 7 8 17
new change member 4 joined 0 remove 1 failed 1 seq 6,6
new status wait_messages 0 wait_condition 1 fencing
new members 5 7 8 17
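The split state above (the master stuck in "wait state fencing", the rejoined nodes in "wait state messages" with master nodeid 0) is easy to spot mechanically. A small sketch, assuming only the fence_tool ls output format shown above; the here-docs stand in for output collected from the real nodes:

```shell
#!/bin/sh
# Pull the "master nodeid" and "wait state" fields out of fence_tool ls
# output and flag disagreement between nodes.
parse() { awk '/^wait state/ {s=$3} /^master nodeid/ {m=$3} END {print m, s}'; }

node5=$(parse <<'EOF'
master nodeid 5
wait state fencing
EOF
)
node7=$(parse <<'EOF'
master nodeid 0
wait state messages
EOF
)

echo "node5: $node5"
echo "node7: $node7"
[ "$node5" = "$node7" ] || echo "fence domain state diverged"
```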
This behaviour could be related to whether the fencing master and the pacemaker stonith resource for the failed node reside on the same node or not; this time the stonith resource was running elsewhere.
OK, so what do I need to do next to get this fixed?
(In reply to comment #14)
> Andrew, I didn't say it is a pacemaker bug, but rather that it looks like a
> "pacemaker-specific" issue. And I simply need this all to work ;)
Not everyone appreciates the difference :-)
> This behaviour could be related to whether fencing master and pacemaker stonith
> resource for the failed node reside on the same node or not - this time stonith
> resource was run elsewhere.
I was wondering if this might be the case.
David, could you comment on this?
Is it possible to notify the "wrong" instance of fenced? If so, is this the sort of result you'd expect?
I had a closer look at the Pacemaker code; it seems that every surviving node calls
rc = fenced_external(target_copy);
Is this bad?
If so, should we only be making that call on the fence leader or just any one peer?
fenced_external() is meant to be called only from the node that did the fencing, although looking at the code I don't see why calling it from everywhere would cause a problem.
Looking back at the comments, I get the impression that some attempt is being made to correct or clean up the state of the cluster after the problems have begun. If you want any hope of sorting this out, *don't do anything* after the node is killed, except collecting the logs/status. Otherwise, we have no idea which data is related to the original problem and which is related to the attempt to resolve things.
The specific data I'd like to see is, after pacemaker/stonith have finished fencing the node, and have notified fenced (called fenced_external), without doing anything else, get the output of "fence_tool ls" and "fence_tool dump" from each of the nodes.
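For whoever reproduces this, the collection step can be sketched as below. This is an assumption-laden sketch, not part of the bug report: it assumes passwordless ssh and the node names used above, and DRY_RUN=1 (the default) only prints the commands instead of executing them:

```shell
#!/bin/sh
# Sketch: collect "fence_tool ls" and "fence_tool dump" from every node
# right after fencing completes and before any manual intervention.
# Node names are placeholders; set DRY_RUN=0 to actually run the commands.
NODES="v02-a v02-b v02-c v02-d"
DRY_RUN=${DRY_RUN:-1}

collect() {
    for n in $NODES; do
        for cmd in "fence_tool ls" "fence_tool dump"; do
            if [ "$DRY_RUN" = 1 ]; then
                echo "would run: ssh $n '$cmd' > $n-${cmd##* }.txt"
            else
                ssh "$n" "$cmd" > "$n-${cmd##* }.txt"
            fi
        done
    done
}

collect
```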
Vladislav - any chance you could reproduce and give Dave the data he requested in comment #17?
Ahm, if I knew how to do this without altering pacemaker code to satisfy the "without doing anything else" requirement, I'd try it on my new testing cluster.
Or am I missing something, and nothing needs to be touched in the code?
If so, then the dumps already attached should be close enough to what Dave requests.
I'm almost (99%) sure that I didn't do any manual intervention while gathering those dumps. I also have dlm_tool dump and gfs_control dump from all three surviving nodes, if they are needed.
The cluster where I originally saw the problem is already near production state (with the .pcmk controlds rebuilt from 3.0.17), and I cannot switch stacks there anymore.
Damn, I already rebuilt pacemaker without cman support and removed all cman bits from the live image I use to boot the cluster nodes (the testing one too) via PXE, so it would take some effort on my part to put it all back.
(In reply to comment #19)
> Ahm, If I knew how to do it without altering pacemaker code to satisfy "without
> doing anything else" statement, I'd try on my new testing cluster.
> Or I miss something and nothing is needed to be touched in code?
Correct. See the end of comment #17
The specific data I'd like to see is, after pacemaker/stonith have finished
fencing the node, and have notified fenced (called fenced_external), without
doing anything else, get the output of "fence_tool ls" and "fence_tool dump"
from each of the nodes.
> If yes, then dumps already attached should be close enough to what Dave
Close enough isn't always good enough, I'm afraid.
(In reply to comment #20)
> Close enough isn't always good enough I'm afraid.
I understand. Sorry, I can't provide anything more valuable right now; see the end of comment #19.
Someone else from the community has debugged this and provided a patch.
The fix will be included in the next update.
That patch solves this particular issue. Tested on a 16-node cluster.
pacemaker-1.1.5-1.fc14 has been submitted as an update for Fedora 14.
pacemaker-1.1.5-1.fc15 has been submitted as an update for Fedora 15.
The pacemaker-1.1.5-1.fc14 update:
* should fix your issue,
* was pushed to the Fedora 14 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing pacemaker-1.1.5-1.fc14'
as soon as you are able to.
Please go to the following url:
then log in and leave karma (feedback).
pacemaker-1.1.5-1.fc15 has been pushed to the Fedora 15 stable repository. If problems still persist, please make note of it in this bug report.
pacemaker-1.1.5-1.fc14 has been pushed to the Fedora 14 stable repository. If problems still persist, please make note of it in this bug report.