Description of problem:
cman with fence_pcmk + pacemaker (in this case with virt fencing) fences the node twice if it fails. This is most probably due to cman requesting the fence and pacemaker making its own fencing decision right after that.

Version-Release number of selected component (if applicable):
pacemaker-1.1.7-2.el6.x86_64
cman-3.0.12.1-27.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Set up a cman+pacemaker cluster with fence_pcmk and (for example) virt fencing.
2. ssh to one node and issue halt -fin.
3. Wait for the node to get fenced; observe that it gets fenced twice in a row.

Actual results:
Node fenced twice.

Expected results:
One fence.

Additional info:

Config snippets:

cluster.conf:
  <clusternode name="node01" nodeid="1" votes="1">
    <fence>
      <method name="pcmk-redirect">
        <device name="pcmk" port="node01"/>
      </method>
    </fence>
  </clusternode>
  <fencedevices>
    <fencedevice agent="fence_pcmk" name="pcmk"/>
  </fencedevices>

crm:
  primitive virt-fencing stonith:fence_xvm \
    params pcmk_host_check="static-list" pcmk_host_list="node01,node02,node03" action="reboot" debug="1"

You can see the double fence via virsh console or by manually running fence_virtd (fence_virtd -F -f /etc/fence_virt.conf -d 2):

Request 2 seqno 367654 domain node03
Plain TCP request
Request 2 seqno 367654 src 192.168.100.101 target node03
Rebooting domain node03...
[REBOOT] Calling virDomainDestroy(0x233cc30)
Domain has been shut off
Calling virDomainCreateLinux()...
Request 2 seqno 623812 domain node03
Plain TCP request
Request 2 seqno 623812 src 192.168.100.101 target node03
Rebooting domain node03...
[REBOOT] Calling virDomainDestroy(0x233cd60)
Domain has been shut off
Calling virDomainCreateLinux()...
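For clarity, the reproduction boils down to roughly the following sequence (a sketch only; the node names and the fence_virt.conf path are taken from the snippets above, and the prompts indicate where each command runs):

  # on the virtualization host: watch fencing requests in the foreground
  host$ fence_virtd -F -f /etc/fence_virt.conf -d 2

  # on a surviving cluster node: confirm the stack is healthy before the test
  node01$ cman_tool nodes
  node01$ crm_mon -1

  # on the victim node: drop it hard, without telling the cluster
  node03$ halt -fin

  # in the fence_virtd output, one failure should produce exactly one
  # "Rebooting domain node03..." block; with this bug it appears twice in a row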
Created attachment 568590 [details]
crm_report output

Output of crm_report after node03 was fenced.
(In reply to comment #1)
> Created attachment 568590 [details]
> crm_report output
>
> output of crm_report after node03 was fenced

I set up your environment and can see the problem. I'm working on tracking it down.
I was able to track down the cause of this issue last week. Here is what I found.

Both cman and crmd are alerted to the loss of membership of a node at the same time, and each of them independently schedules the node to be fenced. Because the remote fence operation handed to stonith completes asynchronously, neither cman nor crmd can tell that the other has already scheduled a fence operation for the same node at nearly the same time.

To be clear, this is not the crmd trying to fence a node because it detected the node is down as a result of someone else fencing it earlier. Both crmd and cman schedule the fencing of the node at the same time, in response to the same cluster event.

I have a patch that resolves this specific issue, but now that I understand how this situation occurs I am not satisfied that it will prevent other, similar situations from occurring. The correct solution is still being discussed.
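For anyone trying to reason about the two paths described above, they can also be exercised by hand (a rough illustration only; stonith_admin option names can differ between pacemaker versions, and node03 is simply the node name used earlier in this report):

  # path 1: cman/fenced -> fence_pcmk -> stonith-ng
  node01$ fence_node node03

  # path 2: pacemaker -> stonith-ng directly, roughly what crmd requests
  # when it sees a peer vanish from membership
  node01$ stonith_admin --reboot node03

  # when a node actually dies, both paths fire from the same membership event,
  # which is how two independent fence requests reach fence_virtd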
NACK for 6.3, better to be strictly correct and keep the data safe. We'd likely introduce more corner cases than we solve. Rather than rush this in, we'll take our time for 6.4 and make sure we're not leaving any holes.
Created attachment 654205 [details]
/var/log/messages snippet

It shows that both fenced (line 8) and stonith-ng (line 9) fence the node.
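A loose way to spot the same pattern in /var/log/messages yourself (a sketch; the exact log wording differs between fenced and stonith-ng versions, so the grep is deliberately broad, and the node name should be substituted for your own):

  node01$ grep -i fenc /var/log/messages | grep -i c3-node03
  # duplicate requests from fenced and stonith-ng for the same event show up
  # as fencing entries with nearly identical timestamps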
buggy version:

node01$ rpm -q pacemaker cman
pacemaker-1.1.7-6.el6.x86_64
cman-3.0.12.1-32.el6.x86_64

node01$ crm status
============
Last updated: Mon Nov 26 05:53:06 2012
Last change: Mon Nov 26 05:40:06 2012 via cibadmin on c3-node01
Stack: cman
Current DC: c3-node01 - partition with quorum
Version: 1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14
3 Nodes configured, unknown expected votes
1 Resources configured.
============

Online: [ c3-node01 c3-node02 c3-node03 ]

virt-fencing (stonith:fence_xvm): Started c3-node01

node01$ fence_node c3-node03
fence c3-node03 success

...and on the virtual host I see that the node is fenced multiple times:

host$ fence_virtd -F -d 2
...
Got virbr0 for interface
Request 2 seqno 925456 domain c3-node03
Plain TCP request
Request 2 seqno 925456 src 192.168.122.202 target c3-node03
Rebooting domain c3-node03...
[REBOOT] Calling virDomainDestroy(0xef1270)
Domain has been shut off
Calling virDomainCreateLinux()...
Request 2 seqno 88574 domain c3-node03
Plain TCP request
Request 2 seqno 88574 src 192.168.122.202 target c3-node03
Rebooting domain c3-node03...
[REBOOT] Calling virDomainDestroy(0xef13a0)
Domain has been shut off
Calling virDomainCreateLinux()...
Request 2 seqno 195177 domain c3-node03
Plain TCP request
Request 2 seqno 195177 src 192.168.122.202 target c3-node03
Rebooting domain c3-node03...
[REBOOT] Calling virDomainDestroy(0xef15a0)
Domain has been shut off
Calling virDomainCreateLinux()...

-----

patched version:

node01$ rpm -q pacemaker cman
pacemaker-1.1.8-4.el6.x86_64
cman-3.0.12.1-46.el6.x86_64

node01$ crm_mon -1
Last updated: Tue Nov 27 04:37:52 2012
Last change: Mon Nov 26 08:54:17 2012 via crmd on c3-node02
Stack: cman
Current DC: c3-node02 - partition with quorum
Version: 1.1.8-4.el6-394e906
3 Nodes configured, unknown expected votes
1 Resources configured.

Online: [ c3-node01 c3-node02 c3-node03 ]

virt-fencing (stonith:fence_xvm): Started c3-node02

node01$ fence_node c3-node03
fence c3-node03 success

...and on the virtual host I see that the fence happens only once:

host$ fence_virtd -F -d 2
...
Got virbr0 for interface
Request 2 seqno 463322 domain c3-node03
Plain TCP request
Request 2 seqno 463322 src 192.168.122.202 target c3-node03
Rebooting domain c3-node03...
[REBOOT] Calling virDomainDestroy(0x1939270)
Domain has been shut off
Calling virDomainCreateLinux()...
(In reply to comment #9)
> Created attachment 654205 [details]
> /var/log/messages snippet
>
> It shows that both fenced (line 8) and stonith-ng (line 9) fence the node.

Not really; fenced is using the fence_pcmk device, and all that device does is tell stonith-ng that the node needs to be shot. So the node is only shot once; you're just seeing logs from two different subsystems about the same event.
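One way to tell the two situations apart is to count the reboot requests that actually reach fence_virtd, rather than reading the cluster logs (a sketch based on the debug output shown earlier in this report; the /tmp/fence_virtd.log path is just an example):

  # capture fence_virtd's debug output during a single test run
  host$ fence_virtd -F -d 2 | tee /tmp/fence_virtd.log

  # then count how many reboots were actually executed for the victim
  host$ grep -c 'Rebooting domain c3-node03' /tmp/fence_virtd.log
  # 1 = the node was shot once; 2 or more = genuine duplicate fencing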
Created attachment 657504 [details]
/var/log/messages snippet
I somehow got confused and previously attached a log from the corrected version of pacemaker. Attachment 657504 is the one that shows the incorrect behaviour.
David, looks like we're actually fencing the node three times.
That was most likely due to the fence_node command. Previously I was testing with something that brings the node down directly (pkill -9 corosync, halt -fin, a panic or similar). Now it looks like fence_node lets the cluster know about the fencing. IIRC this was not done before (or at least I could see fence_node shutting the node down, and then plain cman fenced it once more after it realized it had lost the token).
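For reference, the two test styles discussed here differ in who initiates the fence (a rough summary; the sysrq line is just one possible way to simulate a panic and assumes sysrq is enabled):

  # administrator-initiated: fence_node reports the result back to the cluster,
  # so the membership loss and the fence are tied together
  node01$ fence_node c3-node03

  # failure-style tests: the node simply disappears and the cluster must notice
  # the lost membership on its own (run on the victim node itself)
  node03$ pkill -9 corosync
  node03$ halt -fin
  node03$ echo c > /proc/sysrq-trigger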
(In reply to comment #14)
> David, looks like we're actually fencing the node three times.

I'm confused and just want to clarify this. What version of pacemaker is that log (657504) from? I would not expect that to be the behavior of the 1.1.8-4 release.

-- Vossel
Comment on attachment 657504 [details]
/var/log/messages snippet

This is the behaviour encountered with pacemaker-1.1.7-6.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-0375.html