Bug 801355
Summary: | cman+pacemaker leads to double fences
---|---
Product: | Red Hat Enterprise Linux 6
Component: | pacemaker
Version: | 6.3
Status: | CLOSED ERRATA
Severity: | medium
Priority: | high
Reporter: | Jaroslav Kortus <jkortus>
Assignee: | David Vossel <dvossel>
QA Contact: | Cluster QE <mspqa-list>
CC: | abeekhof, cluster-maint, dvossel, fdinitto, mnovacek
Target Milestone: | rc
Target Release: | ---
Hardware: | Unspecified
OS: | Unspecified
Fixed In Version: | pacemaker-1.1.8-1.el6
Doc Type: | Bug Fix
Last Closed: | 2013-02-21 09:51:03 UTC

Doc Text:
Cause: Multiple parts of the system may notice a node failure at slightly different times.
Consequence: If more than one component requests that the node be fenced, then the fencing component will do so multiple times.
Fix: Merge identical requests from different clients if the first is still in progress.
Result: The node is fenced only once.
Description
Jaroslav Kortus
2012-03-08 10:58:00 UTC
Created attachment 568590: crm_report output

Output of crm_report after node03 was fenced.
(In reply to comment #1)
> Created attachment 568590: crm_report output
>
> Output of crm_report after node03 was fenced.

I set up your environment and can see the problem. I'm working on tracking it down.

I was able to track down the cause of this issue last week. Here is what I found.

Both cman and crmd are alerted to the loss of a node's membership at the same time. This results in both processes scheduling the node to be fenced independently of one another. Since the remote fencing operation sent to stonith has an asynchronous response, neither cman nor crmd can tell that the other has scheduled a fence operation against the same node at nearly the same time.

To be clear, this is not the result of crmd trying to fence a node because it detected the node was down after someone else fenced it earlier. It is the result of both crmd and cman scheduling the fencing of the node at the same time, in response to the same event within the cluster.

I have a patch that resolves this specific issue (a rough sketch of the merge idea appears below), but now that I understand how this situation occurs I am not satisfied that it will prevent other similar situations from occurring. The correct solution is still being discussed.

NACK for 6.3; better to be strictly correct and keep the data safe. We'd likely introduce more corner cases than we solve. Rather than rush this in, we'll take our time for 6.4 and make sure we're not leaving any holes.

Created attachment 654205: /var/log/messages snippet

It shows that both fenced (line 8) and stonith-ng (line 9) fence the node.
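The Doc Text and the comments above describe the eventual fix as merging identical fence requests while the first is still in progress. The following is a minimal sketch of that idea only, using hypothetical names (Fencer, FenceRequest, power_cycle); it is not the actual stonith-ng code, which lives in pacemaker's fencing daemon.

```python
# merge_fence.py -- illustrative sketch only, not the stonith-ng implementation.
# Names (Fencer, FenceRequest, power_cycle) are hypothetical.

class FenceRequest:
    def __init__(self, target, client):
        self.target = target
        self.clients = [client]          # every client waiting on this operation


class Fencer:
    def __init__(self, power_cycle):
        self.in_flight = {}              # target node -> FenceRequest
        self.power_cycle = power_cycle   # callable that actually reboots the node

    def request_fence(self, target, client):
        op = self.in_flight.get(target)
        if op is not None:
            # Identical request while the first is still in progress:
            # merge it instead of fencing the node a second time.
            op.clients.append(client)
            return op
        op = FenceRequest(target, client)
        self.in_flight[target] = op
        self.power_cycle(target)         # only the first request powers the node off
        return op

    def complete(self, target, result):
        # Report the single result back to every merged client.
        op = self.in_flight.pop(target)
        for client in op.clients:
            print("notify %s: fencing of %s -> %s" % (client, target, result))


if __name__ == "__main__":
    fencer = Fencer(power_cycle=lambda node: print("rebooting %s (once)" % node))
    # cman (via fence_pcmk) and crmd notice the membership loss at nearly the same time:
    fencer.request_fence("c3-node03", "fenced/fence_pcmk")
    fencer.request_fence("c3-node03", "crmd")
    fencer.complete("c3-node03", "OK")
```

Running this prints a single "rebooting c3-node03 (once)" line followed by one completion notice per merged client; without the merge on the target key, each caller would trigger its own reboot, which is the double fence reported in this bug.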
buggy version:

node01$ rpm -q pacemaker cman
pacemaker-1.1.7-6.el6.x86_64
cman-3.0.12.1-32.el6.x86_64

node01$ crm status
============
Last updated: Mon Nov 26 05:53:06 2012
Last change: Mon Nov 26 05:40:06 2012 via cibadmin on c3-node01
Stack: cman
Current DC: c3-node01 - partition with quorum
Version: 1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14
3 Nodes configured, unknown expected votes
1 Resources configured.
============

Online: [ c3-node01 c3-node02 c3-node03 ]

virt-fencing (stonith:fence_xvm): Started c3-node01

node01$ fence_node c3-node03
fence c3-node03 success

...and on the virtual host I see that the node is fenced multiple times:

host$ fence_virtd -F -d 2
...
Got virbr0 for interface
Request 2 seqno 925456 domain c3-node03
Plain TCP request
Request 2 seqno 925456 src 192.168.122.202 target c3-node03
Rebooting domain c3-node03...
[REBOOT] Calling virDomainDestroy(0xef1270)
Domain has been shut off
Calling virDomainCreateLinux()...
Request 2 seqno 88574 domain c3-node03
Plain TCP request
Request 2 seqno 88574 src 192.168.122.202 target c3-node03
Rebooting domain c3-node03...
[REBOOT] Calling virDomainDestroy(0xef13a0)
Domain has been shut off
Calling virDomainCreateLinux()...
Request 2 seqno 195177 domain c3-node03
Plain TCP request
Request 2 seqno 195177 src 192.168.122.202 target c3-node03
Rebooting domain c3-node03...
[REBOOT] Calling virDomainDestroy(0xef15a0)
Domain has been shut off
Calling virDomainCreateLinux()...

----- patched version

node01$ rpm -q pacemaker cman
pacemaker-1.1.8-4.el6.x86_64
cman-3.0.12.1-46.el6.x86_64

node01$ crm_mon -1
Last updated: Tue Nov 27 04:37:52 2012
Last change: Mon Nov 26 08:54:17 2012 via crmd on c3-node02
Stack: cman
Current DC: c3-node02 - partition with quorum
Version: 1.1.8-4.el6-394e906
3 Nodes configured, unknown expected votes
1 Resources configured.

Online: [ c3-node01 c3-node02 c3-node03 ]

virt-fencing (stonith:fence_xvm): Started c3-node02

node01$ fence_node c3-node03
fence c3-node03 success

...and on the virtual host I see that the fence happens only once:

host$ fence_virtd -F -d 2
...
Got virbr0 for interface
Request 2 seqno 463322 domain c3-node03
Plain TCP request
Request 2 seqno 463322 src 192.168.122.202 target c3-node03
Rebooting domain c3-node03...
[REBOOT] Calling virDomainDestroy(0x1939270)
Domain has been shut off
Calling virDomainCreateLinux()...

(In reply to comment #9)
> Created attachment 654205: /var/log/messages snippet
>
> It shows that both fenced (line 8) and stonith-ng (line 9) fence the node.

Not really; fenced is using the fence_pcmk device. All this does is tell stonith-ng that the node needed to be shot. So the node is only shot once; you're just seeing logs from two different subsystems about the same event.

Created attachment 657504: /var/log/messages snippet

I somehow got confused and previously attached a log from the corrected version of pacemaker. Attachment 657504 is the one that shows the incorrect behaviour.

David, it looks like we're actually fencing the node three times.

That was most likely due to the fence_node command. Previously I was testing with something that shuts the node down (pkill -9 corosync, halt -fin, panic, or similar). Now it looks like fence_node lets the cluster know about the fencing. IIRC this was not done before (or at least I could see fence_node shutting down the node and then pure cman fencing it once more after it realized it had lost the token).

(In reply to comment #14)
> David, it looks like we're actually fencing the node three times.

I'm confused and just want to clarify this: what version of pacemaker is that log (657504) from? I would not expect that to be the behavior of the 1.1.8-4 release.

-- Vossel
/var/log/messages snippet
This is behaviour encountered with paceameker-1.1.7-6.
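As an aside, the check performed in the verification comment above (watching fence_virtd -F -d 2 and counting how many times the domain is rebooted) can be scripted against a saved copy of the output. The helper below is illustrative only, not part of the cluster packages, and assumes the plain debug output format shown above.

```python
#!/usr/bin/env python
# count_reboots.py -- illustrative helper, not part of the cluster packages.
# Counts "Rebooting domain <name>..." lines in captured `fence_virtd -F -d 2`
# output, so a double fence shows up as a count greater than one per test run.
import re
import sys
from collections import Counter

counts = Counter()
for line in sys.stdin:
    m = re.match(r"Rebooting domain (\S+?)\.*\s*$", line.strip())
    if m:
        counts[m.group(1)] += 1

for domain, n in sorted(counts.items()):
    marker = "" if n == 1 else "  <-- fenced more than once"
    print("%s: %d reboot(s)%s" % (domain, n, marker))
```

Feeding the buggy-version output above through it would report c3-node03: 3 reboot(s); the patched-version output reports 1.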
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-0375.html