Bug 880249
| Summary: | Deleting Master/slave set results in node fence | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Jaroslav Kortus <jkortus> |
| Component: | pacemaker | Assignee: | Andrew Beekhof <abeekhof> |
| Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 6.4 | CC: | cluster-maint, djansa, dvossel, fdinitto, tlavigne |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | pacemaker-1.1.8-7.el6 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2013-02-21 09:51:27 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 893221 | | |
| Bug Blocks: | 768522, 895654 | | |
| Attachments: | | | |
Ok, we'll look into it. If you still have the cluster set up, could you run crm_report for the time when the test was run?

Created attachment 661533 [details]
crm_report output

The output may be incomplete due to https://bugzilla.redhat.com/show_bug.cgi?id=886153.

The logs are incomplete. We really need to see what is happening on node2; the logs only show node1. In this case node2 is the DC, so it will have all the pengine and transition information that gives us visibility into why the fencing operation occurred.

David managed to reproduce this, but it took him a few goes - it doesn't happen every time.

It's pretty clear from the output below (and I've confirmed by looking at the .dot file) that there is no ordering between the stop and demote operations. This can lead to the stop actions failing, for which the correct recovery is fencing. I should be able to fix this shortly. Definitely a blocker.

    # tools/crm_simulate -Sx ~/rhbz880249/pe-error-0.bz2 -D foo.dot

    Current cluster status:
    Online: [ 18node1 18node2 18node3 ]

     shoot1         (stonith:fence_xvm):    Started 18node1
     shoot2         (stonith:fence_xvm):    Started 18node2
     dummystateful  (ocf::pacemaker:Stateful ORPHANED):     Master [ 18node2 18node1 18node3 ]

    Transition Summary:
     * Demote  dummystateful  (Master -> Stopped 18node2)

    Executing cluster transition:
     * Resource action: dummystateful stop on 18node3
     * Resource action: dummystateful stop on 18node1
     * Resource action: dummystateful stop on 18node2
     * Resource action: dummystateful demote on 18node3
     * Resource action: dummystateful demote on 18node1
     * Resource action: dummystateful demote on 18node2
     * Pseudo action: all_stopped

    Revised cluster status:
    Online: [ 18node1 18node2 18node3 ]

     shoot1         (stonith:fence_xvm):    Started 18node1
     shoot2         (stonith:fence_xvm):    Started 18node2
     dummystateful  (ocf::pacemaker:Stateful ORPHANED):     Slave [ 18node2 18node1 18node3 ]
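Since the missing edge is easiest to see in the transition graph itself, here is one way to inspect it - a minimal sketch, assuming Graphviz is installed and reusing the foo.dot file name from the command above:

```
# Render the transition graph written by crm_simulate -D to an image.
dot -Tpng foo.dot -o foo.png

# Or list the ordering edges as plain text; in the broken graph there is
# no "demote -> stop" edge for dummystateful.
grep ' -> ' foo.dot
```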
(In reply to comment #5)
> Created attachment 661533 [details]
> crm_report output

How did you run crm_report for this? It only contains the details for one of the nodes.

A related patch has been committed upstream: https://github.com/beekhof/pacemaker/commit/c20ad90
with subject: High: PE: Bug rhbz#880249 - Ensure orphan masters are demoted before being stopped

A related patch has been committed upstream: https://github.com/beekhof/pacemaker/commit/19484a4
with subject: High: PE: Bug rhbz#880249 - Teach the PE how to recover masters into primitives

If a master/slave resource is replaced with a primitive before the old status entries are cleaned up, the PE needs to be able to get resources from the Master state to the Started state sanely. All good now.

Regression test added:

    +do_test bug-rh-880249 "Handle replacement of an m/s resource with a primitive"

    # tools/crm_simulate -Sx ~/rhbz880249/pe-error-1.bz2 -D foo.dot

    Current cluster status:
    Online: [ 18node1 18node2 18node3 ]

     shoot1         (stonith:fence_xvm):    Started 18node1
     shoot2         (stonith:fence_xvm):    Started 18node2
     dummystateful  (ocf::pacemaker:Stateful):      Master [ 18node2 18node1 18node3 ]

    Transition Summary:
     * Demote  dummystateful  (Master -> Started 18node2)
     * Restart dummystateful  (Master 18node3)
     * Move    dummystateful  (Started 18node2 -> 18node3)

    Executing cluster transition:
     * Resource action: dummystateful demote on 18node3
     * Resource action: dummystateful demote on 18node1
     * Resource action: dummystateful demote on 18node2
     * Resource action: dummystateful stop on 18node3
     * Resource action: dummystateful stop on 18node1
     * Resource action: dummystateful stop on 18node2
     * Pseudo action: all_stopped
     * Resource action: dummystateful start on 18node3

    Revised cluster status:
    Online: [ 18node1 18node2 18node3 ]

     shoot1         (stonith:fence_xvm):    Started 18node1
     shoot2         (stonith:fence_xvm):    Started 18node2
     dummystateful  (ocf::pacemaker:Stateful):      Started 18node3

Ad comment 8: I ran crm_report -f "<timespec>" --nodes "<space separated node names>". Supplied the ssh password when required.

(In reply to comment #12)
> Ad comment 8: I ran crm_report -f "<timespec>" --nodes "<space separated
> node names>". Supplied the ssh password when required.

I believe this is the command I ran:

    crm_report --cluster corosync --nodes '18node1 18node2 18node3' -f "2012-12-13 12:30:00"

Minus the cluster type, it is the same as yours. I do have ssh keys on all my nodes, so I'm never prompted for a password. That might help. We need to get this worked out for you though. Let me know if you need help debugging what's going on. I believe we have a few hours that overlap on irc.

-- Vossel
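As an aside, a minimal sketch of the passwordless ssh setup that lets crm_report collect from every node without prompting - the node names are taken from the command above, and the key path is an assumption:

```
# Generate a key once on the machine running crm_report, then copy it to
# each cluster node (adjust the path and user to taste).
ssh-keygen -t rsa -f ~/.ssh/id_rsa -N ""
for node in 18node1 18node2 18node3; do
    ssh-copy-id "root@$node"
done
```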
(In reply to comment #13)
> We need to get this worked out for you though.

Agreed. Jaroslav, are you running from within the cluster or from another machine?

The missing bits are due to bug 886151 (I've installed the dependency on the first node only; the rest did not visibly complain).

I'm still seeing the unwanted fencing behaviour with a plain dummystateful resource.

Scenario is as follows:
1. set up a 3-node pacemaker cluster
2. on node2: pcs resource create dummystateful ocf:pacemaker:Stateful; sleep 10; pcs resource delete dummystateful; sleep 60; pcs resource create dummystateful ocf:pacemaker:Stateful
3. node2 gets fenced in 1-2 minutes

pacemaker-1.1.8-7.el6.x86_64. Moving back to ASSIGNED, please kill this bug as well :).

Created attachment 676988 [details]
crm_report output of fence on dummystateful resource

crm_report collected during the test in comment 17.

(In reply to comment #17)
> I'm still seeing the unwanted fencing behaviour with a plain dummystateful
> resource.
>
> Scenario is as follows:
> 1. set up a 3-node pacemaker cluster
> 2. on node2: pcs resource create dummystateful ocf:pacemaker:Stateful; sleep
> 10; pcs resource delete dummystateful; sleep 60; pcs resource create
> dummystateful ocf:pacemaker:Stateful
> 3. node2 gets fenced in 1-2 minutes

The above commands don't make a master/slave resource; they create an instance of the Stateful resource that is treated like a normal resource (no promote/demote actions take place).

Looking at the crm_report, this looks very similar to the problem I experienced recently in issue 893221. Take a look at this comment: https://bugzilla.redhat.com/show_bug.cgi?id=893221#c3

My results show the stop action failing with nearly the exact same pcs commands you used. Removing the pcs call to clear the resource from the lrmd on deletion fixed this. Can you try the current upstream version of pcs, or any version of pcs with the patch given in issue 893221, to verify whether these two issues are related?

-- Vossel

Moving ON_QA. This bug will require pcs fixes that are already targeted for SNAP4, but otherwise there are no code changes for pacemaker.

I had this small script:

    #!/bin/bash
    pcs resource create dummystateful ocf:pacemaker:Stateful
    sleep 5
    pcs resource master MasterResource dummystateful
    sleep 5
    pcs resource delete MasterResource
    pcs resource create dummystateful ocf:pacemaker:Stateful
And after the last step finishes, about one minute later (while the newly created resource is still Stopped instead of Started), the following appears:

    Failed actions:
        dummystateful_demote_0 (node=marathon-03c2-node03, call=-1, rc=1, status=Timed Out): unknown error
        dummystateful_demote_0 (node=marathon-03c2-node02, call=-1, rc=1, status=Timed Out): unknown error
        dummystateful_demote_0 (node=marathon-03c2-node01, call=-1, rc=1, status=Timed Out): unknown error

Can you please confirm whether this is related to this bug or to bug 902459? The good news is that it's no longer fencing the node and the resource eventually starts (but it increases the failcount for the nodes).
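As an aside, a hedged sketch of inspecting and clearing the per-node failcount mentioned above, using stock pacemaker tools (the resource and node names are the ones from this report):

```
# Query the failcount for the resource on one node.
crm_failcount -G -r dummystateful -N marathon-03c2-node01

# Clear the accumulated failures once the underlying problem is fixed.
crm_resource -C -r dummystateful
```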
Created attachment 685178 [details]
crm_report for issue in comment 21

This appears related to https://bugzilla.redhat.com/show_bug.cgi?id=893221. In 893221, pcs was calling crm_resource -C -r immediately after deleting the resource from the cib. This causes some problems, which are outlined in this comment: https://bugzilla.redhat.com/show_bug.cgi?id=893221#c3

Issue 893221 fixed the problem for all resource types except for Master/Slave, which is what you are encountering here. Apparently there is a separate code path being used in pcs to delete Master/Slave resources compared to everything else.

To verify this issue with the current version of pcs, you can bypass the problem by making the changes in a CIB file, which avoids the 'crm_resource -C' call during the deletion. The script below should work.

    #!/bin/bash
    pcs resource create dummystateful ocf:pacemaker:Stateful
    sleep 5
    pcs resource master MasterResource dummystateful
    sleep 5
    pcs cluster cib cib_file.xml
    pcs -f cib_file.xml resource delete MasterResource
    pcs -f cib_file.xml resource create dummystateful ocf:pacemaker:Stateful
    pcs cluster push cib cib_file.xml

-- Vossel
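A note on why this workaround sidesteps the problem: because the delete and the re-create are batched into cib_file.xml and pushed in a single step, pcs never runs crm_resource -C against the live cluster in between. As an optional sanity check - the file name is the one used in the script above, and crm_verify is the stock pacemaker validator:

```
# Optionally validate the edited CIB before the final push step above.
crm_verify --xml-file cib_file.xml --verbose
```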
I'm no longer able to reproduce the issue. The issue in comment 21 was indeed the missing pcs patch (included in 0.9.26-10). Thank you for fixing this bug.

Marking as verified with:
pcs-0.9.26-10.el6.noarch
pacemaker-1.1.8-7.el6.x86_64

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-0375.html

Description of problem:
Deleting a master/slave resource results in one of the nodes (the elected master node for the set) being fenced.

Version-Release number of selected component (if applicable):

How reproducible:
100%

Steps to Reproduce (a scripted form of these steps appears at the end of this report):
1. pcs resource create dummystateful ocf:pacemaker:Stateful
2. pcs resource master MasterResource dummystateful
3. pcs resource delete MasterResource

Actual results:
* resource deleted
* failures in status report
* one of the nodes is fenced

Expected results:
* resource deleted as expected
* no node fenced

Additional info:
pcs status info at the time of the fence action:

    Failed actions:
        dummystateful_stop_0 (node=marathon-03c1-node01, call=33, rc=8, status=complete): master
        dummystateful_demote_0 (node=marathon-03c1-node03, call=23, rc=7, status=complete): not running
        dummystateful_demote_0 (node=marathon-03c1-node02, call=34, rc=7, status=complete): not running
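For convenience, the reproduction steps above as one script - a minimal sketch; the sleep calls are an assumption added to let each action settle and are not part of the original steps:

```
#!/bin/bash
# Create a stateful resource, promote it into a master/slave set, then
# delete the set - the sequence that triggered the fence in this report.
pcs resource create dummystateful ocf:pacemaker:Stateful
sleep 5
pcs resource master MasterResource dummystateful
sleep 5
pcs resource delete MasterResource
```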