Bug 893221
Summary: | pcs should not delete a resource from the lrm during resource removal | | |
---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | Chris Feist <cfeist> |
Component: | pcs | Assignee: | Chris Feist <cfeist> |
Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> |
Severity: | high | Docs Contact: | |
Priority: | high | | |
Version: | 6.4 | CC: | abeekhof, cluster-maint, dvossel, fdanapfe, jkortus, lhh, rsteiger, tlavigne |
Target Milestone: | rc | | |
Target Release: | --- | | |
Hardware: | Unspecified | | |
OS: | Unspecified | | |
Whiteboard: | | | |
Fixed In Version: | pcs-0.9.26-10.el6 | Doc Type: | Bug Fix |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2013-02-21 09:49:54 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Bug Depends On: | | | |
Bug Blocks: | 768522, 880249, 902691 | | |
Description
Chris Feist
2013-01-08 23:19:58 UTC
This can cause some rather confusing problems with pacemaker. Depending on the timing, this can make it look like the stop action for a resource fails after it is removed from the CIB. After the stop action is reported to have failed, pacemaker treats that resource as unmanaged and bad things happen.

Here's what I think is going on...

The source of the problem is that crm_resource -C is removing the resource entry from both the lrmd and crmd cache after the cib entry has been removed but before the deleted resource's stop action completes. When the successful stop action for the resource is reported back to the crmd, it gets ignored because it doesn't recognize the resource exists. Because of this the cib update containing the successful stop operation never goes out. Eventually the stop action's timer in the crmd pops and the action is treated as a failure due to timeout.

At some point pacemaker needs to address this condition, but in reality pcs should not be both deleting the resource from the cib and the lrmd. Pacemaker should clean everything up properly when the resource is deleted from the cib.

I can confirm this patch to pcs resolves the issue I have outlined above. Here are my test results.

-------- WITHOUT PCS PATCH (A FAILS ON STOP ACTION)---------------

    # rm -f /var/lib/pacemaker/cib/*
    # rm -f /var/lib/pacemaker/pengine/*
    # service corosync start
    # service pacemaker start
    # sleep 60
    # pcs property set stonith-enabled=false
    # pcs property set no-quorum-policy=ignore
    # pcs resource create A ocf:pacemaker:Dummy op monitor interval=10s
    # sleep 5
    # pcs resource delete A
    Deleting Resource - A
    # sleep 120
    # crm_mon -1
    Last updated: Thu Jan 10 18:47:33 2013
    Last change: Thu Jan 10 18:45:33 2013 via crmd on 18builder
    Stack: corosync
    Current DC: 18builder (4) - partition WITHOUT quorum
    Version: 1.1.8-6f6a7fd
    4 Nodes configured, unknown expected votes
    0 Resources configured.
    Online: [ 18builder ]
    OFFLINE: [ 18node1 18node2 18node3 ]

    A (ocf::pacemaker:Dummy): ORPHANED Started 18builder (unmanaged) FAILED

    Failed actions:
        A_stop_0 (node=18builder, call=-1, rc=1, status=Timed Out): unknown error

---------- WITH PCS PATCH (NO FAILURES) -----------------

    # rm -f /var/lib/pacemaker/cib/*
    # rm -f /var/lib/pacemaker/pengine/*
    # service corosync start
    # service pacemaker start
    # sleep 60
    # pcs property set stonith-enabled=false
    # pcs property set no-quorum-policy=ignore
    # pcs resource create A ocf:pacemaker:Dummy op monitor interval=10s
    # sleep 5
    # pcs resource delete A
    Deleting Resource - A
    # crm_mon -1
    Last updated: Thu Jan 10 18:52:54 2013
    Last change: Thu Jan 10 18:50:54 2013 via cibadmin on 18builder
    Stack: corosync
    Current DC: 18builder (4) - partition WITHOUT quorum
    Version: 1.1.8-6f6a7fd
    4 Nodes configured, unknown expected votes
    0 Resources configured.

    Online: [ 18builder ]
    OFFLINE: [ 18node1 18node2 18node3 ]

Before fix:

    [root@ask-03 ~]# rm -f /var/lib/pacemaker/cib/*
    [root@ask-03 ~]# rm -f /var/lib/pacemaker/pengine/*
    [root@ask-03 ~]# service corosync start
    Starting Corosync Cluster Engine (corosync): [ OK ]
    [root@ask-03 ~]# service pacemaker start
    ...
    [root@ask-03 ~]# pcs property set stonith-enabled=false
    [root@ask-03 ~]# pcs property set no-quorum-policy=ignore
    [root@ask-03 ~]# pcs resource create A ocf:pacemaker:Dummy op monitor interval=10s
    [root@ask-03 ~]# sleep 5
    [root@ask-03 ~]# pcs resource delete A
    Deleting Resource - A
    [root@ask-03 ~]# sleep 120
    [root@ask-03 ~]# crm_mon -1
    Last updated: Tue Jan 15 15:40:35 2013
    Last change: Tue Jan 15 15:38:31 2013 via crmd on ask-03
    Stack: classic openais (with plugin)
    Current DC: ask-03 - partition WITHOUT quorum
    Version: 1.1.8-7.el6-394e906
    1 Nodes configured, 2 expected votes
    0 Resources configured.
    Online: [ ask-03 ]

    A (ocf::pacemaker:Dummy): ORPHANED Started ask-03 (unmanaged) FAILED

    Failed actions:
        A_stop_0 (node=ask-03, call=-1, rc=1, status=Timed Out): unknown error

After Fix:

Make sure the cluster is shut down ('killall crmd' was necessary before 'pcs cluster stop').

    [root@ask-03 ~]# rpm -q pcs
    pcs-0.9.26-9.el6.noarch
    [root@ask-03 ~]# rm -f /var/lib/pacemaker/cib/*
    [root@ask-03 ~]# rm -f /var/lib/pacemaker/pengine/*
    [root@ask-03 ~]# service corosync start
    Starting Corosync Cluster Engine (corosync): [ OK ]
    [root@ask-03 ~]# service pacemaker start
    ...
    [root@ask-03 ~]# pcs property set stonith-enabled=false
    [root@ask-03 ~]# pcs property set no-quorum-policy=ignore
    [root@ask-03 ~]# pcs resource create A ocf:pacemaker:Dummy op monitor interval=10s
    [root@ask-03 ~]# sleep 5
    [root@ask-03 ~]# pcs resource delete A
    Deleting Resource - A
    [root@ask-03 ~]# sleep 120
    [root@ask-03 ~]#
    Last updated: Tue Jan 15 15:52:07 2013
    Last change: Tue Jan 15 15:50:06 2013 via cibadmin on ask-03
    Stack: classic openais (with plugin)
    Current DC: ask-03 - partition WITHOUT quorum
    Version: 1.1.8-7.el6-394e906
    1 Nodes configured, 2 expected votes
    0 Resources configured.

    Online: [ ask-03 ]

    [root@ask-03 ~]#

One additional issue was found with the previous patch; adding a patch which fixes the problem.

https://github.com/feist/pcs/commit/2011cf7446f10a172048d105122ef96b839014aa

To test the new fix (basically the same as above, but with a master/slave resource):

Before Fix:

    [root@ask-03 ~]# rm -f /var/lib/pacemaker/cib/*
    [root@ask-03 ~]# rm -f /var/lib/pacemaker/pengine/*
    [root@ask-03 ~]# service corosync start
    Starting Corosync Cluster Engine (corosync): [ OK ]
    [root@ask-03 ~]# service pacemaker start
    ...
    [root@ask-03 ~]# pcs property set stonith-enabled=false
    [root@ask-03 ~]# pcs property set no-quorum-policy=ignore
    [root@ask-03 ~]# pcs resource create A ocf:pacemaker:Dummy op monitor interval=10s
    [root@ask-03 ~]# pcs resource master ma A
    [root@ask-03 ~]# sleep 5
    [root@ask-03 ~]# pcs resource delete ma
    [root@ask-03 ~]# sleep 120
    [root@ask-03 ~]# crm_mon -1
    Last updated: Tue Jan 22 12:41:46 2013
    Last change: Tue Jan 22 12:39:43 2013 via crmd on ask-03
    Stack: classic openais (with plugin)
    Current DC: ask-03 - partition WITHOUT quorum
    Version: 1.1.8-7.el6-394e906
    1 Nodes configured, 2 expected votes
    6 Resources configured.

    Online: [ ask-03 ]

    Resource Group: GROUP1
        Q (ocf::pacemaker:Dummy): Started ask-03
        R (ocf::pacemaker:Dummy): Started ask-03
        S (ocf::pacemaker:Dummy): Started ask-03
    Resource Group: GROUP2
        T (ocf::pacemaker:Dummy): Started ask-03
        U (ocf::pacemaker:Dummy): Started ask-03
        V (ocf::pacemaker:Dummy): Started ask-03
    A (ocf::pacemaker:Dummy): ORPHANED Started ask-03 (unmanaged) FAILED

(The important line contains the "(unmanaged) FAILED")

After fix:

    [root@ask-03 ~]# rpm -q pcs
    pcs-0.9.26-10.el6.noarch
    [root@ask-03 ~]# pcs resource create B ocf:pacemaker:Dummy op monitor interval=10s
    [root@ask-03 ~]# pcs resource master mb B
    [root@ask-03 ~]# sleep 5
    [root@ask-03 ~]# pcs resource delete mb
    [root@ask-03 ~]# sleep 120
    [root@ask-03 ~]# crm_mon -1
    [root@ask-03 ~]# crm_mon -1
    Last updated: Tue Jan 22 12:47:12 2013
    Last change: Tue Jan 22 12:45:09 2013 via cibadmin on ask-03
    Stack: classic openais (with plugin)
    Current DC: ask-03 - partition WITHOUT quorum
    Version: 1.1.8-7.el6-394e906
    1 Nodes configured, 2 expected votes
    6 Resources configured.
    Online: [ ask-03 ]

    Resource Group: GROUP1
        Q (ocf::pacemaker:Dummy): Started ask-03
        R (ocf::pacemaker:Dummy): Started ask-03
        S (ocf::pacemaker:Dummy): Started ask-03
    Resource Group: GROUP2
        T (ocf::pacemaker:Dummy): Started ask-03
        U (ocf::pacemaker:Dummy): Started ask-03
        V (ocf::pacemaker:Dummy): Started ask-03
    A (ocf::pacemaker:Dummy): ORPHANED Started ask-03 (unmanaged) FAILED

(Only A, left over from the "before fix" run, shows as unmanaged and failed; B/mb is gone.)

Seems fixed in pcs-0.9.26-10.el6.noarch. Operations on resources (both normal and master/slave) are now working as expected, allowing the resource to be recreated without any failures. Error messages from syslog are also gone (especially those mentioning a bad resource UUID). See bug 880249 for more details.

Marking as verified with pcs-0.9.26-10.el6.noarch.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2013-0369.html
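
The verification in this report comes down to reading `crm_mon -1` output and looking for a resource line ending in "(unmanaged) FAILED". The following is a minimal sketch of that check in Python, assuming the output looks like the listings pasted in this report. The names `RESOURCE_LINE` and `find_unmanaged_failed` are hypothetical helpers for illustration and are not part of pcs or pacemaker.

```python
import re

# Assumption: resource status lines look like the ones pasted in this report, e.g.
#   A (ocf::pacemaker:Dummy): ORPHANED Started ask-03 (unmanaged) FAILED
RESOURCE_LINE = re.compile(
    r"^\s*(?P<name>\S+)\s+\((?P<agent>[^)]+)\):\s+(?P<state>.*)$"
)


def find_unmanaged_failed(crm_mon_output):
    """Return the names of resources whose status ends in '(unmanaged) FAILED'.

    The argument is the text printed by 'crm_mon -1'; this helper only
    parses text and never talks to the cluster itself.
    """
    failed = []
    for line in crm_mon_output.splitlines():
        match = RESOURCE_LINE.match(line)
        if match is None:
            continue  # header, node list, group title, or blank line
        if match.group("state").rstrip().endswith("(unmanaged) FAILED"):
            failed.append(match.group("name"))
    return failed
```

Fed the "before fix" output above, the helper would report resource A; fed output with no orphaned resource, it returns an empty list. Lines under "Failed actions:" are deliberately not counted, since they describe operations rather than resources.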