Bug 893221

Summary:	pcs should not delete a resource from the lrm during resource removal
Product:	Red Hat Enterprise Linux 6	Reporter:	Chris Feist <cfeist>
Component:	pcs	Assignee:	Chris Feist <cfeist>
Status:	CLOSED ERRATA	QA Contact:	Cluster QE <mspqa-list>
Severity:	high	Docs Contact:
Priority:	high
Version:	6.4	CC:	abeekhof, cluster-maint, dvossel, fdanapfe, jkortus, lhh, rsteiger, tlavigne
Target Milestone:	rc
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	pcs-0.9.26-10.el6	Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2013-02-21 09:49:54 UTC	Type:	Bug
Bug Blocks:	768522, 880249, 902691

Description Chris Feist 2013-01-08 23:19:58 UTC

pcs should not delete a resource from the lrm during resource removal; it should allow pacemaker to clean up the resources on its own.

If a resource is still running under pacemaker and pcs removes it from the lrm out from underneath pacemaker, problems can occur.

Code to fix issue is here:
https://github.com/feist/pcs/commit/e191dfe110b158dd4219d9183f1712cc63580308

Comment 3 David Vossel 2013-01-10 19:27:26 UTC

This can cause some rather confusing problems with pacemaker. Depending on the timing, this can make it look like the stop action for a resource fails after it is removed from the CIB. After the stop action is reported to have failed, pacemaker treats that resource as unmanaged and leaves it behind as an orphaned, failed resource.

Here's what I think is going on. The source of the problem is that crm_resource -C removes the resource entry from both the lrmd and the crmd cache after the cib entry has been removed but before the deleted resource's stop action completes. When the successful stop action for the resource is reported back to the crmd, the crmd ignores it because it no longer recognizes that the resource exists. Because of this, the cib update containing the successful stop operation never goes out. Eventually the stop action's timer in the crmd pops and the action is treated as a failure due to timeout.
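
To make the ordering problem easier to follow, here is a toy model of it in Python. It is only a sketch of the sequence described in this comment: the class ToyCrmd and all of its methods are hypothetical names made up for illustration and do not correspond to pacemaker's actual internals.

```python
class ToyCrmd:
    """Hypothetical stand-in for the crmd's view of resources and actions."""

    def __init__(self):
        self.known = set()          # resource cache
        self.pending_stops = set()  # stop actions whose timer is running
        self.cib_updates = []       # results that made it into the CIB
        self.failures = []          # actions recorded as failed

    def add_resource(self, rsc):
        self.known.add(rsc)

    def begin_stop(self, rsc):
        # The CIB entry was removed, so a stop is scheduled and timed.
        self.pending_stops.add(rsc)

    def cleanup(self, rsc):
        # Stands in for the effect of "crm_resource -C" on the cache.
        self.known.discard(rsc)

    def stop_result(self, rsc, rc):
        if rsc not in self.known:
            # Unknown resource: the result is dropped, no CIB update goes out,
            # and the stop timer keeps running.
            return False
        self.pending_stops.discard(rsc)
        self.cib_updates.append((rsc, "stop", rc))
        return True

    def timer_pop(self):
        # Any stop that never got a recorded result is treated as timed out.
        for rsc in sorted(self.pending_stops):
            self.failures.append((rsc, "stop", "Timed Out"))
        self.pending_stops.clear()


def simulate(cleanup_during_stop):
    crmd = ToyCrmd()
    crmd.add_resource("A")
    crmd.begin_stop("A")
    if cleanup_during_stop:
        crmd.cleanup("A")      # cache entry removed while the stop is in flight
    crmd.stop_result("A", 0)   # the stop itself succeeded
    crmd.timer_pop()
    return crmd.failures
```

With cleanup_during_stop=True the successful stop result is dropped and the model records a timed-out stop for A; with False the result is recorded and nothing fails.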

At some point pacemaker needs to address this condition, but in reality pcs should not be deleting the resource from both the cib and the lrmd. Pacemaker should clean everything up properly when the resource is deleted from the cib.
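
As a rough illustration of the difference, the sketch below builds the command lines a removal tool might run. This is not the pcs code from the linked commit. The function name removal_commands, its parameters, and the exact cibadmin invocation are assumptions for illustration; only the use of crm_resource -C comes from this report.

```python
def removal_commands(resource_xml, resource_id, cleanup_lrm=False):
    """Return the argv lists a removal might run (illustrative only).

    resource_xml and resource_id are hypothetical inputs. The cibadmin call
    is an assumed way of deleting the entry from the resources section.
    """
    cmds = [
        # Delete the resource definition from the CIB and let pacemaker
        # stop and clean up the resource on its own.
        ["cibadmin", "-o", "resources", "-D", "--xml-text", resource_xml],
    ]
    if cleanup_lrm:
        # The problematic extra step: wiping the lrmd/crmd entry while the
        # stop action may still be in flight.
        cmds.append(["crm_resource", "-C", "-r", resource_id])
    return cmds
```

With the default cleanup_lrm=False only the CIB deletion is issued, which is the behavior argued for above.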

I can confirm this patch to pcs resolves the issue I have outlined above. Here are my test results.

-------- WITHOUT PCS PATCH (A FAILS ON STOP ACTION)---------------
# rm -f /var/lib/pacemaker/cib/*
# rm -f /var/lib/pacemaker/pengine/*
# service corosync start
# service pacemaker start
# sleep 60
# pcs property set stonith-enabled=false
# pcs property set no-quorum-policy=ignore
# pcs resource create A ocf:pacemaker:Dummy op monitor interval=10s
# sleep 5
# pcs resource delete A
Deleting Resource - A
# sleep 120
# crm_mon -1
Last updated: Thu Jan 10 18:47:33 2013
Last change: Thu Jan 10 18:45:33 2013 via crmd on 18builder
Stack: corosync
Current DC: 18builder (4) - partition WITHOUT quorum
Version: 1.1.8-6f6a7fd
4 Nodes configured, unknown expected votes
0 Resources configured.


Online: [ 18builder ]
OFFLINE: [ 18node1 18node2 18node3 ]

 A	(ocf::pacemaker:Dummy):	 ORPHANED Started 18builder (unmanaged) FAILED

Failed actions:
    A_stop_0 (node=18builder, call=-1, rc=1, status=Timed Out): unknown error


---------- WITH PCS PATCH (NO FAILURES) -----------------
# rm -f /var/lib/pacemaker/cib/*
# rm -f /var/lib/pacemaker/pengine/*
# service corosync start
# service pacemaker start
# sleep 60
# pcs property set stonith-enabled=false
# pcs property set no-quorum-policy=ignore
# pcs resource create A ocf:pacemaker:Dummy op monitor interval=10s
# sleep 5
# pcs resource delete A
Deleting Resource - A
# crm_mon -1
Last updated: Thu Jan 10 18:52:54 2013
Last change: Thu Jan 10 18:50:54 2013 via cibadmin on 18builder
Stack: corosync
Current DC: 18builder (4) - partition WITHOUT quorum
Version: 1.1.8-6f6a7fd
4 Nodes configured, unknown expected votes
0 Resources configured.


Online: [ 18builder ]
OFFLINE: [ 18node1 18node2 18node3 ]
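
For repeated runs, the manual sequence above could be scripted. The following is a minimal sketch, not something that was used for these results: repro_steps and run_steps are hypothetical helper names, and the command strings are copied from the transcripts in this comment. Note that the first two steps wipe the local CIB and pengine state, so this only makes sense on a throwaway test node.

```python
import subprocess


def repro_steps(resource="A", settle=5, wait=120):
    """Return the shell commands of the reproducer, in order."""
    return [
        "rm -f /var/lib/pacemaker/cib/*",
        "rm -f /var/lib/pacemaker/pengine/*",
        "service corosync start",
        "service pacemaker start",
        "sleep 60",
        "pcs property set stonith-enabled=false",
        "pcs property set no-quorum-policy=ignore",
        "pcs resource create %s ocf:pacemaker:Dummy op monitor interval=10s" % resource,
        "sleep %d" % settle,
        "pcs resource delete %s" % resource,
        "sleep %d" % wait,  # give the stop timer time to pop
        "crm_mon -1",
    ]


def run_steps(steps, runner=None):
    """Run each step; runner can be swapped out for a dry run."""
    if runner is None:
        # shell=True is needed for the glob in the rm commands
        runner = lambda cmd: subprocess.check_call(cmd, shell=True)
    for cmd in steps:
        runner(cmd)
```

Passing runner=print gives a dry run that only lists the commands.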

Comment 4 Chris Feist 2013-01-15 21:50:31 UTC

Before fix:
[root@ask-03 ~]# rm -f /var/lib/pacemaker/cib/*
[root@ask-03 ~]# rm -f /var/lib/pacemaker/pengine/*
[root@ask-03 ~]# service corosync start
Starting Corosync Cluster Engine (corosync):               [  OK  ]
[root@ask-03 ~]# service pacemaker start
...
[root@ask-03 ~]# pcs property set stonith-enabled=false
[root@ask-03 ~]# pcs property set no-quorum-policy=ignore
[root@ask-03 ~]# pcs resource create A ocf:pacemaker:Dummy op monitor interval=10s
[root@ask-03 ~]# sleep 5
[root@ask-03 ~]# pcs resource delete A
Deleting Resource - A
[root@ask-03 ~]# sleep 120
[root@ask-03 ~]# crm_mon -1
Last updated: Tue Jan 15 15:40:35 2013
Last change: Tue Jan 15 15:38:31 2013 via crmd on ask-03
Stack: classic openais (with plugin)
Current DC: ask-03 - partition WITHOUT quorum
Version: 1.1.8-7.el6-394e906
1 Nodes configured, 2 expected votes
0 Resources configured.


Online: [ ask-03 ]

 A	(ocf::pacemaker:Dummy):	 ORPHANED Started ask-03 (unmanaged) FAILED

Failed actions:
    A_stop_0 (node=ask-03, call=-1, rc=1, status=Timed Out): unknown error


After Fix:
Make sure the cluster is shut down ('killall crmd' was necessary before 'pcs cluster stop')

[root@ask-03 ~]# rpm -q pcs
pcs-0.9.26-9.el6.noarch
[root@ask-03 ~]# rm -f /var/lib/pacemaker/cib/*
[root@ask-03 ~]# rm -f /var/lib/pacemaker/pengine/*
[root@ask-03 ~]# service corosync start
Starting Corosync Cluster Engine (corosync):               [  OK  ]
[root@ask-03 ~]# service pacemaker start
...
[root@ask-03 ~]# pcs property set stonith-enabled=false
[root@ask-03 ~]# pcs property set no-quorum-policy=ignore
[root@ask-03 ~]# pcs resource create A ocf:pacemaker:Dummy op monitor interval=10s
[root@ask-03 ~]# sleep 5
[root@ask-03 ~]# pcs resource delete A
Deleting Resource - A
[root@ask-03 ~]# sleep 120
[root@ask-03 ~]# crm_mon -1
Last updated: Tue Jan 15 15:52:07 2013
Last change: Tue Jan 15 15:50:06 2013 via cibadmin on ask-03
Stack: classic openais (with plugin)
Current DC: ask-03 - partition WITHOUT quorum
Version: 1.1.8-7.el6-394e906
1 Nodes configured, 2 expected votes
0 Resources configured.


Online: [ ask-03 ]

[root@ask-03 ~]#

Comment 6 Chris Feist 2013-01-22 18:20:59 UTC

One additional issue was found with the previous patch; adding a patch that fixes the problem.

https://github.com/feist/pcs/commit/2011cf7446f10a172048d105122ef96b839014aa

Comment 8 Chris Feist 2013-01-22 18:45:15 UTC

To test new fix: (basically the same as above, but with a master/slave resource)

Before Fix:
[root@ask-03 ~]# rm -f /var/lib/pacemaker/cib/*
[root@ask-03 ~]# rm -f /var/lib/pacemaker/pengine/*
[root@ask-03 ~]# service corosync start
Starting Corosync Cluster Engine (corosync):               [  OK  ]
[root@ask-03 ~]# service pacemaker start
...
[root@ask-03 ~]# pcs property set stonith-enabled=false
[root@ask-03 ~]# pcs property set no-quorum-policy=ignore
[root@ask-03 ~]# pcs resource create A ocf:pacemaker:Dummy op monitor interval=10s
[root@ask-03 ~]# pcs resource master ma A
[root@ask-03 ~]# sleep 5
[root@ask-03 ~]# pcs resource delete ma
[root@ask-03 ~]# sleep 120
[root@ask-03 ~]# crm_mon -1
Last updated: Tue Jan 22 12:41:46 2013
Last change: Tue Jan 22 12:39:43 2013 via crmd on ask-03
Stack: classic openais (with plugin)
Current DC: ask-03 - partition WITHOUT quorum
Version: 1.1.8-7.el6-394e906
1 Nodes configured, 2 expected votes
6 Resources configured.


Online: [ ask-03 ]

 Resource Group: GROUP1
     Q	(ocf::pacemaker:Dummy):	Started ask-03
     R	(ocf::pacemaker:Dummy):	Started ask-03
     S	(ocf::pacemaker:Dummy):	Started ask-03
 Resource Group: GROUP2
     T	(ocf::pacemaker:Dummy):	Started ask-03
     U	(ocf::pacemaker:Dummy):	Started ask-03
     V	(ocf::pacemaker:Dummy):	Started ask-03
 A	(ocf::pacemaker:Dummy):	 ORPHANED Started ask-03 (unmanaged) FAILED


(The important line contains the "(unmanaged) FAILED")
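
When checking this by script rather than by eye, something like the sketch below can pick out the affected resources from the crm_mon -1 text. The function name unmanaged_failed and the regular expression are assumptions based only on the output format shown in this report, so other crm_mon versions may need adjustments.

```python
import re

# Matches resource lines such as:
#   A  (ocf::pacemaker:Dummy):  ORPHANED Started ask-03 (unmanaged) FAILED
# Group 1 is the resource name, group 3 is the state text after the agent.
_RSC_LINE = re.compile(r"^\s*(\S+)\s+\(([^)]+)\):\s+(.*)$")


def unmanaged_failed(crm_mon_text):
    """Return names of resources shown as both (unmanaged) and FAILED."""
    names = []
    for line in crm_mon_text.splitlines():
        match = _RSC_LINE.match(line)
        if not match:
            continue
        state = match.group(3)
        if "(unmanaged)" in state and "FAILED" in state.split():
            names.append(match.group(1))
    return names
```

An empty result means no resource is in that state in the given output.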

After fix:
[root@ask-03 ~]# rpm -q pcs
pcs-0.9.26-10.el6.noarch
[root@ask-03 ~]# pcs resource create B ocf:pacemaker:Dummy op monitor interval=10s
[root@ask-03 ~]# pcs resource master mb B
[root@ask-03 ~]# sleep 5
[root@ask-03 ~]# pcs resource delete mb
[root@ask-03 ~]# sleep 120
[root@ask-03 ~]# crm_mon -1
Last updated: Tue Jan 22 12:47:12 2013
Last change: Tue Jan 22 12:45:09 2013 via cibadmin on ask-03
Stack: classic openais (with plugin)
Current DC: ask-03 - partition WITHOUT quorum
Version: 1.1.8-7.el6-394e906
1 Nodes configured, 2 expected votes
6 Resources configured.


Online: [ ask-03 ]

 Resource Group: GROUP1
     Q	(ocf::pacemaker:Dummy):	Started ask-03
     R	(ocf::pacemaker:Dummy):	Started ask-03
     S	(ocf::pacemaker:Dummy):	Started ask-03
 Resource Group: GROUP2
     T	(ocf::pacemaker:Dummy):	Started ask-03
     U	(ocf::pacemaker:Dummy):	Started ask-03
     V	(ocf::pacemaker:Dummy):	Started ask-03
 A	(ocf::pacemaker:Dummy):	 ORPHANED Started ask-03 (unmanaged) FAILED



(only shows A as unmanaged and failed; B/mb is gone)

Comment 9 Jaroslav Kortus 2013-01-23 10:45:03 UTC

Seems fixed in pcs-0.9.26-10.el6.noarch.

Operations on resources (both normal and master/slave) are now working as expected, allowing the resource to be recreated without any failures.

Error messages from syslog are also gone (especially those mentioning bad resource UUID). See bug 880249 for more details.

Marking as verified with pcs-0.9.26-10.el6.noarch.

Comment 11 errata-xmlrpc 2013-02-21 09:49:54 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2013-0369.html