Bug 893221 - pcs should not delete a resource from the lrm during resource removal
Summary: pcs should not delete a resource from the lrm during resource removal
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: pcs
Version: 6.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Chris Feist
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks: 768522 880249 902691
 
Reported: 2013-01-08 23:19 UTC by Chris Feist
Modified: 2013-02-21 09:49 UTC (History)
CC List: 8 users

Fixed In Version: pcs-0.9.26-10.el6
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-02-21 09:49:54 UTC
Target Upstream Version:
Embargoed:


Attachments: none


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2013:0369 0 normal SHIPPED_LIVE new packages: pcs 2013-02-20 20:52:38 UTC

Description Chris Feist 2013-01-08 23:19:58 UTC
pcs should not delete a resource from the lrm during resource removal; it should allow pacemaker to clean up the resource on its own.

If a resource is still running and pcs removes it from the lrm out from under pacemaker, problems can occur.

Code to fix issue is here:
https://github.com/feist/pcs/commit/e191dfe110b158dd4219d9183f1712cc63580308

Comment 3 David Vossel 2013-01-10 19:27:26 UTC
This can cause some rather confusing problems with pacemaker.  Depending on the timing, this can make it look like the stop action for a resource fails after it is removed from the CIB.  After the stop action is reported to have failed, pacemaker treats that resource as unmanaged and bad things happen.

Here's what I think is going on... The source of the problem is that crm_resource -C is removing the resource entry from both the lrmd and crmd cache after the cib entry has been removed but before the deleted resource's stop action completes. When the successful stop action for the resource is reported back to the crmd, it gets ignored because the crmd no longer recognizes that the resource exists. Because of this, the cib update containing the successful stop operation never goes out. Eventually the stop action's timer in the crmd pops and the action is treated as a failure due to timeout.

At some point pacemaker needs to address this condition, but in reality pcs should not be deleting the resource from both the cib and the lrmd. Pacemaker should clean everything up properly when the resource is deleted from the cib.
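The two deletion orders described above can be sketched as follows. This is a minimal illustration, not the actual pcs code: the resource name "A" matches the test transcripts, but the exact cibadmin/crm_resource invocations are assumptions.

```shell
# Sketch only: these functions print the command sequences rather than
# run them against a live cluster.

unsafe_delete() {
    # pre-patch behavior: purge the lrmd/crmd cache (crm_resource -C)
    # right after the CIB delete, before the stop action has completed
    echo "cibadmin --delete --xpath \"//primitive[@id='A']\""
    echo "crm_resource -C -r A"
}

safe_delete() {
    # post-patch behavior: delete only the CIB entry and let pacemaker
    # stop the orphaned resource and clean up its own caches
    echo "cibadmin --delete --xpath \"//primitive[@id='A']\""
}

unsafe_delete
safe_delete
```

The difference is purely one of ordering and ownership: in the safe variant, the lrmd entry is never touched by the tool, so the crmd still knows about the resource when the stop result comes back.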

I can confirm this patch to pcs resolves the issue I have outlined above. Here are my test results.

-------- WITHOUT PCS PATCH (A FAILS ON STOP ACTION)---------------
# rm -f /var/lib/pacemaker/cib/*
# rm -f /var/lib/pacemaker/pengine/*
# service corosync start
# service pacemaker start
# sleep 60
# pcs property set stonith-enabled=false
# pcs property set no-quorum-policy=ignore
# pcs resource create A ocf:pacemaker:Dummy op monitor interval=10s
# sleep 5
# pcs resource delete A
Deleting Resource - A
# sleep 120
# crm_mon -1
Last updated: Thu Jan 10 18:47:33 2013
Last change: Thu Jan 10 18:45:33 2013 via crmd on 18builder
Stack: corosync
Current DC: 18builder (4) - partition WITHOUT quorum
Version: 1.1.8-6f6a7fd
4 Nodes configured, unknown expected votes
0 Resources configured.


Online: [ 18builder ]
OFFLINE: [ 18node1 18node2 18node3 ]

 A	(ocf::pacemaker:Dummy):	 ORPHANED Started 18builder (unmanaged) FAILED

Failed actions:
    A_stop_0 (node=18builder, call=-1, rc=1, status=Timed Out): unknown error


---------- WITH PCS PATCH (NO FAILURES) -----------------
# rm -f /var/lib/pacemaker/cib/*
# rm -f /var/lib/pacemaker/pengine/*
# service corosync start
# service pacemaker start
# sleep 60
# pcs property set stonith-enabled=false
# pcs property set no-quorum-policy=ignore
# pcs resource create A ocf:pacemaker:Dummy op monitor interval=10s
# sleep 5
# pcs resource delete A
Deleting Resource - A
# crm_mon -1
Last updated: Thu Jan 10 18:52:54 2013
Last change: Thu Jan 10 18:50:54 2013 via cibadmin on 18builder
Stack: corosync
Current DC: 18builder (4) - partition WITHOUT quorum
Version: 1.1.8-6f6a7fd
4 Nodes configured, unknown expected votes
0 Resources configured.


Online: [ 18builder ]
OFFLINE: [ 18node1 18node2 18node3 ]

Comment 4 Chris Feist 2013-01-15 21:50:31 UTC
Before fix:
[root@ask-03 ~]# rm -f /var/lib/pacemaker/cib/*
[root@ask-03 ~]# rm -f /var/lib/pacemaker/pengine/*
[root@ask-03 ~]# service corosync start
Starting Corosync Cluster Engine (corosync):               [  OK  ]
[root@ask-03 ~]# service pacemaker start
...
[root@ask-03 ~]# pcs property set stonith-enabled=false
[root@ask-03 ~]# pcs property set no-quorum-policy=ignore
[root@ask-03 ~]# pcs resource create A ocf:pacemaker:Dummy op monitor interval=10s
[root@ask-03 ~]# sleep 5
[root@ask-03 ~]# pcs resource delete A
Deleting Resource - A
[root@ask-03 ~]# sleep 120
[root@ask-03 ~]# crm_mon -1
Last updated: Tue Jan 15 15:40:35 2013
Last change: Tue Jan 15 15:38:31 2013 via crmd on ask-03
Stack: classic openais (with plugin)
Current DC: ask-03 - partition WITHOUT quorum
Version: 1.1.8-7.el6-394e906
1 Nodes configured, 2 expected votes
0 Resources configured.


Online: [ ask-03 ]

 A	(ocf::pacemaker:Dummy):	 ORPHANED Started ask-03 (unmanaged) FAILED

Failed actions:
    A_stop_0 (node=ask-03, call=-1, rc=1, status=Timed Out): unknown error


After Fix:
Make sure the cluster is shut down ('killall crmd' was necessary before 'pcs cluster stop')

[root@ask-03 ~]# rpm -q pcs
pcs-0.9.26-9.el6.noarch
[root@ask-03 ~]# rm -f /var/lib/pacemaker/cib/*
[root@ask-03 ~]# rm -f /var/lib/pacemaker/pengine/*
[root@ask-03 ~]# service corosync start
Starting Corosync Cluster Engine (corosync):               [  OK  ]
[root@ask-03 ~]# service pacemaker start
...
[root@ask-03 ~]# pcs property set stonith-enabled=false
[root@ask-03 ~]# pcs property set no-quorum-policy=ignore
[root@ask-03 ~]# pcs resource create A ocf:pacemaker:Dummy op monitor interval=10s
[root@ask-03 ~]# sleep 5
[root@ask-03 ~]# pcs resource delete A
Deleting Resource - A
[root@ask-03 ~]# sleep 120
[root@ask-03 ~]# crm_mon -1
Last updated: Tue Jan 15 15:52:07 2013
Last change: Tue Jan 15 15:50:06 2013 via cibadmin on ask-03
Stack: classic openais (with plugin)
Current DC: ask-03 - partition WITHOUT quorum
Version: 1.1.8-7.el6-394e906
1 Nodes configured, 2 expected votes
0 Resources configured.


Online: [ ask-03 ]

[root@ask-03 ~]#

Comment 6 Chris Feist 2013-01-22 18:20:59 UTC
One additional issue was found with the previous patch; the following patch fixes it.

https://github.com/feist/pcs/commit/2011cf7446f10a172048d105122ef96b839014aa

Comment 8 Chris Feist 2013-01-22 18:45:15 UTC
To test the new fix (basically the same as above, but with a master/slave resource):

Before Fix:
[root@ask-03 ~]# rm -f /var/lib/pacemaker/cib/*
[root@ask-03 ~]# rm -f /var/lib/pacemaker/pengine/*
[root@ask-03 ~]# service corosync start
Starting Corosync Cluster Engine (corosync):               [  OK  ]
[root@ask-03 ~]# service pacemaker start
...
[root@ask-03 ~]# pcs property set stonith-enabled=false
[root@ask-03 ~]# pcs property set no-quorum-policy=ignore
[root@ask-03 ~]# pcs resource create A ocf:pacemaker:Dummy op monitor interval=10s
[root@ask-03 ~]# pcs resource master ma A
[root@ask-03 ~]# sleep 5
[root@ask-03 ~]# pcs resource delete ma
[root@ask-03 ~]# sleep 120
[root@ask-03 ~]# crm_mon -1
Last updated: Tue Jan 22 12:41:46 2013
Last change: Tue Jan 22 12:39:43 2013 via crmd on ask-03
Stack: classic openais (with plugin)
Current DC: ask-03 - partition WITHOUT quorum
Version: 1.1.8-7.el6-394e906
1 Nodes configured, 2 expected votes
6 Resources configured.


Online: [ ask-03 ]

 Resource Group: GROUP1
     Q	(ocf::pacemaker:Dummy):	Started ask-03
     R	(ocf::pacemaker:Dummy):	Started ask-03
     S	(ocf::pacemaker:Dummy):	Started ask-03
 Resource Group: GROUP2
     T	(ocf::pacemaker:Dummy):	Started ask-03
     U	(ocf::pacemaker:Dummy):	Started ask-03
     V	(ocf::pacemaker:Dummy):	Started ask-03
 A	(ocf::pacemaker:Dummy):	 ORPHANED Started ask-03 (unmanaged) FAILED


(The important line contains the "(unmanaged) FAILED")

After fix:
[root@ask-03 ~]# rpm -q pcs
pcs-0.9.26-10.el6.noarch
[root@ask-03 ~]# pcs resource create B ocf:pacemaker:Dummy op monitor interval=10s
[root@ask-03 ~]# pcs resource master mb B
[root@ask-03 ~]# sleep 5
[root@ask-03 ~]# pcs resource delete mb
[root@ask-03 ~]# sleep 120
[root@ask-03 ~]# crm_mon -1
Last updated: Tue Jan 22 12:47:12 2013
Last change: Tue Jan 22 12:45:09 2013 via cibadmin on ask-03
Stack: classic openais (with plugin)
Current DC: ask-03 - partition WITHOUT quorum
Version: 1.1.8-7.el6-394e906
1 Nodes configured, 2 expected votes
6 Resources configured.


Online: [ ask-03 ]

 Resource Group: GROUP1
     Q	(ocf::pacemaker:Dummy):	Started ask-03
     R	(ocf::pacemaker:Dummy):	Started ask-03
     S	(ocf::pacemaker:Dummy):	Started ask-03
 Resource Group: GROUP2
     T	(ocf::pacemaker:Dummy):	Started ask-03
     U	(ocf::pacemaker:Dummy):	Started ask-03
     V	(ocf::pacemaker:Dummy):	Started ask-03
 A	(ocf::pacemaker:Dummy):	 ORPHANED Started ask-03 (unmanaged) FAILED



(only shows A as unmanaged/FAILED; B and mb are gone)

Comment 9 Jaroslav Kortus 2013-01-23 10:45:03 UTC
Seems fixed in pcs-0.9.26-10.el6.noarch.

Operations on resources (both normal and master/slave) are now working as expected, allowing the resource to be recreated and without any failures.

Error messages from syslog are also gone (especially those mentioning bad resource UUID). See bug 880249 for more details.

Marking as verified with pcs-0.9.26-10.el6.noarch.

Comment 11 errata-xmlrpc 2013-02-21 09:49:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2013-0369.html

