Bug 1220512 - pcs resource cleanup improvements
Summary: pcs resource cleanup improvements
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pcs
Version: 7.2
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Tomas Jelinek
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Duplicates: 1323901 1366514 (view as bug list)
Depends On:
Blocks:
 
Reported: 2015-05-11 17:47 UTC by David Vossel
Modified: 2017-11-14 15:15 UTC
CC List: 8 users

Fixed In Version: pcs-0.9.151-1.el7
Doc Type: Bug Fix
Doc Text:
Cause: The user runs the 'pcs resource cleanup' command in a cluster with a high number of resources and/or nodes.
Consequence: The cluster may become less responsive for a while.
Fix: Display a warning describing the negative impact of the command where appropriate, and add options to the command to specify the resource and/or node to run on.
Result: The user is informed about the negative impact and has options to reduce it while still being able to perform the desired operation.
Clone Of:
Environment:
Last Closed: 2016-11-03 20:54:04 UTC
Target Upstream Version:
Embargoed:


Attachments
proposed fix (16.07 KB, patch)
2016-02-26 16:06 UTC, Tomas Jelinek


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1508351 0 high CLOSED pcs resource cleanup is overkill in most scenarios 2021-02-22 00:41:40 UTC
Red Hat Product Errata RHSA-2016:2596 0 normal SHIPPED_LIVE Moderate: pcs security, bug fix, and enhancement update 2016-11-03 12:11:34 UTC

Internal Links: 1508351

Description David Vossel 2015-05-11 17:47:17 UTC
Description of problem:

'pcs resource cleanup' translates directly to 'crm_resource -C'. Behind the scenes, this command makes pacemaker completely wipe its resource state history. To rebuild resource state, pacemaker must execute a monitor operation for every resource in the cluster on every node in the cluster. This means 'crm_resource -C' will always result in (resources * nodes) operations being executed on the cluster.

For small clusters, this isn't a big deal: 3 nodes with 10 resources means 30 monitor operations for pacemaker to rebuild state. For large clusters, however, 16 nodes with 100 resources would result in 1600 monitor operations before pacemaker can rebuild state.
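
A quick way to estimate that number up front, before deciding whether a cluster-wide cleanup is worth it (a rough sketch using the standard pacemaker CLI tools, not part of pcs itself):

# Count nodes and resources; the product approximates the number of monitor
# operations a full 'pcs resource cleanup' would trigger.
nodes=$(crm_node -l | wc -l)
resources=$(crm_resource --list-raw | wc -l)
echo "a full cleanup would trigger roughly $((nodes * resources)) monitor operations"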

Looking through customers' logs, I'm seeing a trend emerge. If there's ever a failure, users just run 'pcs resource cleanup' to make it go away. This is starting to cause problems, though, because on large clusters 'pcs resource cleanup' makes the pacemaker cluster appear unresponsive for several minutes while all the monitor operations are in flight.

To keep people from unintentionally hosing their clusters, I think 'pcs resource cleanup' needs some more options and safeguards in place.

1. We need a way to specify the node the cleanup should occur on. 'crm_resource -C -N <node>' allows us to re-detect resource status only on a single node. The node can be combined with a resource id as well: 'crm_resource -C -N <node> -r <resource id>'. This lets us re-detect a single resource on a single node rather than every resource on that node (see the sketch below, after this list).

2. We should consider requiring a --force option for 'pcs resource cleanup' when we detect that the command will generate enough monitor operations to negatively impact the responsiveness of the cluster.

For example, if someone issues 'pcs resource cleanup' on a cluster with 16 nodes and 100 resources, we should be able to detect that this will result in 1600 operations and warn the user, requiring them to use --force to proceed with the command.

Detection of 100 or more resulting operations seems like a decent threshold for requiring --force.
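
For reference, the scoped calls proposed in item 1 look like this at the crm_resource level ('node1' and 'my-resource' are placeholder names):

crm_resource -C                           # cluster-wide: re-probes every resource on every node
crm_resource -C -N node1                  # limit re-detection to a single node
crm_resource -C -N node1 -r my-resource   # a single resource on a single node
crm_resource -C -r my-resource            # a single resource, but still checked on every node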

Comment 4 Tomas Jelinek 2016-02-26 16:06:06 UTC
Created attachment 1130861 [details]
proposed fix

Test:
Add nodes and/or resources to the cluster so that the number of resources times the number of nodes exceeds 100.
[root@rh72-node1:~]# pcs status | grep configured
2 nodes and 53 resources configured
[root@rh72-node1:~]# pcs resource cleanup
Error: Cleaning up all resources on all nodes will execute more than 100 operations in the cluster, which may negatively impact the responsiveness of the cluster. Consider specifying resource and/or node, use --force to override
[root@rh72-node1:~]# echo $?
1
[root@rh72-node1:~]# pcs resource cleanup dummy
Waiting for 2 replies from the CRMd.. OK
Cleaning up dummy on rh72-node1, removing fail-count-dummy
Cleaning up dummy on rh72-node2, removing fail-count-dummy
[root@rh72-node1:~]# echo $?
0
[root@rh72-node1:~]# pcs resource cleanup --node rh72-node1
Waiting for 1 replies from the CRMd. OK
[root@rh72-node1:~]# echo $?
0
[root@rh72-node1:~]# pcs resource cleanup --node rh72-node1 dummy
Waiting for 1 replies from the CRMd. OK
Cleaning up dummy on rh72-node1, removing fail-count-dummy
[root@rh72-node1:~]# echo $?
0
[root@rh72-node1:~]# pcs resource cleanup --force
Waiting for 1 replies from the CRMd. OK
[root@rh72-node1:~]# echo $?
0

[root@rh72-node1:~]# pcs status | grep configured
2 nodes and 3 resources configured
[root@rh72-node1:~]# pcs resource cleanup
Waiting for 1 replies from the CRMd. OK
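
For reference, the guard firing in the first run but not the last follows from multiplying the counts shown above (a back-of-the-envelope check of the threshold, not the actual pcs code):

echo $((2 * 53))   # 106 > 100 -> cluster-wide cleanup refused without --force
echo $((2 * 3))    # 6 <= 100  -> cluster-wide cleanup proceeds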

Comment 5 Mike McCune 2016-03-28 23:40:50 UTC
This bug was accidentally moved from POST to MODIFIED via an error in automation; please contact mmccune with any questions.

Comment 6 Tomas Jelinek 2016-04-05 07:26:13 UTC
*** Bug 1323901 has been marked as a duplicate of this bug. ***

Comment 7 Ivan Devat 2016-05-31 12:25:10 UTC
Setup:
[vm-rhel72-1 ~] $ for i in {a..b}; do for j in {a..z}; do pcs resource create ${i}${j} Dummy; done ;done
[vm-rhel72-1 ~] $ pcs status | grep configured
2 nodes and 52 resources configured

Before fix:
[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.143-15.el7.x86_64

[vm-rhel72-1 ~] $ pcs resource cleanup
Waiting for 1 replies from the CRMd. OK

[vm-rhel72-1 ~] $ pcs resource cleanup --node vm-rhel72-1 aa
Waiting for 2 replies from the CRMd.. OK
Cleaning up aa on vm-rhel72-1, removing fail-count-aa
Cleaning up aa on vm-rhel72-3, removing fail-count-aa


After Fix:
[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.151-1.el7.x86_64

[vm-rhel72-1 ~] $ pcs resource cleanup
Error: Cleaning up all resources on all nodes will execute more than 100 operations in the cluster, which may negatively impact the responsiveness of the cluster. Consider specifying resource and/or node, use --force to override
[vm-rhel72-1 ~] $ echo $?
1

[vm-rhel72-1 ~] $ pcs resource cleanup --node vm-rhel72-1 aa
Waiting for 1 replies from the CRMd. OK
Cleaning up aa on vm-rhel72-1, removing fail-count-aa

[vm-rhel72-1 ~] $ for i in {a..b}; do for j in {a..z}; do pcs resource delete ${i}${j} Dummy; done ;done
[vm-rhel72-1 ~] $ for i in {a..z}; do pcs resource create ${i} Dummy; done
[vm-rhel72-1 ~] $ pcs status | grep configured
2 nodes and 52 resources configured

[vm-rhel72-1 ~] $ pcs resource cleanup
Waiting for 1 replies from the CRMd. OK

Comment 10 Tomas Jelinek 2016-09-23 07:25:21 UTC
*** Bug 1366514 has been marked as a duplicate of this bug. ***

Comment 12 errata-xmlrpc 2016-11-03 20:54:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2596.html

