1843593 – [GUI] pcsd is sometimes unable to remove resources

Summary: [GUI] pcsd is sometimes unable to remove resources

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	pcs
Sub Component:
Version:	7.9
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	rc
Target Release:	---
Assignee:	Ondrej Mular
QA Contact:	cluster-qe@redhat.com
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-06-03 15:40 UTC by Nina Hostakova
Modified:	2020-09-29 20:10 UTC
CC List:	9 users
Fixed In Version:	pcs-0.9.169-3.el7
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1849994
Environment:
Last Closed:	2020-09-29 20:10:26 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1346852	0	high	CLOSED	[GUI] Bad Request when resource removal takes longer than pcs expects	2021-02-22 00:41:40 UTC
Red Hat Bugzilla	1579911	1	None	None	None	2021-01-20 06:05:38 UTC
Red Hat Product Errata	RHBA-2020:3964	0	None	None	None	2020-09-29 20:10:42 UTC

Description Nina Hostakova 2020-06-03 15:40:41 UTC

Description of problem:
When removing resources via the web UI, the action sometimes ends with this timeout message: 'Operation takes longer to complete than expected'. The resources may then not be removed at all, but the behavior varies (see below).

Version-Release number of selected component (if applicable):
pcs-0.9.169-2.el7.x86_64

How reproducible:
Reproduced on virtual (more often) as well as physical machines, but not always.

Steps to Reproduce:
There are no steps that reproduce the issue reliably, but these were the most frequently problematic:

1.a) in the web UI, add multiple resources of different kinds (with required parameters set to 123), e.g.:
ocf:heartbeat:aliyun-vpc-move-ip,
ocf:heartbeat:apache,
ocf:heartbeat:aws-vpc-move-ip,
ocf:heartbeat:aws-vpc-route53,
ocf:heartbeat:awseip

1.b) try to remove one, several, or all of them

2.a) add multiple resources of the same type
e.g: 15x ocf:heartbeat:apache

2.b) try to remove all of them

Actual results:
We encountered these scenarios:
1. Resources are removed without any problem
2. Timeout message and the resources are removed
3. Timeout message and the resources are removed after a longer time (minutes)
4. Timeout message and the resources are not removed (within a reasonable amount of time)
5. Timeout message and only some of the resources are removed

If the resources are not removed, they become disabled but are still present, in the CLI as well.
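
The scenarios above can be told apart by comparing the IDs selected for removal with the IDs still present afterwards. The following is a minimal sketch of that comparison, not part of pcs; the names `Outcome` and `classify_removal` are hypothetical, and obtaining the list of remaining IDs (e.g. from the CLI) is left to the caller.

```python
from enum import Enum


class Outcome(Enum):
    ALL_REMOVED = "all removed"              # scenarios 1-3
    PARTIALLY_REMOVED = "partially removed"  # scenario 5
    NONE_REMOVED = "none removed"            # scenario 4


def classify_removal(requested, remaining):
    """Classify a removal attempt.

    requested: IDs of the resources selected for removal.
    remaining: IDs still present in the cluster afterwards.
    """
    requested = set(requested)
    if not requested:
        raise ValueError("no resources were requested for removal")
    # Only resources that were selected for removal count as leftovers.
    leftover = requested & set(remaining)
    if not leftover:
        return Outcome.ALL_REMOVED
    if leftover == requested:
        return Outcome.NONE_REMOVED
    return Outcome.PARTIALLY_REMOVED
```

For example, `classify_removal(["r1", "r2"], ["r2", "other"])` yields `Outcome.PARTIALLY_REMOVED`. The sketch does not distinguish scenarios 1 to 3, which differ only in whether the timeout message appeared and how long the removal took.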

Expected results:
The resources should be removed when the action is taken. If there is a problem (e.g. the timeout), the UI should display a message saying that some of the resources cannot be removed or that the removal will take additional time.

Comment 2 Ondrej Mular 2020-06-10 13:30:48 UTC

Just for reference, part of this was previously addressed in bz#1346852 and bz#1579911.

Comment 5 Ondrej Mular 2020-06-16 12:41:47 UTC

After many hours of trying to reproduce this issue, investigating possible causes, and reading through the logs provided by QE, we've observed that the issue is reproducible mainly under these conditions:
 * machines are under heavy load
 * removed resources have failed actions

We've been able to identify and fix a possible race condition when removing a resource with failed actions. Upstream patch: https://github.com/ClusterLabs/pcs/commit/b78e5b448e4a2dc2ad1e4df45515ceb01f6f103e

This patch alone prevents only a subset of the possible failures. There is still a chance of a different race condition (reproducible on machines under heavy load), which we are unable to prevent without a complete overhaul of the `pcs resource delete` command, tracked in bz#1420298.

Comment 9 Ondrej Mular 2020-06-18 09:09:50 UTC

Upstream patch linked in comment#5

Comment 13 Michal Mazourek 2020-06-23 11:32:04 UTC

Result:
=======
As explained in comment 5, this patch alone prevents only a subset of the possible failures. There is no more time to work on this problem in RHEL 7, so this bug will be cloned to RHEL 8. The clone will probably depend on bz1420298, as mentioned in comment 5. This patch also clarifies the timeout message shown when deleting multiple resources. The problem with stuck resources after a multi-resource delete still occurs, but for the reasons above, marking as VERIFIED in pcs-0.9.169-3.el7.


Testing:
========

[root@virt-026 ~]# rpm -q pcs
pcs-0.9.169-3.el7.x86_64


1. Log in to the web UI (e.g. https://virt-026.cluster-qe.lab.eng.brq.redhat.com:2224)
2. Create a cluster and add it to the web UI
3. Go to the Resources page of the cluster
4. Add 30x ocf:heartbeat:Dummy (running, no failures)
5. Select them all and try to remove them (without Enforce removal)

> Repeating steps 4 and 5 several times, since the problem isn't 100% reproducible
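
Since steps 4 and 5 have to be repeated, the setup in step 4 could be scripted. The sketch below only builds the command lines; it assumes the resources are created from the CLI with `pcs resource create` rather than through the web UI as in the test above, and the function name and the `dummy-N` resource IDs are hypothetical.

```python
def build_create_commands(count, agent="ocf:heartbeat:Dummy", prefix="dummy"):
    """Return one `pcs resource create` argument list per resource.

    The IDs (prefix-1 .. prefix-count) are made up for illustration.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    return [
        ["pcs", "resource", "create", f"{prefix}-{i}", agent]
        for i in range(1, count + 1)
    ]
```

`build_create_commands(30)` gives 30 argument lists, each of which can be run as root on a cluster node, e.g. with `subprocess.run(cmd, check=True)`. The removal itself still has to be done in the web UI to exercise the code path discussed in this bug.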

# cases:
a) Resources got removed and no warning message appears
> OK

b) Warning message appears with changed text: 'Operation takes longer to complete than expected but it will continue in the background.' and the resources got removed.
> OK, the message is clearer now. Sometimes it takes a while (minutes) to remove the resources, but they are eventually removed

c) Warning message appears and most of the resources got removed within minutes. A few resources are disabled, but still not removed after 20 minutes.
> This issue needs to be addressed in the clone

# The case where no resource got deleted wasn't reproduced.
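
Because the removal continues in the background, checking cases b) and c) means polling the cluster until the resources disappear or a deadline passes. A minimal sketch of such a check follows. It is not how pcsd works internally; `wait_for_removal` and the caller-supplied `list_resource_ids` callable are hypothetical, and the 20-minute default simply mirrors the wait mentioned in case c).

```python
import time


def wait_for_removal(list_resource_ids, requested, timeout=1200.0,
                     interval=10.0, clock=time.monotonic, sleep=time.sleep):
    """Poll until none of the requested IDs are present or the timeout expires.

    list_resource_ids: callable returning the IDs currently in the cluster.
    Returns the set of requested IDs still present (empty means all removed).
    """
    requested = set(requested)
    deadline = clock() + timeout
    while True:
        leftover = requested & set(list_resource_ids())
        # Stop as soon as everything is gone or the deadline has passed.
        if not leftover or clock() >= deadline:
            return leftover
        sleep(interval)
```

An empty return value corresponds to cases a) and b); a non-empty one lists the resources that are stuck, as in case c). The `clock` and `sleep` parameters exist only so that the loop can be exercised without real waiting.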


6. Add these resources with arbitrary parameters (such as '123'), which will cause failures:
	ocf:heartbeat:aliyun-vpc-move-ip,
	ocf:heartbeat:apache,
	ocf:heartbeat:aws-vpc-move-ip,
	ocf:heartbeat:aws-vpc-route53,
	ocf:heartbeat:awseip
7. Select them all and try to remove them (without Enforce removal)

> Repeating steps 6 and 7 several times, since the problem isn't 100% reproducible

# cases:
a) Resources got removed and no warning message appears
> OK

b) Warning message appeared with changed text: 'Operation takes longer to complete than expected but it will continue in the background.' and the resources got deleted (sometimes after a few minutes)
> OK, this was the most frequent case

# The case where a resource got stuck wasn't reproduced.


8. Add 10 random resources with random parameters:
	ocf:heartbeat:aliyun-vpc-move-ip,
	ocf:heartbeat:apache,
	ocf:heartbeat:aws-vpc-move-ip,
	ocf:heartbeat:aws-vpc-route53,
	ocf:heartbeat:awseip,
	ocf:heartbeat:awsvip,
	ocf:heartbeat:azure-events,
	ocf:heartbeat:azure-lb,
	ocf:heartbeat:clvm,
	ocf:heartbeat:conntrackd
9. Select them all and try to remove them (without Enforce removal)

# case: All resources got disabled with the timeout message, but were not deleted (after 2+ hours).
> This issue needs to be addressed in the clone

Comment 15 errata-xmlrpc 2020-09-29 20:10:26 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (pcs bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3964
