Bug 1849994 - [GUI] pcsd is sometimes unable to remove resources
Summary: [GUI] pcsd is sometimes unable to remove resources
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: pcs
Version: 8.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Tomas Jelinek
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On: 1420298
Blocks:
 
Reported: 2020-06-23 11:37 UTC by Michal Mazourek
Modified: 2021-12-23 07:27 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1843593
Environment:
Last Closed: 2021-12-23 07:27:00 UTC
Type: Bug
Target Upstream Version:
Embargoed:



Description Michal Mazourek 2020-06-23 11:37:51 UTC
+++ This bug was initially created as a clone of Bug #1843593 +++

Description of problem:
When trying to remove resources via the web UI, the action may in some cases finish with this timeout message: 'Operation takes longer to complete than expected'. The resources might not be removed after all, but the behavior varies (see below).

Version-Release number of selected component (if applicable):
pcs-0.9.169-2.el7.x86_64

How reproducible:
Reproduced on virtual machines (more often) as well as on physical machines, but not always.

Steps to Reproduce:
Although there are no steps that reproduce the issue reliably every time, the following were the most frequently problematic (a CLI sketch for bulk-creating the resources follows the list):

1.a) in web UI, add multiple resources of different kinds (with required parameters set to 123)
e.g.: ocf:heartbeat:aliyun-vpc-move-ip,
     ocf:heartbeat:apache,
     ocf:heartbeat:aws-vpc-move-ip,
     ocf:heartbeat:aws-vpc-route53,
     ocf:heartbeat:awseip

1.b) try to remove 1, multiple or all of them

2.a) add multiple resources of the same type 
e.g.: 15x ocf:heartbeat:apache

2.b) try to remove all of them
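
For reference, the resources can also be bulk-created from the CLI instead of the web UI. This is a minimal sketch assuming pcs 0.9.x syntax; the resource names apache-1 .. apache-15 are arbitrary and only for illustration:

# Create 15 apache resources in a loop (CLI equivalent of step 2.a)
for i in $(seq 1 15); do
    pcs resource create "apache-${i}" ocf:heartbeat:apache
done

# Confirm they are present before removing them via the web UI
pcs resource show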

Actual results:
We encountered these scenarios:
1. Resources are removed without any problem
2. Timeout message and the resources are removed
3. Timeout message and resources are removed after a longer time (~ minutes)
4. Timeout message and resources are not removed (within a reasonable amount of time)
5. Timeout message and only some of the resources are removed

When the resources are not removed, they become disabled but remain present, which is also visible in the CLI.

Expected results:
The resources should be removed once the action is taken. If there is a problem (e.g. the timeout), it would be appropriate to display a message saying that some of the resources cannot be removed or that the removal will take additional time.

--- Additional comment from RHEL Program Management on 2020-06-03 15:40:47 UTC ---

Since this bug report was entered in Red Hat Bugzilla, the release flag has been set to ? to ensure that it is properly evaluated for this release.

--- Additional comment from Ondrej Mular on 2020-06-10 13:30:48 UTC ---

Just for reference, part of this was previously addressed in bz#1346852 and bz#1579911.

--- Additional comment from Michal Mazourek on 2020-06-10 15:08:17 UTC ---

Adding a snippet from /var/log/pcs/pcsd.log from a scenario where resources were disabled but not deleted.

Scenario:
These resources were added in the GUI with random required parameters (which may cause the resources to fail or be stopped):
ocf:heartbeat:aliyun-vpc-move-ip, 
ocf:heartbeat:apache,
ocf:heartbeat:aws-vpc-move-ip, 
ocf:heartbeat:aws-vpc-route53, 
ocf:heartbeat:awseip

Then, via the CLI, these commands were run:

# Login
curl --insecure --data "username=hacluster&password=password" --cookie-jar cookie.txt  https://virt-141.cluster-qe.lab.eng.brq.redhat.com:2224/login

# Deleting the resources
curl --insecure --cookie cookie.txt --data "no_error_if_not_exists=true&resid-a1=true&resid-a2=true&resid-a3=true&resid-a4=true&resid-a5=true" --header "X-Requested-With: XMLHttpRequest" "https://virt-141.cluster-qe.lab.eng.brq.redhat.com:2224/managec/test/remove_resource"
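
To check whether the removal actually finished after the call above returns (or times out), a polling loop like the following can be used. This is a minimal sketch assuming pcs 0.9.x, where 'pcs resource show' lists all configured resources; the resource IDs a1..a5 and the 300-second deadline simply mirror the request above.

# Poll until resources a1..a5 disappear or the deadline is reached
deadline=$((SECONDS + 300))
while pcs resource show | grep -qwE 'a[1-5]'; do
    if (( SECONDS >= deadline )); then
        echo "resources still present after 300 s" >&2
        break
    fi
    sleep 10
done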

--- Additional comment from Michal Mazourek on 2020-06-10 15:11:44 UTC ---

A debug log from the second node, from the scenario in comment 3.

--- Additional comment from Ondrej Mular on 2020-06-16 12:41:47 UTC ---

After hours and hours of trying to reproduce this issue, investigating possible causes, and reading through logs provided by QE, we've observed that the issue is reproducible mainly under these conditions:
 * machines are under heavy load
 * removed resources have failed actions

We've been able to identify and fix a possible race condition when removing a resource with failed actions. Upstream patch: https://github.com/ClusterLabs/pcs/commit/b78e5b448e4a2dc2ad1e4df45515ceb01f6f103e

This patch alone will prevent only a subset of possible failures, and there is still a chance of a different race condition (reproducible on machines under heavy load) which we are unable to prevent without a complete overhaul of the `pcs resource delete` command, tracked in bz#1420298.
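
Since failed actions on the removed resources are one of the two conditions above, it can help to confirm them from the CLI before deleting. A minimal sketch using standard pacemaker/pcs tooling; the resource name a1 is only an example:

# One-shot cluster status including resource fail counts (pacemaker's crm_mon)
crm_mon -1 -f

# Fail count of a single resource via pcs (pcs 0.9.x syntax)
pcs resource failcount show a1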

--- Additional comment from RHEL Program Management on 2020-06-17 13:10:39 UTC ---

A request has been made to complete this BZ after the deadline. Please follow the instructions in this comment and answer all 3 questions. You can fill out this form by clicking the [reply] link on this comment, and then reply in-line to this message. Please do this even if you believe the case is obvious, or already covered in the BZ, as a way to make it easier for reviewers to approve this request.

Verify the following information is set in the BZ:
- Confirm the Release flag reflects the correct release
- Set the Target Milestone field to indicate when the work can be done (alpha, beta or rc)

Prepare responsible parties to take action:
- Verify the subsystem team indicated in the Pool field supports a release exception
- Ensure qa_ack+ and devel_ack+ are set and assignees are ready to complete the work by the Target Milestone

Answer the following 3 questions:

1. What is the impact of waiting until the next release to include this BZ? Reviewers want to know which RHEL features or customers are affected and if it will impact any Layered Product or Hardware partner plans.

2. What is the risk associated with the fix? Reviewers want to know if the fix is contained, testable, and there is enough time to verify the work without impacting the schedule or other commitments.

3. Provide any other details that should be weighed in making a decision (other releases affected, upstream status, business impact, etc.).

--- Additional comment from Prokop Vlasin on 2020-06-17 15:54:58 UTC ---

Would you mind responding to the questions above, please? This helps the voting members assess the validity of the "Exception" request. Thank you very much.

--- Additional comment from Chris Feist on 2020-06-17 19:19:14 UTC ---

1.  This is the last 7.9 release, so if we don't get this fixed, it will not be fixed within the RHEL 7 lifecycle.  Customers will always have the potential for failures when removing resources using the GUI.
2.  QE and devel have approved this exception and are ready to test.  We have been able to reproduce it on the devel side.
3.  N/A

--- Additional comment from Ondrej Mular on 2020-06-18 09:09:50 UTC ---

Upstream patch linked in comment #5.

--- Additional comment from Prokop Vlasin on 2020-06-18 14:16:27 UTC ---

Approved as an Exception+ at today's Blocker/Exception meeting (June 18th).

--- Additional comment from errata-xmlrpc on 2020-06-18 17:40:38 UTC ---

This bug has been added to advisory RHBA-2020:53494 by Ivan Devat (idevat)

--- Additional comment from errata-xmlrpc on 2020-06-18 17:40:39 UTC ---

Bug report changed to ON_QA status by Errata System.
A QE request has been submitted for advisory RHBA-2020:53494-01
https://errata.devel.redhat.com/advisory/53494

--- Additional comment from Michal Mazourek on 2020-06-23 11:32:04 UTC ---

Result:
=======
As explained in comment 5, this patch alone will prevent only a subset of possible failures, but there is no more time to focus on this problem in RHEL 7, therefore this bug will be cloned into RHEL 8. The clone will probably depend on bz1420298, as mentioned in comment 5. This patch also clarifies the timeout message shown when deleting multiple resources. The problem with stuck resources after a multi-delete still occurs, but for the above reasons, marking as VERIFIED in pcs-0.9.169-3.el7.


Testing:
========

[root@virt-026 ~]# rpm -q pcs
pcs-0.9.169-3.el7.x86_64


1. Login to web UI: (e.g.: https://virt-026.cluster-qe.lab.eng.brq.redhat.com:2224)
2. Create a cluster and add it to web UI
3. Go to Resources page of the cluster
4. Add 30x ocf:heartbeat:Dummy (running, no fails; a CLI sketch for this step is shown below)
5. Select them all and try to remove them (without Enforce removal)

> Repeating points 4 and 5 several times, since the problem isn't 100% reproducible
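
A minimal CLI sketch for bulk-creating the resources in step 4, assuming pcs 0.9.x as used in this test; the names dummy-1 .. dummy-30 are arbitrary:

# Create 30 Dummy resources (they start cleanly and produce no failures)
for i in $(seq 1 30); do
    pcs resource create "dummy-${i}" ocf:heartbeat:Dummy
done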

# cases:
a) Resources got removed and no warning message appears
> OK

b) Warning message appears with changed text: 'Operation takes longer to complete than expected but it will continue in the background.' and the resources got removed.
> OK, the message is clearer now. Sometimes it takes a while (minutes) to remove the resources, but they are eventually removed.

c) Warning message appears and most of the resources got removed within minutes. A few resources are disabled, but not removed even after 20 minutes.
> Need to focus on this issue in the clone

# The case where no resource got deleted was not reproduced.


6. Adding these resources with random parameters (such as '123'), which will generate failures:
	ocf:heartbeat:aliyun-vpc-move-ip, 
	ocf:heartbeat:apache,
	ocf:heartbeat:aws-vpc-move-ip, 
	ocf:heartbeat:aws-vpc-route53, 
	ocf:heartbeat:awseip
7. Select them all and try to remove them (without Enforce removal)

> Repeating points 6 and 7 several times, since the problem isn't 100% reproducible

# cases:
a) Resources got removed and no warning message appears
> OK

b) Warning message appeared with changed text: 'Operation takes longer to complete than expected but it will continue in the background.' and the resources got deleted (sometimes after a few minutes)
> OK, this was the most frequent case

# The case where any resource got stuck was not reproduced.


8. Adding 10 random resources with random parameters
        ocf:heartbeat:aliyun-vpc-move-ip,
        ocf:heartbeat:apache,
        ocf:heartbeat:aws-vpc-move-ip,
        ocf:heartbeat:aws-vpc-route53,
        ocf:heartbeat:awseip
        ocf:heartbeat:awsvip
        ocf:heartbeat:azure-events
        ocf:heartbeat:azure-lb
        ocf:heartbeat:clvm
        ocf:heartbeat:conntrackd
9. Select them all and try to remove them (without Enforce removal)

# Case: all resources got disabled with the timeout message, but were not deleted (even after 2+ hours).
> Need to focus on this issue in the clone

Comment 3 RHEL Program Management 2021-12-23 07:27:00 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

