Bug 1843593

Summary: [GUI] pcsd is sometimes unable to remove resources
Product: Red Hat Enterprise Linux 7 Reporter: Nina Hostakova <nhostako>
Component: pcs Assignee: Ondrej Mular <omular>
Status: CLOSED ERRATA QA Contact: cluster-qe <cluster-qe>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 7.9 CC: cfeist, cluster-maint, idevat, mlisik, mmazoure, mpospisi, omular, pvlasin, tojeline
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: pcs-0.9.169-3.el7 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1849994 (view as bug list) Environment:
Last Closed: 2020-09-29 20:10:26 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Nina Hostakova 2020-06-03 15:40:41 UTC
Description of problem:
When removing resources via the web UI, the action sometimes finishes with this timeout message: 'Operation takes longer to complete than expected'. The resources may or may not actually be removed; the behavior varies (see below).

Version-Release number of selected component (if applicable):
pcs-0.9.169-2.el7.x86_64

How reproducible:
Reproduced on virtual (more often) as well as physical machines, but not always. 

Steps to Reproduce:
Although there are no steps that reproduce the issue reliably, these were the most frequently problematic:

1.a) in the web UI, add multiple resources of different kinds (with required parameters set to 123),
e.g.: ocf:heartbeat:aliyun-vpc-move-ip,
      ocf:heartbeat:apache,
      ocf:heartbeat:aws-vpc-move-ip,
      ocf:heartbeat:aws-vpc-route53,
      ocf:heartbeat:awseip

1.b) try to remove one, several, or all of them

2.a) add multiple resources of the same type 
e.g: 15x ocf:heartbeat:apache

2.b) try to remove all of them
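
For anyone preparing step 2.a without clicking through the web UI, the following is a minimal sketch that prints the `pcs resource create` commands for a batch of resources of one type. The report itself created the resources in the web UI, so creating them from the CLI is an assumption, and the helper name `gen_create_cmds` and the `test-N` resource IDs are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of pcs): print the commands that would
# create COUNT resources of the given agent, named test-1 .. test-COUNT.
# Nothing is executed here; pipe the output to `sh` on a cluster node
# to actually create the resources.
gen_create_cmds() {
    agent=$1
    count=$2
    i=1
    while [ "$i" -le "$count" ]; do
        printf 'pcs resource create test-%s %s\n' "$i" "$agent"
        i=$((i + 1))
    done
}

# Step 2.a from this report: 15x ocf:heartbeat:apache
gen_create_cmds ocf:heartbeat:apache 15
```

Printing the commands instead of running them keeps the sketch safe to try on a machine without a cluster; the removal itself should still be done in the web UI, since that is where the timeout was observed.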

Actual results:
We encountered these scenarios:
1. Resources are removed without any problem
2. Timeout message and the resources are removed
3. Timeout message and resources are removed only after a longer time (minutes)
4. Timeout message and resources are not removed (within a reasonable amount of time)
5. Timeout message and only some of the resources are removed

When the resources are not removed, they become disabled but remain present; this is also visible in the CLI.
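
To see from the CLI which resources survived a removal attempt, a sketch like the following can help. The function `list_remaining` is hypothetical; it takes the check command as its first argument so that it does not hard-code any pcs behavior. The commented usage assumes that `pcs resource show <id>` exits non-zero for a resource that no longer exists, which should be verified on the installed pcs version.

```shell
#!/bin/sh
# Hypothetical helper: print every ID for which the check command succeeds,
# i.e. every resource that is still present.
# Usage: list_remaining "CHECK COMMAND" ID...
list_remaining() {
    check=$1
    shift
    for id in "$@"; do
        # $check is intentionally unquoted so "pcs resource show" splits
        # into a command and its arguments.
        if $check "$id" >/dev/null 2>&1; then
            printf '%s\n' "$id"
        fi
    done
}

# Example on a cluster node (assumption: non-zero exit for unknown IDs):
#   list_remaining "pcs resource show" test-1 test-2 test-3
```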

Expected results:
The resources should be removed when the action is taken. If there is a problem (e.g. the timeout), it would be appropriate to display a message saying that some of the resources cannot be removed or that the removal will take additional time.

Comment 2 Ondrej Mular 2020-06-10 13:30:48 UTC
Just for reference, part of this was previously addressed in bz#1346852 and bz#1579911.

Comment 5 Ondrej Mular 2020-06-16 12:41:47 UTC
After hours and hours of trying to reproduce this issue, investigating possible causes, and reading through logs provided by QE, we've observed that the issue is reproducible mainly under these conditions:
 * machines are under heavy load
 * removed resources have failed actions

We've been able to identify and fix a possible race-condition when removing a resource with failed actions. Upstream patch: https://github.com/ClusterLabs/pcs/commit/b78e5b448e4a2dc2ad1e4df45515ceb01f6f103e

This patch alone will prevent only a subset of the possible failures. There is still a chance of a different race condition (reproducible on machines under heavy load), which we are unable to prevent without a complete overhaul of the `pcs resource delete` command, tracked in bz#1420298.

Comment 9 Ondrej Mular 2020-06-18 09:09:50 UTC
Upstream patch linked in comment#5

Comment 13 Michal Mazourek 2020-06-23 11:32:04 UTC
Result:
=======
As explained in comment 5, this patch alone prevents only a subset of the possible failures, but there is no more time to focus on this problem in RHEL 7, so this bug will be cloned for RHEL 8. The clone will probably depend on bz1420298, as mentioned in comment 5. This patch also clarifies the timeout message shown when deleting multiple resources. The problem with stuck resources after a multi-delete still occurs, but for the above reasons, marking as VERIFIED in pcs-0.9.169-3.el7.


Testing:
========

[root@virt-026 ~]# rpm -q pcs
pcs-0.9.169-3.el7.x86_64


1. Login to web UI: (e.g.: https://virt-026.cluster-qe.lab.eng.brq.redhat.com:2224)
2. Create a cluster and add it to web UI
3. Go to Resources page of the cluster
4. Add 30x ocf:heartbeat:Dummy (running, no fails)
5. Select them all and try to remove them (without Enforce removal)

> Repeat steps 4 and 5 several times, since the problem isn't 100 % reproducible

# cases:
a) Resources got removed and no warning message appears
> OK

b) Warning message appears with changed text: 'Operation takes longer to complete than expected but it will continue in the background.' and the resources got removed.
> OK, the message is clearer now. Sometimes it takes some time (minutes) to remove the resources, but they are eventually removed

c) Warning message appears and most of the resources got removed within minutes. A few resources are disabled, but not removed after 20 minutes.
> Need to focus on this issue in the clone

# The case where no resource got deleted wasn't reproduced.


6. Add these resources with random parameters (such as '123'), which will generate failures
	ocf:heartbeat:aliyun-vpc-move-ip, 
	ocf:heartbeat:apache,
	ocf:heartbeat:aws-vpc-move-ip, 
	ocf:heartbeat:aws-vpc-route53, 
	ocf:heartbeat:awseip
7. Select them all and try to remove them (without Enforce removal)

> Repeat steps 6 and 7 several times, since the problem isn't 100 % reproducible

# cases:
a) Resources got removed and no warning message appears
> OK

b) Warning message appeared with changed text: 'Operation takes longer to complete than expected but it will continue in the background.' and the resources got deleted (sometimes after a few minutes)
> OK, this was the most frequent case

# The case where any resource got stuck wasn't reproduced


8. Add 10 random resources with random parameters
	ocf:heartbeat:aliyun-vpc-move-ip,
	ocf:heartbeat:apache,
	ocf:heartbeat:aws-vpc-move-ip,
	ocf:heartbeat:aws-vpc-route53,
	ocf:heartbeat:awseip,
	ocf:heartbeat:awsvip,
	ocf:heartbeat:azure-events,
	ocf:heartbeat:azure-lb,
	ocf:heartbeat:clvm,
	ocf:heartbeat:conntrackd
9. Select them all and try to remove them (without Enforce removal)

# case: All resources got disabled with the timeout message, but were not deleted (after 2+ hours).
> Need to focus on this issue in the clone
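
Checks such as "not removed after 20 minutes" or "after 2+ hours" can be made repeatable with a small polling loop. The following is a rough sketch, not the procedure actually used in this verification: `wait_gone` is a hypothetical helper that polls a caller-supplied check command until it fails (resource gone) or a time limit is reached. The commented usage again assumes that `pcs resource show <id>` exits non-zero for a removed resource.

```shell
#!/bin/sh
# Hypothetical helper: poll until the check command fails or LIMIT seconds
# have passed. Returns 0 if the resource disappeared, 1 on timeout.
# Usage: wait_gone LIMIT_SECONDS STEP_SECONDS CHECK_CMD [ARGS...]
wait_gone() {
    limit=$1
    step=$2
    shift 2
    elapsed=0
    # The loop runs while the check still succeeds (resource present).
    while "$@" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$limit" ]; then
            return 1
        fi
        sleep "$step"
        elapsed=$((elapsed + step))
    done
    return 0
}

# Example on a cluster node: wait up to 20 minutes, polling every 10 seconds.
#   wait_gone 1200 10 pcs resource show test-1 \
#       || echo "test-1 still present after 20 minutes"
```

The elapsed time counts only the sleep intervals, so on a heavily loaded machine the real wait can be somewhat longer than the limit.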

Comment 15 errata-xmlrpc 2020-09-29 20:10:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (pcs bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3964