Bug 1843593
Summary: | [GUI] pcsd is sometimes unable to remove resources | |||
---|---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Nina Hostakova <nhostako> | |
Component: | pcs | Assignee: | Ondrej Mular <omular> | |
Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> | |
Severity: | unspecified | Docs Contact: | ||
Priority: | unspecified | |||
Version: | 7.9 | CC: | cfeist, cluster-maint, idevat, mlisik, mmazoure, mpospisi, omular, pvlasin, tojeline | |
Target Milestone: | rc | |||
Target Release: | --- | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | pcs-0.9.169-3.el7 | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 1849994 | Environment: | |
Last Closed: | 2020-09-29 20:10:26 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: |
Description
Nina Hostakova
2020-06-03 15:40:41 UTC
For reference, part of this was previously addressed in bz#1346852 and bz#1579911.

After many hours of trying to reproduce this issue, investigating possible causes, and reading through the logs provided by QE, we've observed that the issue is reproducible mainly under these conditions:
* machines are under heavy load
* removed resources have failed actions

We've been able to identify and fix a possible race condition when removing a resource with failed actions.

Upstream patch:
https://github.com/ClusterLabs/pcs/commit/b78e5b448e4a2dc2ad1e4df45515ceb01f6f103e

This patch alone prevents only a subset of possible failures. There is still a chance of a different race condition (reproducible on machines under heavy load), which we are unable to prevent without a complete overhaul of the `pcs resource delete` command, tracked in bz#1420298.

Result:
=======
As explained in comment 5, this patch alone prevents only a subset of possible failures, but there is no more time to focus on this problem in RHEL7, so this bug will be cloned into RHEL8. The clone will probably depend on bz#1420298, as mentioned in comment 5. This patch also clarifies the timeout message shown when deleting multiple resources. The problem with stuck resources after a multi-delete still occurs, but for the above reasons, marking as VERIFIED in pcs-0.9.169-3.el7.

Testing:
========
[root@virt-026 ~]# rpm -q pcs
pcs-0.9.169-3.el7.x86_64

1. Log in to the web UI (e.g. https://virt-026.cluster-qe.lab.eng.brq.redhat.com:2224)
2. Create a cluster and add it to the web UI
3. Go to the Resources page of the cluster
4. Add 30x ocf:heartbeat:Dummy (running, no failures)
5. Select them all and try to remove them (without Enforce removal)
> Repeated steps 4 and 5 several times, since the problem isn't 100% reproducible

# cases:
a) Resources got removed and no warning message appears
> OK
b) Warning message appears with changed text: 'Operation takes longer to complete than expected but it will continue in the background.' and the resources got removed.
> OK, the message is clearer now. Sometimes it takes some time (minutes) to remove the resources, but it eventually completes.
c) Warning message appears and most of the resources got removed within minutes. A few resources are disabled, but not removed after 20 minutes.
> Need to focus on this issue in the clone
# The case where no resource got deleted wasn't reproduced.

6. Add these resources with random parameters (such as '123'), which will generate failures:
ocf:heartbeat:aliyun-vpc-move-ip, ocf:heartbeat:apache, ocf:heartbeat:aws-vpc-move-ip, ocf:heartbeat:aws-vpc-route53, ocf:heartbeat:awseip
7. Select them all and try to remove them (without Enforce removal)
> Repeated steps 6 and 7 several times, since the problem isn't 100% reproducible

# cases:
a) Resources got removed and no warning message appears
> OK
b) Warning message appeared with changed text: 'Operation takes longer to complete than expected but it will continue in the background.' and the resources got deleted (sometimes after a few minutes)
> OK, this was the most frequent case
# The case where any resource got stuck wasn't reproduced.

8. Add 10 random resources with random parameters:
ocf:heartbeat:aliyun-vpc-move-ip, ocf:heartbeat:apache, ocf:heartbeat:aws-vpc-move-ip, ocf:heartbeat:aws-vpc-route53, ocf:heartbeat:awseip, ocf:heartbeat:awsvip, ocf:heartbeat:azure-events, ocf:heartbeat:azure-lb, ocf:heartbeat:clvm, ocf:heartbeat:conntrackd
9. Select them all and try to remove them (without Enforce removal)

# case:
All resources got disabled with the timeout message, but not deleted (after 2+ hours).
> Need to focus on this issue in the clone

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (pcs bug fix and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2020:3964
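
Steps 4 and 5 of the testing procedure were performed by hand in the web UI. The following is a minimal sketch of how the same create-then-remove cycle might be approximated from the command line for repeated runs. It is an assumption that the CLI exercises the same removal path as the GUI; the report does not confirm this. The resource ID prefix `dummy` and the helper names `build_commands` and `run` are hypothetical. The script only prints the commands unless `dry_run=False` is passed.

```python
import shlex
import subprocess


def build_commands(count=30, prefix="dummy"):
    """Build pcs command lines that create and then delete `count` Dummy resources.

    The IDs (dummy1, dummy2, ...) are illustrative names, not taken from the report.
    """
    ids = [f"{prefix}{i}" for i in range(1, count + 1)]
    create = [
        ["pcs", "resource", "create", rid, "ocf:heartbeat:Dummy"] for rid in ids
    ]
    # One delete per resource, without --force (the report tests removal
    # "without Enforce removal").
    delete = [["pcs", "resource", "delete", rid] for rid in ids]
    return create, delete


def run(commands, dry_run=True):
    """Print each command, or execute it on a cluster node when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(shlex.join(cmd))
        else:
            # check=False: keep going so that every resource is attempted
            # even if an individual command fails.
            subprocess.run(cmd, check=False)


if __name__ == "__main__":
    create_cmds, delete_cmds = build_commands()
    run(create_cmds)
    run(delete_cmds)
```

Because the failure is intermittent and load-dependent, a script like this would need to be run several times, ideally on loaded machines, and the cluster state checked afterwards for resources that remain disabled but not removed.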