Bug 1843593
| Summary: | [GUI] pcsd is sometimes unable to remove resources | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Nina Hostakova <nhostako> |
| Component: | pcs | Assignee: | Ondrej Mular <omular> |
| Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 7.9 | CC: | cfeist, cluster-maint, idevat, mlisik, mmazoure, mpospisi, omular, pvlasin, tojeline |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | pcs-0.9.169-3.el7 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1849994 | Environment: | |
| Last Closed: | 2020-09-29 20:10:26 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Just for reference, part of this was previously addressed in bz#1346852 and bz#1579911.

After hours and hours of trying to reproduce this issue, investigating possible causes and reading through logs provided by QE, we've observed that the issue is reproducible mainly under these conditions:
* machines are under heavy load
* removed resources have failed actions

We've been able to identify and fix a possible race condition when removing a resource with failed actions.

Upstream patch: https://github.com/ClusterLabs/pcs/commit/b78e5b448e4a2dc2ad1e4df45515ceb01f6f103e

This patch alone will prevent only a subset of possible failures. There is still a chance of a different race condition (reproducible on machines under heavy load), which we are unable to prevent without a complete overhaul of the `pcs resource delete` command, tracked in bz#1420298.

Result:
=======
As explained in comment 5, this patch alone will prevent only a subset of possible failures, but there is no more time to focus on this problem in RHEL7, therefore this bug will be cloned into RHEL8. The clone will probably depend on bz1420298, as mentioned in comment 5. This patch also clarifies the timeout message when deleting multiple resources. The problem with stuck resources after multi delete still occurs, but for the above reasons, marking as VERIFIED in pcs-0.9.169-3.el7.

Testing:
========
[root@virt-026 ~]# rpm -q pcs
pcs-0.9.169-3.el7.x86_64

1. Log in to web UI (e.g.: https://virt-026.cluster-qe.lab.eng.brq.redhat.com:2224)
2. Create a cluster and add it to web UI
3. Go to Resources page of the cluster
4. Add 30x ocf:heartbeat:Dummy (running, no fails)
5. Select them all and try to remove them (without Enforce removal)
> Repeating points 4 and 5 several times, since the problem isn't 100 % reproducible

# cases:
a) Resources got removed and no warning message appears
> OK
b) Warning message appears with changed text: 'Operation takes longer to complete than expected but it will continue in the background.' and the resources got removed.
> OK, the message is clearer now. Sometimes it takes some time (minutes) to remove the resources, but they are eventually removed.
c) Warning message appears and most of the resources got removed within minutes. A few resources are disabled, but not removed after 20 minutes.
> Need to focus on this issue in the clone

# The case when no resource got deleted wasn't reproduced.

6. Add these resources with random parameters (such as '123'), which will generate fails: ocf:heartbeat:aliyun-vpc-move-ip, ocf:heartbeat:apache, ocf:heartbeat:aws-vpc-move-ip, ocf:heartbeat:aws-vpc-route53, ocf:heartbeat:awseip
7. Select them all and try to remove them (without Enforce removal)
> Repeating points 6 and 7 several times, since the problem isn't 100 % reproducible

# cases:
a) Resources got removed and no warning message appears
> OK
b) Warning message appeared with changed text: 'Operation takes longer to complete than expected but it will continue in the background.' and the resources got deleted (sometimes after a few minutes)
> OK, this was the most frequent case

# The case when any resource got stuck wasn't reproduced.

8. Add 10 random resources with random parameters: ocf:heartbeat:aliyun-vpc-move-ip, ocf:heartbeat:apache, ocf:heartbeat:aws-vpc-move-ip, ocf:heartbeat:aws-vpc-route53, ocf:heartbeat:awseip, ocf:heartbeat:awsvip, ocf:heartbeat:azure-events, ocf:heartbeat:azure-lb, ocf:heartbeat:clvm, ocf:heartbeat:conntrackd
9. Select them all and try to remove them (without Enforce removal)

# case:
All resources got disabled with timeout message, but not deleted (after 2+ hours).
> Need to focus on this issue in the clone

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (pcs bug fix and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3964
Description of problem:
When trying to remove resources via web UI, the action may in some cases finish with this timeout message: 'Operation takes longer to complete than expected'. The resources might not be removed after all, but the behavior varies (see below).

Version-Release number of selected component (if applicable):
pcs-0.9.169-2.el7.x86_64

How reproducible:
Reproduced on virtual (more often) as well as physical machines, but not always.

Steps to Reproduce:
Although no steps reproduce the issue reliably every time, these were the most frequently problematic:
1.a) in web UI, add multiple resources of different kinds (with required parameters set to 123), e.g.: ocf:heartbeat:aliyun-vpc-move-ip, ocf:heartbeat:apache, ocf:heartbeat:aws-vpc-move-ip, ocf:heartbeat:aws-vpc-route53, ocf:heartbeat:awseip
1.b) try to remove one, multiple or all of them
2.a) add multiple resources of the same type, e.g.: 15x ocf:heartbeat:apache
2.b) try to remove all of them

Actual results:
We encountered these scenarios:
1. Resources are removed without any problem
2. Timeout message and the resources are removed
3. Timeout message and resources are removed after a longer time (~ minutes)
4. Timeout message and resources are not removed (in a relevant amount of time)
5. Timeout message and only a part of the resources is removed
In case the resources are not removed, they become disabled but are still present, also in CLI.

Expected results:
The resources should be removed when the action is taken. In case there is a problem (e.g. the timeout), it would be appropriate to display a message that some of the resources cannot be removed or that the removal will take additional time.
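
For anyone trying to reproduce this, the scenarios listed under "Actual results" could be told apart by polling the cluster after the removal is requested. Below is a minimal Python sketch of that idea. It is not part of pcs or pcsd: `wait_for_removal` and the `list_resource_ids` callable are hypothetical names, and how the current resource IDs are obtained (for example by wrapping a CLI call) is left to the caller and not shown here.

```python
import time
from typing import Callable, Iterable, Set, Tuple


def wait_for_removal(
    expected_removed: Iterable[str],
    list_resource_ids: Callable[[], Iterable[str]],  # hypothetical: supplied by the caller
    timeout_s: float = 60.0,
    poll_interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[str, Set[str]]:
    """Poll until none of the expected resources is listed, or the timeout expires.

    Returns (outcome, still_present), where outcome is one of
    "all removed", "partially removed" or "none removed".
    """
    pending = set(expected_removed)
    total = len(pending)
    deadline = clock() + timeout_s

    while True:
        # Keep only the resources that are still reported as present.
        pending &= set(list_resource_ids())
        if not pending or clock() >= deadline:
            break
        sleep(poll_interval_s)

    if not pending:
        outcome = "all removed"
    elif len(pending) < total:
        outcome = "partially removed"
    else:
        outcome = "none removed"
    return outcome, pending
```

With a suitable `list_resource_ids`, "all removed" corresponds to scenarios 1 to 3, "none removed" to scenario 4 and "partially removed" to scenario 5, within whatever timeout the tester considers relevant.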