Bug 1665343
Summary: | Regression: Cleanup when operation in progress delays cluster response | |
---|---|---|---
Product: | Red Hat Enterprise Linux 7 | Reporter: | Ken Gaillot <kgaillot>
Component: | pacemaker | Assignee: | Ken Gaillot <kgaillot>
Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe>
Severity: | urgent | Priority: | urgent
Version: | 7.6 | Target Release: | 7.7
Target Milestone: | rc | Keywords: | Regression, ZStream
Hardware: | Unspecified | OS: | Unspecified
Fixed In Version: | pacemaker-1.1.20-1.el7 | Doc Type: | Bug Fix
CC: | abeekhof, aherr, cfeist, cluster-maint, mnovacek, phagara, pkomarov | Type: | Bug
Cloned To: | 1665816 (view as bug list) | Bug Blocks: | 1665816
Last Closed: | 2019-08-06 12:54:19 UTC | |

Doc Text:

> Previously, the originator of an operation was not informed of its completion when the resource was cleaned up, refreshed, or removed from the configuration while that operation was still in progress. As a consequence, the cluster was blocked from responding to new events for a couple of minutes, until the operation was considered timed out. This sometimes led to fencing, for example when the in-progress operation was a stop. With this update, the originator is notified of the result directly, and the cluster is able to respond to new events.
Description (Ken Gaillot, 2019-01-11 05:36:09 UTC):
Fixed upstream by commit https://github.com/ClusterLabs/pacemaker/commit/a48d73f23c6800ae51522c1a91d0f0e1eb967078

---

Unable to reproduce the issue on 7.7-20190619.n.0 compose with pacemaker packages downgraded to 1.1.19-8.el7-c3c624ea3d.

Reproduction steps taken:

* 3-node virt cluster
* one ocf:pacemaker:Dummy resource named "dummy" with op_sleep=10
* /run/Dummy-dummy.state resource state file removed on the node where it's running
* while the dummy resource is in "FAILED" or "Started (Monitoring)" state, run "pcs resource cleanup dummy"
* grepping /var/log/cluster/corosync.log for "Timer popped" and "Triggered assert" returns no matches, dummy resource is recovered successfully and no fencing occurs

[A consolidated shell sketch of these steps appears after the attachments below.]

Ken, could you please confirm this version is supposed to be affected and/or provide a working reproducer?

---

(In reply to Patrik Hagara from comment #8)
> Unable to reproduce the issue on 7.7-20190619.n.0 compose with pacemaker
> packages downgraded to 1.1.19-8.el7-c3c624ea3d.
>
> Reproduction steps taken:
> * 3-node virt cluster
> * one ocf:pacemaker:Dummy resource named "dummy" with op_sleep=10
> * /run/Dummy-dummy.state resource state file removed on the node where
> it's running
> * while the dummy resource is in "FAILED" or "Started (Monitoring)" state,
> run "pcs resource cleanup dummy"

The issue wouldn't show up in the "Started (Monitoring)" state, only "FAILED" or "Starting". It might be possible that cleaning up early enough in the "FAILED" state might not trigger the issue, so cleaning up in "Starting" would be ideal.

> * grepping /var/log/cluster/corosync.log for "Timer popped" and "Triggered
> assert" returns no matches, dummy resource is recovered successfully and no
> fencing occurs

Those messages should show up in the DC's logs, which might be different from the node running the resource.

> Ken, could you please confirm this version is supposed to be affected and/or
> provide a working reproducer?

This BZ was z-streamed back to 7.6 (Bug 1665816) in pacemaker-1.1.19-8.el7_6.4, and QA verified there that the issue was present in 1.1.19-8.el7_6.2. If the above suggestions don't clear it up, something odd is going on.

---

(In reply to Ken Gaillot from comment #9)
> The issue wouldn't show up in the "Started (Monitoring)" state, only
> "FAILED" or "Starting". It might be possible that cleaning up early enough
> in the "FAILED" state might not trigger the issue, so cleaning up in
> "Starting" would be ideal.

I've tried running cleanup at different times in the recovery process with no luck. All I saw was "FAILED" and "Started (Monitoring)" states, no "Starting" (except when the dummy resource was starting for the first time after creation).

> Those messages should show up in the DC's logs, which might be different
> from the node running the resource.

No node had any of these lines in the corosync.log.

> This BZ was z-streamed back to 7.6 (Bug 1665816) in
> pacemaker-1.1.19-8.el7_6.4, and QA verified there that the issue was present
> in 1.1.19-8.el7_6.2. If the above suggestions don't clear it up, something
> odd is going on.

Attaching a pcs cluster report and corosync.log files. I can try to repro using the pre-fix 7.6.z packages, unless you have a better suggestion based on the report.

Created attachment 1586619 [details]
pcs cluster report for failed reproducer
Created attachment 1586620 [details]
corosync.log files for failed reproducer
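
For reference, the reproduction attempt above boils down to roughly the following command sequence (a sketch only, assuming an already-running 3-node cluster; the resource name, op_sleep value, state-file path, and log patterns are taken from the comments above, and the exact cleanup timing is the variable under discussion):

    # Create a Dummy resource whose actions take 10 seconds (per comment #8).
    pcs resource create dummy ocf:pacemaker:Dummy op_sleep=10

    # On the node currently running dummy, simulate a resource failure
    # by deleting the agent's state file.
    rm -f /run/Dummy-dummy.state*

    # While the resource is shown as FAILED (or, per comment #9, ideally
    # "Starting"), clean it up while an operation is still in progress.
    pcs resource cleanup dummy

    # On every node (the DC's log is the relevant one, per comment #9),
    # check for the bug's signature; no matches means the bug was not hit.
    grep -iE "(Timer popped)|(Triggered assert)" /var/log/cluster/corosync.log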
---

I think you may just be missing the 10-second window ... try bumping op_sleep to 20. "FAILED" is the right state for cleanup.

---

Still unable to reproduce the issue using the unfixed packages. Martin Juricek <mjuricek>, who verified the 7.6 Z-stream BZ, also tried without success.

Could it be that this issue was triggerable only in 7.6 Z-stream packages (starting with 1.1.19-8.el7_6.2 and then fixed in pacemaker-1.1.19-8.el7_6.4 as per the BZ clone)? I'll check that it's still not reproducible in this "fixed" package and mark this bug as verified, as we're already way past the 7.7 deadline.

---

After fix (1.1.20-5.el7):
=========================

> [root@virt-038 ~]# rpm -q pacemaker
> pacemaker-1.1.20-5.el7.x86_64

> [root@virt-038 ~]# pcs resource create dummy ocf:pacemaker:Dummy op_sleep=10

Wait for dummy to be started:

> [root@virt-038 ~]# pcs status
> Cluster name: STSRHTS540
> Stack: corosync
> Current DC: virt-038 (version 1.1.20-5.el7-3c4c782f70) - partition with quorum
> Last updated: Tue Jul 9 16:31:09 2019
> Last change: Tue Jul 9 16:20:01 2019 by root via cibadmin on virt-038
>
> 3 nodes configured
> 4 resources configured
>
> Online: [ virt-038 virt-040 virt-061 ]
>
> Full list of resources:
>
> fence-virt-038 (stonith:fence_xvm): Started virt-038
> fence-virt-040 (stonith:fence_xvm): Started virt-040
> fence-virt-061 (stonith:fence_xvm): Started virt-061
> dummy (ocf::pacemaker:Dummy): Started virt-038
>
> Daemon Status:
> corosync: active/disabled
> pacemaker: active/disabled
> pcsd: active/enabled

On the node where dummy is started, delete the resource state files:

> [root@virt-038 ~]# rm -f /run/Dummy-dummy.state*

Keep querying `pcs status` until dummy transitions to the FAILED state. Note the timestamp of the failure (last_rc_change of the failed monitor action). About 5 seconds after the failure timestamp, run cleanup of the dummy resource (a polling sketch for this step appears at the end of this report):

> [root@virt-038 ~]# pcs resource cleanup dummy
> Cleaned up dummy on virt-061
> Cleaned up dummy on virt-040
> Cleaned up dummy on virt-038
> Waiting for 1 reply from the CRMd. OK

`pcs status` will now show the dummy resource as "stopped" for up to 10 seconds, then "starting" for 10 seconds, and "started (monitoring)" for another 10 seconds. Afterwards the resource will be successfully recovered.

Verify that the bug was not hit (should return no matches):

> [root@virt-038 ~]# for node in virt-{038,040,061}; do ssh $node 'grep -iE "(Timer popped)|(Triggered assert)" /var/log/cluster/corosync.log'; done

Marking as verified in 1.1.20-5.el7.

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2129
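
As a footnote to the verification walkthrough above, the timing-sensitive middle step (run cleanup roughly 5 seconds after the monitor failure) could be scripted along these lines (a hypothetical sketch; the grep pattern for the FAILED resource line is an assumption about `pcs status` output, not taken verbatim from this report):

    # Hypothetical helper: poll until pcs reports the dummy resource as
    # FAILED, wait ~5 seconds, then run the cleanup mid-recovery.
    until pcs status | grep -q 'dummy.*FAILED'; do
        sleep 1
    done
    sleep 5
    pcs resource cleanup dummy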