Bug 1787749
Summary: Pending self-fence actions remain after hard reset of the DC node [RHEL 7]
Product: Red Hat Enterprise Linux 7 | Reporter: Reid Wahl <nwahl>
Component: pacemaker | Assignee: Ken Gaillot <kgaillot>
Status: CLOSED ERRATA | QA Contact: cluster-qe <cluster-qe>
Severity: medium | Docs Contact:
Priority: high
Version: 7.7 | CC: cluster-maint, hbiswas, kgaillot, mnovacek, obenes, phagara, sanyadav
Target Milestone: rc
Target Release: 7.9
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: pacemaker-1.1.23-1.el7 | Doc Type: Bug Fix
Doc Text:
Cause: If the cluster node acting as DC scheduled its own fencing, it tracked two copies of the fencing action: one of its own creation, and another resulting from the executing node's fence device query. The fencing result would complete only one of them.
Consequence: If a fencing action against the DC failed, it got "stuck" in status displays as a pending action and could not be cleaned up.
Fix: Pacemaker now identifies the two operations as the same and tracks just one copy.
Result: Fencing actions against the DC no longer remain listed as pending in status once they complete, even if they fail.
Story Points: ---
Clone Of:
Cloned to: 1787751 (view as bug list) | Environment:
Last Closed: 2020-09-29 20:03:57 UTC | Type: Bug
Regression: --- | Mount Type: ---
Documentation: --- | CRM:
Verified Versions: | Category: ---
oVirt Team: --- | RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- | Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1787751, 1870663
Description    Reid Wahl    2020-01-04 23:42:25 UTC
The pending action list in the example above shows two pending actions from the other node, so this might not require a self-fencing action to trigger the behavior. Not sure what all conditions can trigger it.

This isn't just cosmetic: there is no way to manually clear pending actions - with good reason, though, as cleaning pending actions in general influences behaviour and not just history logging. Given that history recording is limited to 500 entries, pending actions piling up over time can lead to successes/failures not being recorded at all anymore.

qa_ack+, reproducer in description

re-qacking, as fix is expected to be merged upstream soon

Fixed upstream by commit df71a07 in the 2.0 branch (which will be in RHEL 8.3 via Bug 1787751), backported to the 1.1 branch as commit cae1b8d (which will be in RHEL 7.9).

Unable to reproduce the issue on either 7.7 with pacemaker-1.1.20-5.el7 or 7.8 with pacemaker-1.1.21-4.el7. The pending fence operation always gets automatically resolved by another (successful) fence attempt. I've tried multiple reproducer variations I could think of (sysrq-halting the DC node, killing stonithd, letting the fence op time out, multi-level fencing, ...), always with the same result.

@Ken: Please let me know if you think of some other way to reproduce this issue. I couldn't see anything obvious in the customer cases. For now, I've verified the fix is included in 7.9's pacemaker-1.1.23-1.el7 package and also tried (unsuccessfully) to reproduce the issue on this fixed version. Marking SanityOnly verified.

@Patrik: I believe this will reproduce it:

* Copy /usr/share/pacemaker/tests/cts/fence_dummy to /usr/sbin on all nodes
* Configure a device using fence_dummy with mode=fail and delay=10 (it will always fail), with pcmk_host_list limited to a single target node
* Configure the target node to have a fencing topology consisting of a single level, with the dummy device as the first device and the real fencing device as the second
* Ensure the target node becomes DC (e.g. by rebooting all other nodes)
* Cause the target node to require fencing while still alive (e.g. by configuring a dummy resource with on-fail="fence" for the monitor, constrained to that node, and causing the monitor to fail)
* Wait to see the pending fencing in pcs status (commands for watching this are sketched below)
* Hard-reset the node
* Reconfigure the dummy fence device with mode=pass, and the node should be fenced successfully after that
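For watching that, something along these lines should work from one of the surviving nodes (a rough sketch; $DC stands for the target/DC node name, and stonith_admin's --history option is standard pacemaker CLI that isn't otherwise used in this bug):

   # Status snapshot; the stuck entry shows up under the pending/failed
   # fencing actions section
   pcs status

   # The fencer's record of completed and in-flight fencing actions for $DC
   stonith_admin --history $DC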
The reproducer in Comment 12 is incorrect. I can't remember how to reproduce an increasing list of "stuck" pending actions, but I am now able to reproduce a single "stuck" pending action. The key elements of the situation are:

* There is a fencing topology for the DC node
* Fencing is required for the DC (while the node remains DC, which means it has not left the cluster)
* While the fencing is pending, the DC node leaves the cluster (but does *not* reboot or restart cluster services)
* While the DC node is out of the cluster, the pending fencing fails
* The DC node rejoins the cluster

The reproducer I used was:

1. Configure a cluster of at least two nodes with a real fencing device named "fencing-real".

2. Copy /usr/share/pacemaker/tests/cts/fence_dummy to /usr/sbin on all nodes.

3. Using the current DC's node name instead of $DC:

   pcs stonith create fencing-dummy fence_dummy mode=fail delay=10 pcmk_host_list=$DC
   pcs stonith level add 1 $DC fencing-dummy fencing-real
   pcs resource create resource-dummy ocf:pacemaker:Dummy op monitor interval=10s on-fail=fence
   pcs constraint location resource-dummy prefers $DC

4. Have "crm_mon" running on some node other than $DC, and wait until all resources are started.

5. On $DC, make the dummy resource fail, which will trigger fencing:

   rm /run/Dummy-resource-dummy.state

6. Watch crm_mon and wait until it shows that the resource is failed and fencing of the DC is pending.

7. While the fencing is still pending, block corosync on $DC to make the node leave the cluster:

   firewall-cmd --direct --add-rule ipv4 filter OUTPUT 2 -p udp --dport=5405 -j DROP

8. Watch crm_mon and wait until it shows that the DC fencing failed, and a new fencing operation is pending.

9. On $DC, unblock corosync so the node rejoins the cluster:

   firewall-cmd --direct --remove-rule ipv4 filter OUTPUT 2 -p udp --dport=5405 -j DROP

10. The old DC will be re-elected DC. Cluster status will look different on $DC and on any other node. $DC will reschedule fencing of itself, but it will continue to fail at this point.

11. Change the dummy fencing device to succeed:

   pcs stonith update fencing-dummy mode=pass

12. Watch crm_mon and $DC, and wait (potentially a long time) until $DC reboots and rejoins the cluster (you may have to manually start the node after it is fenced, and start cluster services, depending on how you configured everything).

With the old code, the fencing will still be listed as pending even though it successfully completed (and even if you run "pcs stonith history cleanup" to erase the history); with the new code, no pending fencing will be shown.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:3951
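Once the erratum is applied, the quickest check for whether a given cluster already carries the fix is the installed package version; the fixed build for RHEL 7 is pacemaker-1.1.23-1.el7 (see "Fixed In Version" above). Roughly, with standard RHEL tooling (rpm and yum are assumed here, not taken from this bug):

   # Anything at or above pacemaker-1.1.23-1.el7 contains the fix
   rpm -q pacemaker

   # Update to the fixed build once the RHEL 7.9 repositories are available
   yum update pacemaker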