Bug 1831775
Summary: pacemaker-fenced got stuck with 100% CPU inside stonith_choose_peer()
Product: Red Hat Enterprise Linux 8
Component: pacemaker
Version: 8.2
Reporter: Michele Baldessari <michele>
Assignee: Ken Gaillot <kgaillot>
QA Contact: pkomarov
CC: cluster-maint, jeckersb, pkomarov
Status: CLOSED ERRATA
Severity: medium
Priority: high
Target Milestone: rc
Target Release: 8.3
Hardware: Unspecified
OS: Unspecified
Fixed In Version: pacemaker-2.0.4-2.el8
Doc Type: Bug Fix
Doc Text:
Cause: Previously, Pacemaker assumed that a fencing topology level used to unfence a node would still exist when the unfencing result arrived.
Consequence: If a target node had a fencing topology configured with a single level containing a single device, and that level was removed from the configuration after an unfencing action had been initiated but before the result came back, the Pacemaker fencer daemon on the Designated Controller (DC) node could go into an infinite loop, consuming all CPU and not responding to any further requests.
Fix: Pacemaker now checks whether the fencing topology still exists when a result comes back.
Result: The result is recorded properly, and the fencer daemon continues to behave normally.
Last Closed: 2020-11-04 04:00:53 UTC
Type: Bug
Description
Michele Baldessari
2020-05-05 15:36:14 UTC
Which node had the issue, and do you know around what time?

It was controller-1 that had pacemaker-fenced at 100% CPU, and the fenced process was still stuck in the loop at sosreport collection time.

The fencing device for controller-0 was removed from the configuration while controller-0 was being unfenced. The fencer initiated the unfencing while the device still existed, but the device was gone when the result came back. It then tried to check whether there were any more devices in the topology, and ended up in the infinite loop in stonith_choose_peer().

May 05 13:25:05 controller-1.redhat.local pacemaker-based [62591] (cib_perform_op) info: ++ /cib/configuration/fencing-topology: <fencing-level devices="stonith-fence_ipmilan-5254002c4be9,stonith-fence_compute-fence-nova" index="1" target="controller-0" id="fl-controller-0-1"/>
...
May 05 13:25:39 controller-1.redhat.local pacemaker-fenced [62592] (handle_request) notice: Client pacemaker-controld.62596.af2c977b wants to fence (on) 'controller-0' with device '(any)'
...
May 05 13:25:39 controller-1.redhat.local pacemaker-fenced [62592] (can_fence_host_with_device) notice: stonith-fence_ipmilan-5254002c4be9 is eligible to fence (on) controller-0: static-list
...
May 05 13:25:49 controller-1.redhat.local pacemaker-based [62591] (cib_perform_op) info: -- /cib/configuration/fencing-topology/fencing-level[@id='fl-controller-0-1']
...
May 05 13:25:50 controller-1.redhat.local pacemaker-fenced [62592] (call_remote_stonith) notice: Requesting that database-2 perform 'on' action targeting controller-0 using 'stonith-fence_ipmilan-5254002c4be9' | for client pacemaker-controld.62596 (72s)
...
May 05 13:25:50 controller-1.redhat.local pacemaker-fenced [62592] (process_remote_stonith_exec) notice: Action 'on' targeting controller-0 using stonith-fence_ipmilan-5254002c4be9 on behalf of pacemaker-controld.62596@controller-1: OK | rc=0

That was the final message from the fencer.
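To check collected logs for this symptom, a small helper like the following (the name `last_fencer_lines` and the approach are my own, not part of pacemaker or the report) prints the last few fencer messages from a saved log file; in the stuck case the output ends with a successful 'on' result and no later fencer activity.

```shell
# last_fencer_lines: print the last N pacemaker-fenced messages from a
# saved log file (e.g. a journal dump from an sosreport). Hypothetical
# diagnostic helper; in the stuck case the final message is the
# successful 'on' (unfence) result shown above, with nothing after it.
last_fencer_lines() {
    file=$1
    n=${2:-5}
    grep 'pacemaker-fenced' "$file" | tail -n "$n"
}
```

Combine with a check that pacemaker-fenced is consuming CPU (e.g. `ps -C pacemaker-fenced -o pid,pcpu`) to confirm the node is affected.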
The workaround until a fix is ready is to not remove fence devices while they are in use.

I am able to reproduce this issue on RHEL 8.2. Interestingly, I am not able to reproduce it on RHEL 7, even with the same pacemaker code base, so it may depend on some external factor such as the glib or corosync version. The issue only occurs if the target node has a single fencing topology level containing a single fencing device, and the level is removed from the configuration after an unfencing action has been requested and before it has completed.

QA: The easiest reproducer is:

1. Configure a cluster, and choose one node to be the target of the test (referred to as $TARGET below).

2. On every node, make fence_dummy available for use:

   /usr/libexec/pacemaker/cts-support install

3. Create a fencing topology for the target consisting of a dummy fence device with a delay (the delay gives time to remove the level while an operation is in progress; the target should not have any other fencing configured):

   pcs stonith create target-fencing fence_dummy mode=pass delay=4 mock_dynamic_hosts=$TARGET meta provides=unfencing
   pcs stonith level add 1 $TARGET target-fencing

4. Changing the dummy device's parameters is the easiest way to trigger unfencing. It appears only the node running the device will be unfenced (which is likely a separate, unrelated bug), so ensure the device runs on the target:

   pcs constraint location target-fencing prefers $TARGET=100

5. Trigger unfencing by changing the delay parameter (here, from 4 to 3, but any change will do), then remove the topology level:

   pcs stonith update target-fencing delay=3 && sleep 1 && pcs stonith level remove 1 $TARGET

Michele, are you OK with 8.3 for the fix, or do you want an 8.2.z-stream as well?

I think we should be fine with RHEL 8.3. Once we have the crm_node fix, Puppet won't go around changing stonith resources and/or levels.
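For convenience, the reproducer steps can be consolidated into one script. This is a sketch under the same assumptions as the steps (a working cluster with pcs, and the cts-support dummy agent installed on every node); the function name `reproduce` and the dry-run mechanism are my own additions, and by default the script only echoes each command so it can be reviewed before being run against a real cluster.

```shell
#!/bin/sh
# Hypothetical wrapper around the QA reproducer steps; not from pacemaker.
# Assumes a working Pacemaker cluster and that the fence_dummy agent has
# been installed via /usr/libexec/pacemaker/cts-support on every node.
# By default each command is only echoed (dry run); pass "" as the second
# argument to actually execute the commands.
reproduce() {
    target=$1
    run=${2-echo}    # ${2-echo}: an explicitly empty 2nd arg stays empty

    # Step 2: make the dummy fence agent available (repeat on every node)
    $run /usr/libexec/pacemaker/cts-support install

    # Step 3: a single topology level with a single delayed dummy device
    $run pcs stonith create target-fencing fence_dummy mode=pass delay=4 \
        mock_dynamic_hosts="$target" meta provides=unfencing
    $run pcs stonith level add 1 "$target" target-fencing

    # Step 4: keep the device on the target so the target gets unfenced
    $run pcs constraint location target-fencing prefers "$target"=100

    # Step 5: trigger unfencing, then remove the level mid-operation
    $run pcs stonith update target-fencing delay=3
    sleep 1
    $run pcs stonith level remove 1 "$target"
}

reproduce controller-0          # dry run: prints the commands
# reproduce controller-0 ""     # real run (careful: triggers the bug)
```

On an affected build, the real run should leave pacemaker-fenced on the DC spinning at 100% CPU, as described above.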
Thanks,
Michele

Fixed upstream by commit 6d15ee56

Verified: the new package passes the basic HA regression test.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:4804