Bug 1708378
| Field | Value |
| --- | --- |
| Summary | pacemaker-fenced dying while a fence-action is pending leaves behind pending actions |
| Product | Red Hat Enterprise Linux 8 |
| Component | pacemaker |
| Version | 8.0 |
| Status | CLOSED ERRATA |
| Severity | low |
| Priority | high |
| Reporter | Klaus Wenninger <kwenning> |
| Assignee | Klaus Wenninger <kwenning> |
| QA Contact | cluster-qe <cluster-qe> |
| CC | abeekhof, cluster-maint, kgaillot, phagara |
| Target Milestone | pre-dev-freeze |
| Target Release | 8.1 |
| Hardware | Unspecified |
| OS | Unspecified |
| Fixed In Version | pacemaker-2.0.2-2.el8 |
| Clones | 1708380 |
| Type | Bug |
| Last Closed | 2019-11-05 20:57:48 UTC |
Description (Klaus Wenninger, 2019-05-09 17:59:59 UTC)
qa-ack+; reproducer in bug description.

A minimally intrusive solution: whenever a fencer instance is synced a pending fence action that it is supposed to own (e.g. after pacemaker-fenced was restarted following a segfault), make that pending fence action fail. Before the fix, these seemingly hanging pending fence actions (not really hanging, but that is how they look in the history) could accumulate in a cluster over time, being synced back and forth between the nodes. With the fix described above, even these historical entries disappear once pacemaker on all cluster nodes has been updated to the fixed version (in fact, it is enough to update the nodes that originated the entries). Thus there should be no need to extend stonith_admin to allow erasing pending actions (--force comes to mind; potentially dangerous, since the history list is not just history: for pending actions it is essential to fencing behaviour). This has the additional advantage that no support in pcs is needed.

environment: 3-node cluster with properly configured fencing

before (2.0.1-5.el8)
====================

> [root@virt-030 ~]# crm_mon -1
> Stack: corosync
> Current DC: virt-031 (version 2.0.1-5.el8-0eb7991564) - partition with quorum
> Last updated: Wed Aug 21 14:14:30 2019
> Last change: Wed Aug 21 13:46:31 2019 by root via cibadmin on virt-030
>
> 3 nodes configured
> 3 resources configured
>
> Online: [ virt-030 virt-031 virt-032 ]
>
> Active resources:
>
> fence-virt-030 (stonith:fence_xvm): Started virt-030
> fence-virt-031 (stonith:fence_xvm): Started virt-031
> fence-virt-032 (stonith:fence_xvm): Started virt-032

Add a delay to one node's fencing resource:

> [root@virt-030 ~]# pcs stonith update fence-virt-032 delay=20

Request a reboot of the node with delayed fencing:

> [root@virt-030 ~]# stonith_admin --reboot virt-032

Wait ~10 seconds, then kill both pacemaker-fenced and the actual fence agent (fence_xvm here) on the DC (virt-031):

> [root@virt-031 ~]# killall -9 pacemaker-fenced fence_xvm

Wait for the fenced node to rejoin the cluster (watch crm_mon on both remaining nodes); note that the stonith_admin command timed out:

> [root@virt-030 ~]# stonith_admin --reboot virt-032
> [root@virt-030 ~]# echo $?
> 124

The DC node starts with an empty fencing history when pacemaker-fenced is restarted on it; after a minute or two the history is synced from the other node:

> [root@virt-031 ~]# crm_mon -m3 -1
> Stack: corosync
> Current DC: virt-031 (version 2.0.1-5.el8-0eb7991564) - partition with quorum
> Last updated: Wed Aug 21 14:37:47 2019
> Last change: Wed Aug 21 14:34:08 2019 by hacluster via crmd on virt-031
>
> 3 nodes configured
> 3 resources configured
>
> Online: [ virt-030 virt-031 virt-032 ]
>
> Active resources:
>
> fence-virt-030 (stonith:fence_xvm): Started virt-030
> fence-virt-031 (stonith:fence_xvm): Started virt-031
> fence-virt-032 (stonith:fence_xvm): Started virt-032
>
> Failed Resource Actions:
> * fence-virt-031_monitor_60000 on virt-031 'unknown error' (1): call=37, status=Error, exitreason=''
>
> Fencing History:
> * reboot of virt-032 pending: client=stonith_admin.18076, origin=virt-031
> * reboot of virt-032 successful: delegate=virt-030, client=pacemaker-controld.15586, origin=virt-031,
>   completed='Wed Aug 21 14:36:10 2019'

Result: a "pending" stonith action is stuck in the fencing history, even though the action was delegated to and successfully performed by another node after pacemaker-fenced was killed on the DC while fencing was in progress.
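(Editor's aside, not part of the original report: the stuck entry can also be listed directly from the fencer. This assumes the stonith_admin options of pacemaker 2.0.x; the exact output format may differ between versions.)

  # query fencing history for all targets ('*'); the orphaned action is the
  # one still listed as "pending" even after the reboot actually completed
  stonith_admin --history '*' --verbose

  # crm_mon with fence-history level 3 (as used above) shows the same data
  crm_mon -m3 -1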
after (2.0.2-2.el8)
===================

> [root@virt-037 ~]# crm_mon -1
> Stack: corosync
> Current DC: virt-040 (version 2.0.2-2.el8-744a30d655) - partition with quorum
> Last updated: Wed Aug 21 14:51:59 2019
> Last change: Wed Aug 21 13:02:55 2019 by root via cibadmin on virt-037
>
> 3 nodes configured
> 3 resources configured
>
> Online: [ virt-037 virt-038 virt-040 ]
>
> Active resources:
>
> fence-virt-037 (stonith:fence_xvm): Started virt-037
> fence-virt-038 (stonith:fence_xvm): Started virt-038
> fence-virt-040 (stonith:fence_xvm): Started virt-040

Add a delay to one node's fencing resource:

> [root@virt-037 ~]# pcs stonith update fence-virt-038 delay=20

Request a reboot of the node with delayed fencing:

> [root@virt-037 ~]# stonith_admin --reboot virt-038

Wait ~10 seconds, then kill both pacemaker-fenced and the actual fence agent (fence_xvm here) on the DC (virt-040):

> [root@virt-040 ~]# killall -9 pacemaker-fenced fence_xvm

Wait for the fenced node to rejoin the cluster (watch crm_mon on both remaining nodes); note that the stonith_admin command timed out:

> [root@virt-037 ~]# stonith_admin --reboot virt-038
> [root@virt-037 ~]# echo $?
> 124

The DC node starts with an empty fencing history when pacemaker-fenced is restarted on it; after a minute or two the history is synced from the other node:

> [root@virt-040 ~]# crm_mon -m3 -1
> Stack: corosync
> Current DC: virt-040 (version 2.0.2-2.el8-744a30d655) - partition with quorum
> Last updated: Wed Aug 21 15:05:52 2019
> Last change: Wed Aug 21 14:53:19 2019 by root via cibadmin on virt-037
>
> 3 nodes configured
> 3 resources configured
>
> Online: [ virt-037 virt-038 virt-040 ]
>
> Active resources:
>
> fence-virt-037 (stonith:fence_xvm): Started virt-037
> fence-virt-038 (stonith:fence_xvm): Started virt-040
> fence-virt-040 (stonith:fence_xvm): Started virt-038
>
> Failed Resource Actions:
> * fence-virt-038_monitor_60000 on virt-040 'unknown error' (1): call=17, status=Error, exitreason=''
> * fence-virt-040_monitor_60000 on virt-040 'unknown error' (1): call=15, status=Error, exitreason=''
>
> Failed Fencing Actions:
> * reboot of virt-038 failed: delegate=virt-037, client=stonith_admin.18403, origin=virt-037,
>   completed='Wed Aug 21 15:01:57 2019'
> * reboot of virt-038 failed: delegate=virt-037, client=stonith_admin.17854, origin=virt-037,
>   completed='Wed Aug 21 14:56:51 2019'
>
> Fencing History:
> * reboot of virt-038 successful: delegate=virt-037, client=stonith_admin.18339, origin=virt-037,
>   completed='Wed Aug 21 14:57:55 2019'

The output above was captured after multiple tries, both with and without killing fence_xvm along with pacemaker-fenced; when both were killed, fencing failed instead of being delegated to another online node (i.e. the victim was not killed at all).

Result: no orphaned pending action stuck in the fencing history; fencing may simply fail instead of being delegated to another node.
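(Editor's sketch, not from the original report: the exit status 124 seen in both runs matches pacemaker 2.0's CRM_EX_TIMEOUT code, so a test script can detect the timed-out request and then check how the action was recorded.)

  # issue the reboot request and capture the exit status
  stonith_admin --reboot virt-038
  rc=$?
  # 124 (CRM_EX_TIMEOUT) means the request timed out, as in the runs above;
  # inspect the history to see whether the action ended up pending or failed
  if [ "$rc" -eq 124 ]; then
      stonith_admin --history virt-038 --verbose
  fi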
Klaus: is it OK that fencing is not re-attempted after it fails on a delegate (the whole fencing request simply times out)?

I've checked that this does not affect the cluster node failure recovery process, i.e. panicking a node and killing pacemaker-fenced + fence_xvm on the DC still reboots the node (delegated to another node):

> [root@virt-040 ~]# crm_mon -m3 -1
> Stack: corosync
> Current DC: virt-040 (version 2.0.2-2.el8-744a30d655) - partition with quorum
> Last updated: Wed Aug 21 15:27:07 2019
> Last change: Wed Aug 21 14:53:19 2019 by root via cibadmin on virt-037
>
> 3 nodes configured
> 3 resources configured
>
> Online: [ virt-037 virt-040 ]
> OFFLINE: [ virt-038 ]
>
> Active resources:
>
> fence-virt-037 (stonith:fence_xvm): Started virt-037
> fence-virt-038 (stonith:fence_xvm): Started virt-040
> fence-virt-040 (stonith:fence_xvm): Started virt-037
>
> Failed Resource Actions:
> * fence-virt-040_monitor_60000 on virt-040 'unknown error' (1): call=15, status=Error, exitreason=''
> * fence-virt-038_monitor_60000 on virt-040 'unknown error' (1): call=27, status=Error, exitreason=''
>
> Failed Fencing Actions:
> * reboot of virt-038 failed: delegate=, client=pacemaker-controld.9232, origin=virt-040,
>   completed='Wed Aug 21 15:23:28 2019'
> * reboot of virt-038 failed: delegate=virt-037, client=stonith_admin.19027, origin=virt-037,
>   completed='Wed Aug 21 15:21:38 2019'
> * reboot of virt-038 failed: delegate=virt-037, client=stonith_admin.18403, origin=virt-037,
>   completed='Wed Aug 21 15:01:57 2019'
> * reboot of virt-038 failed: delegate=virt-037, client=stonith_admin.17854, origin=virt-037,
>   completed='Wed Aug 21 14:56:51 2019'
>
> Fencing History:
> * reboot of virt-038 successful: delegate=virt-037, client=pacemaker-controld.9232, origin=virt-040,
>   completed='Wed Aug 21 15:23:21 2019'
> * reboot of virt-038 successful: delegate=virt-037, client=stonith_admin.18339, origin=virt-037,
>   completed='Wed Aug 21 14:57:55 2019'

It's just that I would expect both the administrator command and failure recovery to result in the same fencing behavior.

The intention was not to introduce any change in fencing behaviour; just the annoying pending actions should go away. Any other request regarding behaviour should be handled separately, I guess.

Klaus, I do think the fencing failure is a result of the fix. The recovered fencer now broadcasts a failure to all nodes.

I'm not sure: whether the fencing actually failed, or is only being reported as failed; whether the request should be re-attempted by another node (if it's within the timeout); and whether the original request times out even if the recovery and failure happen within the timeout.

(In reply to Ken Gaillot from comment #8)
> Klaus,
>
> I do think the fencing failure is a result of the fix. The recovered fencer
> now broadcasts a failure to all nodes.
>
> I'm not sure: whether the fencing actually failed, or is only being reported
> as failed; whether the request should be re-attempted by another node (if
> it's within the timeout); and whether the original request times out even if
> the recovery and failure happen within the timeout.

Sounds reasonable. Originally it should have failed as well, via timeout - right. I guess a blind retry by some other fencer might be unpreferable: if it went OK, the node might get unfenced meanwhile. In any case it sounds better to have the action fail overall and have the situation reevaluated by schedulerd. Isn't that roughly what we seem to get now?

(In reply to Klaus Wenninger from comment #9)
> (In reply to Ken Gaillot from comment #8)
> > Klaus,
> >
> > I do think the fencing failure is a result of the fix. The recovered fencer
> > now broadcasts a failure to all nodes.
> >
> > I'm not sure: whether the fencing actually failed, or is only being reported
> > as failed; whether the request should be re-attempted by another node (if
> > it's within the timeout); and whether the original request times out even if
> > the recovery and failure happen within the timeout.
>
> Sounds reasonable. Originally it should have failed as well, via timeout -
> right.

Right, I didn't pay attention to that, but the "before" run here did have stonith_admin time out (and of course the status showed it as stuck in pending).

> I guess a blind retry by some other fencer might be unpreferable:
> if it went OK, the node might get unfenced meanwhile. In any case it sounds
> better to have the action fail overall and have the situation reevaluated
> by schedulerd.

That does make sense to me.

> Isn't that roughly what we seem to get now?

I think so. :)

To answer Patrik's question about user-initiated (stonith_admin) vs. cluster-initiated fencing: in most respects they are identical, but there are some differences. In the case of cluster-initiated fencing, the controller is doing the equivalent of running stonith_admin and then, when it gets the timeout, retrying. So it should be the same as if you manually retried stonith_admin after the first run timed out.
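(Editor's sketch, not from the original discussion: a rough shell equivalent of the controller's behavior described above. The 60-second value is arbitrary; -t/--timeout is an existing stonith_admin option, and the loop itself is purely illustrative.)

  # issue the fence request; on timeout or failure, issue it again,
  # roughly mirroring the controller's retry after a timed-out request
  until stonith_admin --reboot virt-038 --timeout 60; do
      echo "fence request did not succeed (rc=$?); retrying" >&2
      sleep 5
  done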
(In reply to Ken Gaillot from comment #10)
> To answer Patrik's question about user-initiated (stonith_admin) vs.
> cluster-initiated fencing: in most respects they are identical, but there
> are some differences. In the case of cluster-initiated fencing, the
> controller is doing the equivalent of running stonith_admin and then, when
> it gets the timeout, retrying. So it should be the same as if you manually
> retried stonith_admin after the first run timed out.

Timeouts aren't taken from the CIB but default to 90s or 120s, or whatever you specify, so that might influence the effective behaviour as well. But there is already a BZ to make the timeout default to what is configured in the CIB.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3385