Bug 1708378
| Field | Value |
| --- | --- |
| Summary | pacemaker-fenced dying while a fence-action is pending leaves behind pending actions |
| Product | Red Hat Enterprise Linux 8 |
| Component | pacemaker |
| Version | 8.0 |
| Status | CLOSED ERRATA |
| Severity | low |
| Priority | high |
| Reporter | Klaus Wenninger <kwenning> |
| Assignee | Klaus Wenninger <kwenning> |
| QA Contact | cluster-qe <cluster-qe> |
| CC | abeekhof, cluster-maint, kgaillot, phagara |
| Target Milestone | pre-dev-freeze |
| Target Release | 8.1 |
| Hardware | Unspecified |
| OS | Unspecified |
| Fixed In Version | pacemaker-2.0.2-2.el8 |
| Clones | 1708380 |
| Type | Bug |
| Last Closed | 2019-11-05 20:57:48 UTC |
Description (Klaus Wenninger, 2019-05-09 17:59:59 UTC)
qa-ack+; reproducer in bug description.

A minimally intrusive solution: whenever a fencer instance is synced a pending fence action that it is supposed to own (e.g. after pacemaker-fenced was restarted following a segfault), make that pending fence action fail. Before the fix, these seemingly hanging pending fence actions (not really hanging, but that is how they look in the history) could accumulate in a cluster over time, being synced back and forth between the nodes. With the fix described above, even these historical entries disappear once pacemaker on all cluster nodes has been updated to the fixed version (in fact, it is enough to update the nodes that originated the entries). Thus there should be no need to extend stonith_admin to allow erasing pending actions (--force comes to mind; potentially dangerous, since the history list is not just history: for pending actions it is essential to fencing behaviour). This has the additional advantage that no support in pcs is needed.

environment: 3-node cluster with properly configured fencing

before (2.0.1-5.el8)
====================

> [root@virt-030 ~]# crm_mon -1
> Stack: corosync
> Current DC: virt-031 (version 2.0.1-5.el8-0eb7991564) - partition with quorum
> Last updated: Wed Aug 21 14:14:30 2019
> Last change: Wed Aug 21 13:46:31 2019 by root via cibadmin on virt-030
>
> 3 nodes configured
> 3 resources configured
>
> Online: [ virt-030 virt-031 virt-032 ]
>
> Active resources:
>
> fence-virt-030 (stonith:fence_xvm): Started virt-030
> fence-virt-031 (stonith:fence_xvm): Started virt-031
> fence-virt-032 (stonith:fence_xvm): Started virt-032

Add a delay to one node's fencing resource:

> [root@virt-030 ~]# pcs stonith update fence-virt-032 delay=20

Request a reboot of the node with delayed fencing:

> [root@virt-030 ~]# stonith_admin --reboot virt-032

Wait ~10 seconds, then kill both pacemaker-fenced and the actual fence agent (fence_xvm here) on the DC (virt-031):

> [root@virt-031 ~]# killall -9 pacemaker-fenced fence_xvm

Wait for the fenced node to rejoin the cluster (watch crm_mon on both remaining nodes); note that the stonith_admin command timed out:

> [root@virt-030 ~]# stonith_admin --reboot virt-032
> [root@virt-030 ~]# echo $?
> 124

The DC node starts with an empty fencing history when pacemaker-fenced is restarted on it; after a minute or two the history is synced from the other node:

> [root@virt-031 ~]# crm_mon -m3 -1
> Stack: corosync
> Current DC: virt-031 (version 2.0.1-5.el8-0eb7991564) - partition with quorum
> Last updated: Wed Aug 21 14:37:47 2019
> Last change: Wed Aug 21 14:34:08 2019 by hacluster via crmd on virt-031
>
> 3 nodes configured
> 3 resources configured
>
> Online: [ virt-030 virt-031 virt-032 ]
>
> Active resources:
>
> fence-virt-030 (stonith:fence_xvm): Started virt-030
> fence-virt-031 (stonith:fence_xvm): Started virt-031
> fence-virt-032 (stonith:fence_xvm): Started virt-032
>
> Failed Resource Actions:
> * fence-virt-031_monitor_60000 on virt-031 'unknown error' (1): call=37, status=Error, exitreason=''
>
> Fencing History:
> * reboot of virt-032 pending: client=stonith_admin.18076, origin=virt-031
> * reboot of virt-032 successful: delegate=virt-030, client=pacemaker-controld.15586, origin=virt-031,
>   completed='Wed Aug 21 14:36:10 2019'

Result: a "pending" stonith action is stuck in the fencing history, even though the action was delegated to and successfully performed by another node after pacemaker-fenced was killed on the DC while fencing was in progress.
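(Editor's aside, not part of the original report: the stuck entry can also be listed directly from the fencer. This assumes the stonith_admin options of pacemaker 2.0.x; the exact output format may differ between versions.)

  # query fencing history for all targets ('*'); the orphaned action is the
  # one still listed as "pending" even after the reboot actually completed
  stonith_admin --history '*' --verbose

  # crm_mon with fence-history level 3 (as used above) shows the same data
  crm_mon -m3 -1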
after (2.0.2-2.el8)
===================

> [root@virt-037 ~]# crm_mon -1
> Stack: corosync
> Current DC: virt-040 (version 2.0.2-2.el8-744a30d655) - partition with quorum
> Last updated: Wed Aug 21 14:51:59 2019
> Last change: Wed Aug 21 13:02:55 2019 by root via cibadmin on virt-037
>
> 3 nodes configured
> 3 resources configured
>
> Online: [ virt-037 virt-038 virt-040 ]
>
> Active resources:
>
> fence-virt-037 (stonith:fence_xvm): Started virt-037
> fence-virt-038 (stonith:fence_xvm): Started virt-038
> fence-virt-040 (stonith:fence_xvm): Started virt-040

Add a delay to one node's fencing resource:

> [root@virt-037 ~]# pcs stonith update fence-virt-038 delay=20

Request a reboot of the node with delayed fencing:

> [root@virt-037 ~]# stonith_admin --reboot virt-038

Wait ~10 seconds, then kill both pacemaker-fenced and the actual fence agent (fence_xvm here) on the DC (virt-040):

> [root@virt-040 ~]# killall -9 pacemaker-fenced fence_xvm

Wait for the fenced node to rejoin the cluster (watch crm_mon on both remaining nodes); note that the stonith_admin command timed out:

> [root@virt-037 ~]# stonith_admin --reboot virt-038
> [root@virt-037 ~]# echo $?
> 124

The DC node starts with an empty fencing history when pacemaker-fenced is restarted on it; after a minute or two the history is synced from the other node:

> [root@virt-040 ~]# crm_mon -m3 -1
> Stack: corosync
> Current DC: virt-040 (version 2.0.2-2.el8-744a30d655) - partition with quorum
> Last updated: Wed Aug 21 15:05:52 2019
> Last change: Wed Aug 21 14:53:19 2019 by root via cibadmin on virt-037
>
> 3 nodes configured
> 3 resources configured
>
> Online: [ virt-037 virt-038 virt-040 ]
>
> Active resources:
>
> fence-virt-037 (stonith:fence_xvm): Started virt-037
> fence-virt-038 (stonith:fence_xvm): Started virt-040
> fence-virt-040 (stonith:fence_xvm): Started virt-038
>
> Failed Resource Actions:
> * fence-virt-038_monitor_60000 on virt-040 'unknown error' (1): call=17, status=Error, exitreason=''
> * fence-virt-040_monitor_60000 on virt-040 'unknown error' (1): call=15, status=Error, exitreason=''
>
> Failed Fencing Actions:
> * reboot of virt-038 failed: delegate=virt-037, client=stonith_admin.18403, origin=virt-037,
>   completed='Wed Aug 21 15:01:57 2019'
> * reboot of virt-038 failed: delegate=virt-037, client=stonith_admin.17854, origin=virt-037,
>   completed='Wed Aug 21 14:56:51 2019'
>
> Fencing History:
> * reboot of virt-038 successful: delegate=virt-037, client=stonith_admin.18339, origin=virt-037,
>   completed='Wed Aug 21 14:57:55 2019'

The output above was captured after multiple tries, both with and without killing fence_xvm along with pacemaker-fenced; when both were killed, fencing failed instead of being delegated to another online node (i.e. the victim was not killed at all).

Result: no orphaned pending action stuck in the fencing history; fencing may simply fail instead of being delegated to another node.
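(Editor's sketch, not from the original report: the exit status 124 seen in both runs matches pacemaker 2.0's CRM_EX_TIMEOUT code, so a test script can detect the timed-out request and then check how the action was recorded.)

  # issue the reboot request and capture the exit status
  stonith_admin --reboot virt-038
  rc=$?
  # 124 (CRM_EX_TIMEOUT) means the request timed out, as in the runs above;
  # inspect the history to see whether the action ended up pending or failed
  if [ "$rc" -eq 124 ]; then
      stonith_admin --history virt-038 --verbose
  fi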
Klaus: is it OK that fencing is not re-attempted after it fails on a delegate (the whole fencing request simply times out)?

I've checked that this does not affect the cluster node failure recovery process, i.e. panicking a node and killing pacemaker-fenced + fence_xvm on the DC still reboots the node (delegated to another node):

> [root@virt-040 ~]# crm_mon -m3 -1
> Stack: corosync
> Current DC: virt-040 (version 2.0.2-2.el8-744a30d655) - partition with quorum
> Last updated: Wed Aug 21 15:27:07 2019
> Last change: Wed Aug 21 14:53:19 2019 by root via cibadmin on virt-037
>
> 3 nodes configured
> 3 resources configured
>
> Online: [ virt-037 virt-040 ]
> OFFLINE: [ virt-038 ]
>
> Active resources:
>
> fence-virt-037 (stonith:fence_xvm): Started virt-037
> fence-virt-038 (stonith:fence_xvm): Started virt-040
> fence-virt-040 (stonith:fence_xvm): Started virt-037
>
> Failed Resource Actions:
> * fence-virt-040_monitor_60000 on virt-040 'unknown error' (1): call=15, status=Error, exitreason=''
> * fence-virt-038_monitor_60000 on virt-040 'unknown error' (1): call=27, status=Error, exitreason=''
>
> Failed Fencing Actions:
> * reboot of virt-038 failed: delegate=, client=pacemaker-controld.9232, origin=virt-040,
>   completed='Wed Aug 21 15:23:28 2019'
> * reboot of virt-038 failed: delegate=virt-037, client=stonith_admin.19027, origin=virt-037,
>   completed='Wed Aug 21 15:21:38 2019'
> * reboot of virt-038 failed: delegate=virt-037, client=stonith_admin.18403, origin=virt-037,
>   completed='Wed Aug 21 15:01:57 2019'
> * reboot of virt-038 failed: delegate=virt-037, client=stonith_admin.17854, origin=virt-037,
>   completed='Wed Aug 21 14:56:51 2019'
>
> Fencing History:
> * reboot of virt-038 successful: delegate=virt-037, client=pacemaker-controld.9232, origin=virt-040,
>   completed='Wed Aug 21 15:23:21 2019'
> * reboot of virt-038 successful: delegate=virt-037, client=stonith_admin.18339, origin=virt-037,
>   completed='Wed Aug 21 14:57:55 2019'

It's just that I would expect both the administrator command and failure recovery to result in the same fencing behavior.

The intention was not to introduce any change in fencing behaviour; just the annoying pending actions should go away. Any other request regarding behaviour should be handled separately, I guess.

Klaus, I do think the fencing failure is a result of the fix. The recovered fencer now broadcasts a failure to all nodes.

I'm not sure: whether the fencing actually failed, or is only being reported as failed; whether the request should be re-attempted by another node (if it's within the timeout); and whether the original request times out even if the recovery and failure happen within the timeout.

(In reply to Ken Gaillot from comment #8)
> Klaus,
>
> I do think the fencing failure is a result of the fix. The recovered fencer
> now broadcasts a failure to all nodes.
>
> I'm not sure: whether the fencing actually failed, or is only being reported
> as failed; whether the request should be re-attempted by another node (if
> it's within the timeout); and whether the original request times out even if
> the recovery and failure happen within the timeout.

Sounds reasonable. Originally it should have failed as well, via timeout - right. I guess a blind retry by some other fencer might be unpreferable: if it went OK, the node might get unfenced meanwhile. In any case it sounds better to have the action fail overall and have the situation reevaluated by schedulerd. Isn't that roughly what we seem to get now?

(In reply to Klaus Wenninger from comment #9)
> (In reply to Ken Gaillot from comment #8)
> > Klaus,
> >
> > I do think the fencing failure is a result of the fix. The recovered fencer
> > now broadcasts a failure to all nodes.
> >
> > I'm not sure: whether the fencing actually failed, or is only being reported
> > as failed; whether the request should be re-attempted by another node (if
> > it's within the timeout); and whether the original request times out even if
> > the recovery and failure happen within the timeout.
>
> Sounds reasonable. Originally it should have failed as well, via timeout -
> right.

Right, I didn't pay attention to that, but the "before" run here did have stonith_admin time out (and of course the status showed it as stuck in pending).

> I guess a blind retry by some other fencer might be unpreferable:
> if it went OK, the node might get unfenced meanwhile. In any case it sounds
> better to have the action fail overall and have the situation reevaluated
> by schedulerd.

That does make sense to me.

> Isn't that roughly what we seem to get now?

I think so. :)

To answer Patrik's question about user-initiated (stonith_admin) vs. cluster-initiated fencing: in most respects they are identical, but there are some differences. In the case of cluster-initiated fencing, the controller is doing the equivalent of running stonith_admin and then, when it gets the timeout, retrying. So it should be the same as if you manually retried stonith_admin after the first run timed out.
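(Editor's sketch, not from the original discussion: a rough shell equivalent of the controller's behavior described above. The 60-second value is arbitrary; -t/--timeout is an existing stonith_admin option, and the loop itself is purely illustrative.)

  # issue the fence request; on timeout or failure, issue it again,
  # roughly mirroring the controller's retry after a timed-out request
  until stonith_admin --reboot virt-038 --timeout 60; do
      echo "fence request did not succeed (rc=$?); retrying" >&2
      sleep 5
  done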
(In reply to Ken Gaillot from comment #10)
> To answer Patrik's question about user-initiated (stonith_admin) vs.
> cluster-initiated fencing: in most respects they are identical, but there
> are some differences. In the case of cluster-initiated fencing, the
> controller is doing the equivalent of running stonith_admin and then, when
> it gets the timeout, retrying. So it should be the same as if you manually
> retried stonith_admin after the first run timed out.

Timeouts aren't taken from the CIB but default to 90s or 120s, or whatever you specify, so that might influence the effective behaviour as well. But there is already a BZ to make the timeout default to what is configured in the CIB.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3385