Bug 1708380

Summary:	when pacemaker-fenced is automatically restarted after a segfault it doesn't trigger resyncing of the fence-history
Product:	Red Hat Enterprise Linux 8	Reporter:	Klaus Wenninger <kwenning>
Component:	pacemaker	Assignee:	Klaus Wenninger <kwenning>
Status:	CLOSED ERRATA	QA Contact:	cluster-qe <cluster-qe>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	8.0	CC:	abeekhof, cluster-maint, cluster-qe, kgaillot, phagara
Target Milestone:	rc
Target Release:	8.1
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	pacemaker-2.0.2-2.el8	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:	1708378	Environment:
Last Closed:	2019-11-05 20:57:48 UTC	Type:	Bug

Description Klaus Wenninger 2019-05-09 18:02:10 UTC

+++ This bug was initially created as a clone of Bug #1708378 +++

Description of problem:

If pacemaker-fenced crashes (e.g. with a segfault) while that very instance has a pending fence-action, the pending action stays in the fence-history list.
Unfortunately, fence-history cleanup with stonith_admin only handles failed and successful actions; there is no way to remove pending actions.
When fenced is restarted after a segfault, it comes up with an empty fence-history and doesn't trigger a sync with the history known to the rest of the cluster.

Version-Release number of selected component (if applicable):

2.0.1-5.el8

How reproducible:

100%

Steps to Reproduce:
1. Use stonith_admin to trigger a fence-action, ideally a somewhat slow one, so that there is time to see it appear as pending in crm_mon.
2. Issue 'killall -9 pacemaker-fenced' on the node where the pending fence-action is being carried out.

Actual results:

Viewed from a different node in the cluster, the pending fence-action persists indefinitely.
Viewed from the node where fenced was just killed, all fence-actions are gone (including the past failed and successful ones).
There is no way to purge the pending fence-action using stonith_admin.

Expected results:

After some timeout, the pending fence-action (which is no longer actually pending) should at least go away, or rather be converted into a failed fence-action.
If a leftover pending fence-action remains, it would be preferable to have a force option in stonith_admin to remove it manually.
When fenced is restarted, it should resync the fence-history with the other nodes.
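
To make the difference between the actual and the expected behaviour concrete, here is a small toy model in Python. It is not Pacemaker code: the names FenceAction, Fencer and resync, the node names, and the action ids are all hypothetical and exist only for illustration. It assumes a history keyed by action id and assumes that a pending action originated by the crashed instance can safely be marked as failed.

```python
from dataclasses import dataclass, field, replace

# Hypothetical states for this toy model (not Pacemaker's internal values).
PENDING, DONE, FAILED = "pending", "done", "failed"


@dataclass
class FenceAction:
    target: str      # node to be fenced
    originator: str  # node whose fencer instance carries out the action
    state: str


@dataclass
class Fencer:
    name: str
    history: dict = field(default_factory=dict)  # action id -> FenceAction

    def crash_and_restart(self):
        # A restarted fencer comes up with an empty fence-history.
        self.history = {}


def resync(restarted, peers):
    """Expected behaviour: rebuild the history from the peers and fail
    pending actions that the crashed instance can no longer complete."""
    for peer in peers:
        for action_id, action in peer.history.items():
            if action.state == PENDING and action.originator == restarted.name:
                action.state = FAILED
            restarted.history[action_id] = replace(action)  # independent copy


# Two nodes share the same history: one old successful action and one
# pending action that node1 is carrying out.
node1, node2 = Fencer("node1"), Fencer("node2")
for node in (node1, node2):
    node.history["a0"] = FenceAction("node3", "node2", DONE)
    node.history["a1"] = FenceAction("node3", "node1", PENDING)

# Actual results: node1 loses everything, node2 keeps the stale pending entry.
node1.crash_and_restart()
history_after_crash = dict(node1.history)
stale_state_on_peer = node2.history["a1"].state

# Expected results: after a resync both nodes agree and nothing is pending.
resync(node1, [node2])
```

In this model, history_after_crash is empty while node2 still reports "a1" as pending, which mirrors the actual results above. After resync both nodes hold the same two entries, with "a1" marked as failed.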

Additional info:

Comment 1 Patrik Hagara 2019-05-14 13:05:29 UTC

qa-ack+, reproducer in bug description

Comment 2 Klaus Wenninger 2019-06-11 07:08:46 UTC

https://github.com/ClusterLabs/pacemaker/pull/1805

Comment 4 Patrik Hagara 2019-09-02 10:18:23 UTC

Same verification steps as in https://bugzilla.redhat.com/show_bug.cgi?id=1708378#c5 apply (specifically the "the DC node will start with empty fencing history when pacemaker-fenced is restarted on it, after a minute or two the history is synced from the other node" part).

Marking verified in 2.0.2-2.el8.

Comment 6 errata-xmlrpc 2019-11-05 20:57:48 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3385