Bug 1708380

Summary: when pacemaker-fenced is automatically restarted after a segfault it doesn't trigger resyncing of the fence-history
Product: Red Hat Enterprise Linux 8
Reporter: Klaus Wenninger <kwenning>
Component: pacemaker
Assignee: Klaus Wenninger <kwenning>
Status: CLOSED ERRATA
QA Contact: cluster-qe <cluster-qe>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 8.0
CC: abeekhof, cluster-maint, cluster-qe, kgaillot, phagara
Target Milestone: rc
Target Release: 8.1
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: pacemaker-2.0.2-2.el8
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1708378
Environment:
Last Closed: 2019-11-05 20:57:48 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Klaus Wenninger 2019-05-09 18:02:10 UTC
+++ This bug was initially created as a clone of Bug #1708378 +++

Description of problem:

If fenced crashes (e.g. segfaults) while that very instance has a pending fence action, the pending action stays in the history list.
Unfortunately, cleaning up the fence history with stonith_admin only removes failed and successful actions; there is no way to remove pending actions.
When fenced is restarted after a segfault, it comes up with an empty fence history and does not trigger a sync with the history known to the rest of the cluster.

Version-Release number of selected component (if applicable):

2.0.1-5.el8

How reproducible:

100%

Steps to Reproduce:
1. Use stonith_admin to trigger a fence action, ideally a somewhat sluggish one, so that there is time to see it appear as pending in crm_mon.
2. Run 'killall -9 pacemaker-fenced' on the node where the pending fence action is being carried out.
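
The steps above can be sketched as a small set of shell helpers. This is only an illustration, not part of the original report: the TARGET node name, the DRY_RUN switch, and the helper function names are assumptions made for the example, and the stonith_admin options used (--fence, --history, --cleanup) should be checked against the man page of the installed version. By default the helpers only print the command they would run.

```shell
#!/bin/sh
# Reproduction sketch. TARGET, DRY_RUN and the function names are
# hypothetical; only the killall command is taken verbatim from the steps.
TARGET="${TARGET:-node2}"   # assumed name of the node to be fenced
DRY_RUN="${DRY_RUN:-1}"     # 1 = only print commands, 0 = really run them

run() {
    # Print the command, and execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# Step 1 (any cluster node): request fencing of TARGET. The command blocks
# until the action finishes, so run it in a second terminal or background it.
start_fence() { run stonith_admin --fence "$TARGET"; }

# Step 2 (on the node carrying out the pending action): kill the fencer.
kill_fencer() { run killall -9 pacemaker-fenced; }

# Check (run on each node and compare): fence history for all targets.
show_history() { run stonith_admin --history '*'; }

# Cleanup attempt: per the report this removes failed and successful
# entries only, so a stale pending entry is expected to remain.
cleanup_history() { run stonith_admin --history '*' --cleanup; }
```

With DRY_RUN left at its default, calling e.g. show_history just echoes the command, which makes it safe to review the sequence before running it on a test cluster with DRY_RUN=0.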

Actual results:

Viewed from a different node in the cluster, the pending fence action persists indefinitely.
Viewed from the node where fenced was just killed, all fence actions are gone (including the past failed and successful ones).
There is no way to purge the pending fence action using stonith_admin.

Expected results:

After some timeout, the pending fence action (which is no longer actually pending) should at least go away or be converted into a failed fence action.
If a leftover pending fence action remains, it would be preferable to have a force option in stonith_admin to remove it manually.
When fenced is restarted, it should resync the fence history with the other nodes.

Additional info:

Comment 1 Patrik Hagara 2019-05-14 13:05:29 UTC
qa-ack+, reproducer in bug description

Comment 2 Klaus Wenninger 2019-06-11 07:08:46 UTC
https://github.com/ClusterLabs/pacemaker/pull/1805

Comment 4 Patrik Hagara 2019-09-02 10:18:23 UTC
Same verification steps as in https://bugzilla.redhat.com/show_bug.cgi?id=1708378#c5 apply (specifically the "the DC node will start with empty fencing history when pacemaker-fenced is restarted on it, after a minute or two the history is synced from the other node" part).

Marking verified in 2.0.2-2.el8.

Comment 6 errata-xmlrpc 2019-11-05 20:57:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3385