Bug 1470262

Summary: disabling a fencing-device that has queued actions leads to stonithd receiving SIGABRT
Product: Red Hat Enterprise Linux 7
Reporter: Klaus Wenninger <kwenning>
Component: pacemaker
Assignee: Ken Gaillot <kgaillot>
Status: CLOSED ERRATA
QA Contact: cluster-qe <cluster-qe>
Severity: urgent
Priority: urgent
Version: 7.4
CC: abeekhof, aherr, cfeist, cluster-maint, jruemker, mnovacek, nbarcet, phagara
Target Milestone: rc
Keywords: ZStream
Target Release: 7.5
Hardware: Unspecified
OS: Unspecified
Fixed In Version: pacemaker-1.1.18-1.el7
Doc Type: Bug Fix
Doc Text:
Previously, if a fencing device configured with the pcmk_delay_max setting was disabled while a fencing action was being delayed, Pacemaker's stonithd service attempted to free memory used for the action twice. As a consequence, Pacemaker terminated unexpectedly. With this update, stonithd has been fixed to free the memory only once, and as a result, the described problem no longer occurs.
Last Closed: 2018-04-10 15:30:29 UTC
Type: Bug
Bug Blocks: 1481141

Description Klaus Wenninger 2017-07-12 15:43:59 UTC
Description of problem:
When a stonith action is being delayed by pcmk_delay_max (or by the new upstream attribute pcmk_delay_base) and the stonith device is disabled
during that delay, stonithd receives SIGABRT and dumps core.
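
The Doc Text above attributes the abort to the memory of the delayed action being freed twice. As a rough illustration of that failure class and of the "release exactly once" behaviour the fix aims for, here is a minimal model. It is not Pacemaker code (stonithd is written in C), and every name in it (DelayedAction, FencingDevice, complete, delay_expired, the result strings) is hypothetical:

```python
class DelayedAction:
    """Hypothetical model of a fencing action waiting for its delay timer."""

    def __init__(self, target):
        self.target = target
        self.result = None       # set exactly once, when the action completes
        self.release_count = 0   # how often the action was released ("freed")

    def complete(self, result):
        """Report a result and release the action; repeated calls are no-ops."""
        if self.result is not None:
            return False         # already completed: do not release again
        self.result = result
        self.release_count += 1
        return True


class FencingDevice:
    """Hypothetical model of a fencing device with queued, delayed actions."""

    def __init__(self):
        self.enabled = True
        self.pending = []

    def schedule(self, target):
        action = DelayedAction(target)
        self.pending.append(action)
        return action

    def disable(self):
        """Disabling the device fails every still-delayed action immediately."""
        self.enabled = False
        pending, self.pending = self.pending, []
        for action in pending:
            action.complete("failed: device disabled")

    def delay_expired(self, action):
        """Timer callback; must not release an action disable() already handled."""
        if action in self.pending:
            self.pending.remove(action)
        return action.complete("executed")
```

In this model, calling disable() while an action is queued completes it with a failure right away, and a later delay_expired() for the same action returns False instead of releasing it a second time.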

Version-Release number of selected component (if applicable):
1.1.17

How reproducible:
100%

Steps to Reproduce:
1. Set up fencing with a fencing device that has the attribute pcmk_delay_max or pcmk_delay_base (the latter was supported only in upstream master at the time).
2. Trigger fencing of a node, e.g. with pcs.
3. Use pcs to disable the fencing resource while the configured delay is still running.

Actual results:
stonithd dumps core.
The pcs fencing command runs into a timeout because its communication partner has died.

Expected results:
No core dump.
The pcs fencing command fails immediately.

Additional info:
Already fixed upstream
https://github.com/ClusterLabs/pacemaker/commit/e7027e9d303be5e3f9531c0cb0ef8af914f2adda

Comment 6 Patrik Hagara 2017-12-07 16:29:12 UTC
Tested as per comment 1:
 * create stonith resource with pcmk_delay_max set to 300 seconds
 * start a fence operation ("pcs stonith fence ...")
 * while the fence operation is being delayed, run "pcs stonith disable ..."
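
For repeated runs, the last two steps above could be driven by a small script along these lines. This is only a sketch: it assumes it runs with sufficient privileges on a cluster node where the stonith resource already exists, and the function names, the wait_s and disable_timeout_s parameters, and their defaults are made up for illustration. Only the "pcs stonith fence" and "pcs stonith disable" commands come from the steps themselves.

```python
import subprocess
import time


def fence_cmd(node):
    """Command line that requests fencing of `node`."""
    return ["pcs", "stonith", "fence", node]


def disable_cmd(device):
    """Command line that disables the stonith resource `device`."""
    return ["pcs", "stonith", "disable", device]


def run_scenario(node, device, wait_s=10.0, disable_timeout_s=60.0):
    """Start fencing, then disable the device while the action is delayed.

    Returns (fence_exit_code, disable_exit_code_or_None, elapsed_seconds).
    """
    started = time.monotonic()
    fence = subprocess.Popen(fence_cmd(node))

    # Assumption: wait_s is far below pcmk_delay_max. The actual delay is
    # random up to that maximum, so a short wait makes it likely, not
    # certain, that the action is still being delayed.
    time.sleep(wait_s)

    try:
        disable_rc = subprocess.run(
            disable_cmd(device), timeout=disable_timeout_s
        ).returncode
    except subprocess.TimeoutExpired:
        disable_rc = None  # the disable command did not finish in time

    fence_rc = fence.wait()
    return fence_rc, disable_rc, time.monotonic() - started
```

A call such as run_scenario(node_name, stonith_id), with both arguments taken from the cluster under test, returns the two exit codes and the elapsed time, which can then be compared with the "before" and "after" observations below.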

Before the fix (1.1.16-12.el7):
 * the fence operation returns non-zero after almost double the pcmk_delay_max (e.g. 564 seconds) with the message:
> Error: unable to fence 'virt-156'
> Command failed: Timer expired
 * the "pcs stonith disable ..." command hangs, likely indefinitely
 * stonithd received SIGABRT on the node performing the fence operation
 * no node got fenced
 * node on which stonithd crashed transitions into "UNCLEAN (Online)" cluster membership status
 * fence resource marked as "FAILED (disabled)"

After the fix (1.1.18-1.el7):
 * the fence operation returns non-zero immediately with:
> Error: unable to fence 'virt-164'
> Command failed: No route to host
 * the "pcs stonith disable ..." command completes successfully
 * stonithd does not crash and no other crashes are detected by abrt
 * no node got fenced
 * all nodes are in a clean online cluster membership state
 * fence resource marked as "Stopped (disabled)"

Marking verified.

Comment 7 Patrik Hagara 2017-12-07 16:33:54 UTC
Just to clarify, the first point in "After the fix" should read:

 * the fence operation returns non-zero immediately __after "pcs stonith disable ..." completes__ with:

Comment 10 errata-xmlrpc 2018-04-10 15:30:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:0860