Bug 1470262 - disabling a fencing-device that has queued actions leads to stonithd receiving SIGABRT
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pacemaker
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 7.5
Assigned To: Ken Gaillot
Keywords: ZStream
Depends On:
Blocks: 1481141
Reported: 2017-07-12 11:43 EDT by Klaus Wenninger
Modified: 2018-04-10 11:32 EDT (History)

See Also:
Fixed In Version: pacemaker-1.1.18-1.el7
Doc Type: Bug Fix
Doc Text:
Previously, if a fencing device configured with the pcmk_delay_max setting was disabled while a fencing action was being delayed, Pacemaker's stonithd service attempted to free memory used for the action twice. As a consequence, Pacemaker terminated unexpectedly. With this update, stonithd has been fixed to free the memory only once, and as a result, the described problem no longer occurs.
Story Points: ---
Clone Of:
: 1481141 (view as bug list)
Last Closed: 2018-04-10 11:30:29 EDT
Type: Bug


External Trackers
Tracker                 ID              Priority  Status  Summary  Last Updated
Red Hat Product Errata  RHEA-2018:0860  None      None    None     2018-04-10 11:32 EDT

Description Klaus Wenninger 2017-07-12 11:43:59 EDT
Description of problem:
When a stonith action is being delayed by pcmk_delay_max (or by the new upstream attribute pcmk_delay_base) and the stonith device is disabled during
that waiting time, stonithd receives SIGABRT and dumps core.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Set up fencing with a fencing device that has the attribute pcmk_delay_max or pcmk_delay_base (the latter is supported only in upstream master at the time of this report)
2. Trigger fencing of a node, e.g. with pcs
3. Use pcs to disable the fencing resource while the configured delay is still running
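
The steps above can be sketched as a small Python helper that only builds the pcs command lines, so it can be inspected without a cluster. This is a minimal sketch under stated assumptions: the device name "fence-delay-test" and the agent "fence_xvm" are hypothetical placeholders, a real agent usually needs further options, and the node name is borrowed from comment 6. The fence command blocks while the delay runs, so the disable command has to be issued from a second shell.

```python
import shlex


def build_repro_commands(device_id, agent, node, delay_max_s=300):
    """Return the pcs invocations for steps 1-3 as argv lists.

    device_id and agent are placeholders chosen by the caller; nothing is
    executed here, the commands are only assembled.
    """
    return [
        # Step 1: fencing device with a random delay of up to delay_max_s seconds.
        ["pcs", "stonith", "create", device_id, agent,
         f"pcmk_delay_max={delay_max_s}s"],
        # Step 2: trigger fencing of the node (blocks while the delay runs).
        ["pcs", "stonith", "fence", node],
        # Step 3: from a second shell, disable the device during the delay.
        ["pcs", "stonith", "disable", device_id],
    ]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for argv in build_repro_commands("fence-delay-test", "fence_xvm", "virt-156"):
        print(shlex.join(argv))
```

Keeping the commands as argv lists rather than one shell string avoids quoting problems if they are later passed to subprocess.run.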

Actual results:
stonithd dumps core
the pcs fencing command runs into a timeout because its communication partner (stonithd) has died

Expected results:
no core dump
immediate failure result of pcs fencing command

Additional info:
Already fixed upstream
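
For readers unfamiliar with this class of bug, the Doc Text describes memory for a delayed action being freed twice. The toy model below is not Pacemaker code and does not claim to show the actual fix; every class and function name is hypothetical, and the assumption that the second free happens when the delay expires is made only for illustration. It shows the general pattern: if the disable path releases a delayed action but keeps a reference to it, a later path releases it again, and dropping the reference avoids that.

```python
class FenceAction:
    """Hypothetical stand-in for a queued fencing action."""

    def __init__(self, target):
        self.target = target
        self.released = False

    def release(self):
        # Stands in for free(); a second call models the abort on double free.
        if self.released:
            raise RuntimeError(f"double release of action for {self.target}")
        self.released = True


class FenceDevice:
    """Hypothetical device holding actions that wait out a delay."""

    def __init__(self):
        self.delayed = []

    def queue(self, action):
        self.delayed.append(action)

    def disable(self, forget_actions=True):
        """Release every delayed action when the device is disabled."""
        for action in self.delayed:
            action.release()
        if forget_actions:
            # The guard: drop the references so no later path sees them.
            self.delayed = []

    def delay_expired(self):
        """Assumed delay-timer callback: release whatever is still queued."""
        for action in self.delayed:
            action.release()
        self.delayed = []


def disable_during_delay(forget_actions):
    """Return 'ok', or 'aborted' if an action is released twice."""
    device = FenceDevice()
    device.queue(FenceAction("virt-156"))
    device.disable(forget_actions=forget_actions)
    try:
        device.delay_expired()
    except RuntimeError:
        return "aborted"
    return "ok"


if __name__ == "__main__":
    print(disable_during_delay(forget_actions=False))  # aborted
    print(disable_during_delay(forget_actions=True))   # ok
```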
Comment 6 Patrik Hagara 2017-12-07 11:29:12 EST
Tested as per comment 1:
 * create stonith resource with pcmk_delay_max set to 300 seconds
 * start a fence operation ("pcs stonith fence ...")
 * while the fence operation is being delayed, run "pcs stonith disable ..."

Before the fix (1.1.16-12.el7):
 * the fence operation returns non-zero after almost double the pcmk_delay_max (e.g. 564 seconds) with the message:
> Error: unable to fence 'virt-156'
> Command failed: Timer expired
 * the "pcs stonith disable ..." command hangs, likely indefinitely
 * stonithd received SIGABRT on the node performing the fence operation
 * no node got fenced
 * the node on which stonithd crashed transitions into the "UNCLEAN (Online)" cluster membership status
 * fence resource marked as "FAILED (disabled)"

After the fix (1.1.18-1.el7):
 * the fence operation returns non-zero immediately with:
> Error: unable to fence 'virt-164'
> Command failed: No route to host
 * the "pcs stonith disable ..." command completes successfully
 * stonithd does not crash and no other crashes are detected by abrt
 * no node got fenced
 * all nodes are in a clean online cluster membership state
 * fence resource marked as "Stopped (disabled)"
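
When repeating this test, the two outcomes can be told apart by the error text of the fence command. The helper below is a minimal sketch that matches only the two messages quoted in this comment; the function name and labels are hypothetical, and other failures may print the same strings, so the result is a hint and not proof.

```python
def classify_fence_output(output):
    """Map captured "pcs stonith fence" output to the behaviour seen in this report."""
    if "Timer expired" in output:
        # Seen with 1.1.16-12.el7: the command timed out after stonithd died.
        return "pre-fix"
    if "No route to host" in output:
        # Seen with 1.1.18-1.el7: failure reported once the device was disabled.
        return "post-fix"
    return "unknown"


if __name__ == "__main__":
    sample = "Error: unable to fence 'virt-164'\nCommand failed: No route to host"
    print(classify_fence_output(sample))
```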

Marking verified.
Comment 7 Patrik Hagara 2017-12-07 11:33:54 EST
Just to clarify, the first point in "After the fix" should read:

 * the fence operation returns non-zero immediately __after "pcs stonith disable ..." completes__ with:
Comment 10 errata-xmlrpc 2018-04-10 11:30:29 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

