Bug 1470262 - disabling a fencing-device that has queued actions leads to stonithd receiving SIGABRT
Status: VERIFIED
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pacemaker
Version: 7.4
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 7.5
Assigned To: Ken Gaillot
QA Contact: cluster-qe@redhat.com
Keywords: ZStream
Depends On:
Blocks: 1481141
Reported: 2017-07-12 11:43 EDT by Klaus Wenninger
Modified: 2017-12-07 11:33 EST (History)

See Also:
Fixed In Version: pacemaker-1.1.18-1.el7
Doc Type: Bug Fix
Doc Text:
Previously, if a fencing device configured with the pcmk_delay_max setting was disabled while a fencing action was being delayed, Pacemaker's stonithd service attempted to free memory used for the action twice. As a consequence, Pacemaker terminated unexpectedly. With this update, stonithd has been fixed to free the memory only once, and as a result, the described problem no longer occurs.
Clone Of:
: 1481141
Type: Bug

Description Klaus Wenninger 2017-07-12 11:43:59 EDT
Description of problem:
When a stonith action is being delayed by pcmk_delay_max (or the new upstream attribute pcmk_delay_base) and the stonith device is disabled during
that waiting time, stonithd dumps core because it receives SIGABRT.

Version-Release number of selected component (if applicable):
1.1.17

How reproducible:
100%

Steps to Reproduce:
1. Set up fencing with a fencing device that has the attribute pcmk_delay_max or pcmk_delay_base (the latter was supported only in upstream master at the time of this report)
2. Trigger fencing of a node, e.g. with pcs
3. Use pcs to disable the fencing resource while the configured delay is still running

Actual results:
stonithd dumps core
the pcs fencing command runs into a timeout because its communication partner has died

Expected results:
no core dump
the pcs fencing command returns a failure result immediately

Additional info:
Already fixed upstream:
https://github.com/ClusterLabs/pacemaker/commit/e7027e9d303be5e3f9531c0cb0ef8af914f2adda
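The failure class described in this report (a delayed action being freed once when its device is disabled and again on another code path) can be illustrated with a small sketch. This is not the stonithd code and not the upstream fix; the types device_t and delayed_action_t and all function names below are hypothetical. The idea shown is one common way to guarantee a single free: an action is always unlinked from the device queue before it is released, so whichever path gets there second finds nothing to free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a fencing action waiting out its delay. */
typedef struct delayed_action {
    int id;
    struct delayed_action *next;
} delayed_action_t;

/* Hypothetical stand-in for a fencing device with queued actions. */
typedef struct {
    delayed_action_t *pending;
    int frees;                      /* counts releases, for the demo only */
} device_t;

/* Queue a new delayed action on the device. */
static bool queue_action(device_t *dev, int id)
{
    delayed_action_t *a = calloc(1, sizeof(*a));

    if (a == NULL) {
        return false;
    }
    a->id = id;
    a->next = dev->pending;
    dev->pending = a;
    return true;
}

/* Unlink an action by id and hand ownership to the caller.
 * Returns NULL if the action is no longer queued. */
static delayed_action_t *take_action(device_t *dev, int id)
{
    for (delayed_action_t **p = &dev->pending; *p != NULL; p = &(*p)->next) {
        if ((*p)->id == id) {
            delayed_action_t *a = *p;

            *p = a->next;
            a->next = NULL;
            return a;
        }
    }
    return NULL;
}

/* Single release point; only reachable for an already unlinked action. */
static void release_action(device_t *dev, delayed_action_t *a)
{
    assert(a != NULL && a->next == NULL);
    free(a);
    dev->frees++;
}

/* Delay timer fired: run the action only if it is still queued. */
static bool delay_expired(device_t *dev, int id)
{
    delayed_action_t *a = take_action(dev, id);

    if (a == NULL) {
        return false;               /* cancelled earlier, nothing to free */
    }
    /* the fencing action itself would run here */
    release_action(dev, a);
    return true;
}

/* Device disabled: cancel and release everything still queued. */
static int disable_device(device_t *dev)
{
    int cancelled = 0;

    while (dev->pending != NULL) {
        release_action(dev, take_action(dev, dev->pending->id));
        cancelled++;
    }
    return cancelled;
}
```

In this sketch, calling disable_device() while an action is still waiting releases it once, and a later delay_expired() for the same id returns false instead of touching freed memory.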
Comment 6 Patrik Hagara 2017-12-07 11:29:12 EST
Tested as per comment 1:
 * create stonith resource with pcmk_delay_max set to 300 seconds
 * start a fence operation ("pcs stonith fence ...")
 * while the fence operation is being delayed, run "pcs stonith disable ..."

Before the fix (1.1.16-12.el7):
 * the fence operation returns non-zero after almost double the pcmk_delay_max (e.g. 564 seconds) with the message:
> Error: unable to fence 'virt-156'
> Command failed: Timer expired
 * the "pcs stonith disable ..." command hangs (likely) indefinitely
 * stonithd received SIGABRT on the node performing the fence operation
 * no node got fenced
 * node on which stonithd crashed transitions into "UNCLEAN (Online)" cluster membership status
 * fence resource marked as "FAILED (disabled)"

After the fix (1.1.18-1.el7):
 * the fence operation returns non-zero immediately with:
> Error: unable to fence 'virt-164'
> Command failed: No route to host
 * the "pcs stonith disable ..." command completes successfully
 * stonithd does not crash and no other crashes are detected by abrt
 * no node got fenced
 * all nodes are in a clean online cluster membership state
 * fence resource marked as "Stopped (disabled)"

Marking verified.
Comment 7 Patrik Hagara 2017-12-07 11:33:54 EST
Just to clarify, the first point in "after the fix" should read:

 * the fence operation returns non-zero immediately __after "pcs stonith disable ..." completes__ with:
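The post-fix behaviour described above (the pending fence request fails as soon as the device is disabled, rather than waiting for a timeout) can be modelled roughly as follows. This is a sketch under assumptions, not Pacemaker's implementation: fence_request_t, done_fn, complete_request and fail_pending_requests are hypothetical names, and mapping the observed "No route to host" message to the errno value EHOSTUNREACH is an assumption based on the standard Linux strerror text.

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical completion callback: rc is 0 on success or a negative errno. */
typedef void (*done_fn)(const char *target, int rc, void *user_data);

/* Hypothetical fence request that is still waiting out its delay. */
typedef struct {
    const char *target;
    done_fn done;
    void *user_data;
    int completed;                  /* guards against reporting twice */
} fence_request_t;

/* Deliver a result to the requester exactly once. */
static void complete_request(fence_request_t *req, int rc)
{
    if (req->completed) {
        return;
    }
    req->completed = 1;
    req->done(req->target, rc, req->user_data);
}

/* Device disabled: fail every still-pending request right away instead
 * of leaving the requester to run into its own timeout.
 * EHOSTUNREACH is an assumed result code (see the note above). */
static void fail_pending_requests(fence_request_t *reqs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        complete_request(&reqs[i], -EHOSTUNREACH);
    }
}

/* Example callback: record the result and print a failure message. */
static void print_result(const char *target, int rc, void *user_data)
{
    int *last_rc = user_data;

    *last_rc = rc;
    if (rc != 0) {
        printf("Error: unable to fence '%s'\n", target);
        printf("Command failed: %s\n", strerror(-rc));
    }
}
```

A caller would fill in a fence_request_t with print_result as its callback and pass it to fail_pending_requests() when the device is disabled; a second call for the same request is a no-op because of the completed flag.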
