Bug 1469255

Summary: stonith-action=poweroff leads to failure in fence-agent
Product: Red Hat Enterprise Linux 7 Reporter: Klaus Wenninger <kwenning>
Component: pacemaker Assignee: Ken Gaillot <kgaillot>
Status: CLOSED ERRATA QA Contact: michal novacek <mnovacek>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 7.4 CC: abeekhof, cfeist, cluster-maint, jpokorny, kgaillot, mlisik, mnovacek
Target Milestone: rc   
Target Release: 7.6   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: pacemaker-1.1.18-12.el7 Doc Type: Bug Fix
Doc Text:
Cause: Fence agents do not support the "poweroff" action, even though it is documented as an allowed value for the stonith-action cluster property.
Consequence: Setting the stonith-action cluster property to "poweroff" would cause fencing to fail.
Fix: The "poweroff" value is now deprecated. If a configuration contains a value of "poweroff", Pacemaker will automatically convert it to "off", and log a deprecation warning.
Result: Fencing works properly when stonith-action is set to "poweroff".
Last Closed: 2018-10-30 07:57:39 UTC Type: Bug

Description Klaus Wenninger 2017-07-10 18:19:24 UTC
Description of problem:
According to the documentation, stonith-action can be set to either reboot or poweroff.
When it is set to poweroff, the value is passed unchanged to the RHCS fence agents, which do not recognise that action.


Version-Release number of selected component (if applicable):
Found with upstream master, but 1.1.17 should behave the same.

How reproducible:
100%

Steps to Reproduce:
1. Set up a configuration with an RHCS fence agent (e.g. fence_sbd).
2. pcs property set stonith-action=poweroff
3. Have pacemaker trigger fencing, e.g. by cutting the network connection.

Actual results:
Jul 10 10:52:46 [2520] bsul0799 stonith-ng:  warning: log_action:       fence_sbd[24239] stderr: [ Failed: Unrecognised action 'poweroff' ]

Expected results:
fence-agent properly turns off the fenced node

Additional info:
A manual test with 'pcs stonith fence ...' doesn't show the problem.
With stonith-action=off, pacemaker-initiated fencing works properly, but fencing with pcs fails.

Comment 2 michal novacek 2017-08-04 10:02:21 UTC
qa-ack+: setting stonith-action=poweroff must work for all the fence agents

Comment 3 Klaus Wenninger 2017-11-03 14:51:14 UTC
moved to rhel-7.6 due to effort constraints

Comment 4 Andrew Beekhof 2017-11-05 22:15:59 UTC
Really?  Shouldn't be that hard to map 'poweroff' to 'off' inside the stonith library

Comment 5 Ken Gaillot 2017-12-07 01:09:21 UTC
Pacemaker currently accepts the values "reboot", "off", or "poweroff" for stonith-action.

LHA-style external/* agents (which are supported upstream, but not in RHEL) do support "poweroff". Remapping "poweroff" to "off" globally would break those.

I see two reasonable approaches:

1. Drop support for stonith-action=poweroff. If someone wants to use poweroff with LHA agents, they must set stonith-action=off and pcmk_off_action=poweroff. (This is the cleanest and easiest option development-wise, but involves some pain for LHA users.)

2. Remap stonith-action=poweroff to stonith-action=off, and for LHA agents, also assume pcmk_off_action=poweroff if not otherwise set. (This is easiest for all users.)

Opinions?
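
For illustration, option 2 could be sketched roughly as follows. This is a Python sketch, not Pacemaker code: resolve_fence_action, agent_is_lha and device_params are hypothetical names, and it assumes the LHA default is applied only when the configured value was "poweroff", which is just one reading of the proposal.

```python
def resolve_fence_action(stonith_action, agent_is_lha, device_params):
    """Sketch of option 2: remap poweroff to off, with an LHA-specific default.

    Returns (cluster_action, agent_action): the action the cluster requests
    and the action name that would actually be passed to the fence agent.
    """
    params = dict(device_params)  # do not mutate the caller's dict
    if stonith_action == "poweroff":
        stonith_action = "off"
        if agent_is_lha:
            # Assumption: only fill in the default if the user has not set it.
            params.setdefault("pcmk_off_action", "poweroff")
    # pcmk_off_action / pcmk_reboot_action override the name sent to the agent.
    agent_action = params.get(f"pcmk_{stonith_action}_action", stonith_action)
    return stonith_action, agent_action
```

With this reading, an RH-style agent would receive "off", while an LHA agent without an explicit pcmk_off_action would still receive "poweroff".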

Comment 6 Miroslav Lisik 2018-01-12 15:40:52 UTC
Suggestion for 3rd option:

3. Let pcs do some checks during execution of the commands 'pcs stonith create'
and 'pcs property set stonith-action=<value>'.

Before a stonith resource is created, check whether the value of 'stonith-action'
is among the actions supported by the fence agent.

Before 'stonith-action' is changed by 'pcs property', check whether the given
value is consistent with the actions of the fence agents currently in use.

Comment 7 Ken Gaillot 2018-01-12 15:49:01 UTC
(In reply to Miroslav Lisik from comment #6)
> Suggestion for 3rd option:
> 
> 3. Let pcs to do some checks during execution of commands 'pcs stonith
> create'
> and 'pcs property set stonith-property=<value>'.
> 
> Before stonith resource is created, check if the value of 'stonith-action' is
> supported by actions of fence agent.
> 
> Before 'stonith-action' is changed by 'pcs property', check if the given
> value is coherent with actions of fence agent currently in use.

That's feasible, but a little more complicated than that. If stonith-action=off, and a fence agent doesn't support "off", it can still be used as long as its pcmk_off_action is set to an action it does support (and similarly for reboot). Also, users should be allowed to add devices that aren't used by the cluster (yet), so --force should override the check.

Even with such checks, we'd still need one of the other two options to handle both RH and LHA fence agents upstream.
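
A minimal sketch of the check described in this comment, written in Python for illustration only. It is not pcs code: stonith_action_usable, agent_actions and device_params are hypothetical names, and the set of actions an agent supports is assumed to be already known to the caller.

```python
# Cluster-level actions whose agent-side name can be overridden per device
# via pcmk_off_action / pcmk_reboot_action.
OVERRIDABLE_ACTIONS = ("off", "reboot")


def stonith_action_usable(stonith_action, agent_actions, device_params, force=False):
    """Return True if a device can serve the cluster-wide stonith-action.

    agent_actions: collection of action names the fence agent supports.
    device_params: dict of the device's configured parameters.
    force: mirrors --force, which skips the check entirely.
    """
    if force:
        return True
    effective = stonith_action
    if stonith_action in OVERRIDABLE_ACTIONS:
        # The override decides which action name the agent is called with.
        effective = device_params.get(f"pcmk_{stonith_action}_action", stonith_action)
    return effective in agent_actions
```

For example, a device whose agent lacks "off" would still pass the check if its pcmk_off_action names an action the agent does support.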

Comment 8 Jan Pokorný [poki] 2018-02-14 14:01:17 UTC
What is the assumed target timeframe for this?

If pacemaker 2+, then unless we want to drop support for external
stonith agents completely, I'd follow up on Ken's suggestion and
handle this implicitly during the validation schema upgrade, making
the necessary configuration changes by means of an XSL
transformation, per option 2 from [comment 5].

Comment 9 Jan Pokorný [poki] 2018-02-14 14:03:43 UTC
(Option 2 is [also] justified by the possible need to combine stonith
implementation providers.)

Comment 10 Jan Pokorný [poki] 2018-02-14 14:26:36 UTC
Actually, thinking more about this in the pacemaker 2 context,
semantic aliases are exactly the burden we want to get rid of
at the _user-facing level_ (so much easier not to have to explain
these gory details to the users!), hence it might be more beneficial
to ditch the "poweroff" choice once and for all there, even though the
mapping would still be applied under the hood.  1:1 continuity for
existing configurations can also be achieved with this approach at the
XSL level, as already sketched by Ken elsewhere.  From the standpoint of
[comment 9], the two should be equivalent at the end of the day, but
the overall simplicity would likely be better.  The pain point of
option 1 for LHA agents could then possibly be mitigated by the
high-level tools (crm, pcs).

Just thinking aloud :)

Comment 11 Ken Gaillot 2018-02-19 18:54:03 UTC
After further investigation, the situation is simpler. LHA agents don't take poweroff either -- the fence_legacy wrapper accepts poweroff, and maps it to off when calling the agent.

So, Pacemaker can always map poweroff to off, as originally thought. We can log a deprecation warning, and no schema transform is required.
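
The mapping described here can be sketched as below. Pacemaker itself is written in C, so this Python version is only an illustration under assumptions: normalize_stonith_action and the logger name are hypothetical, and the warning text is borrowed from the log output quoted later in this report.

```python
import logging

logger = logging.getLogger("stonith_action_sketch")

# Values this report says Pacemaker accepts for stonith-action.
ACCEPTED_VALUES = ("reboot", "off", "poweroff")


def normalize_stonith_action(value):
    """Return the action to request from fencing for a configured value."""
    if value not in ACCEPTED_VALUES:
        raise ValueError(f"unsupported stonith-action: {value!r}")
    if value == "poweroff":
        # Deprecated alias: warn, then treat it exactly like "off".
        logger.warning(
            "Support for stonith-action of 'poweroff' is deprecated and "
            "will be removed in a future release (use 'off' instead)"
        )
        return "off"
    return value
```

Because the alias is resolved before any agent is called, no per-agent handling and no schema transform is needed in this sketch.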

Comment 12 Ken Gaillot 2018-03-02 23:14:35 UTC
Fixed by upstream commit ebc8737f

Comment 14 michal novacek 2018-08-02 13:52:17 UTC
I have verified that stonith-action=poweroff is correctly recognised in pacemaker-1.1.19-3.el7.x86_64.

---

Common setup:
-------------
1) Configure a cluster with SBD fencing (1), (2) and stonith-action=poweroff (3).
2) Cause a kernel panic on one of the nodes to trigger fencing.


Before the fix (pacemaker-1.1.18-11.el7.x86_64)
-----------------------------------------------

fence_sbd fails with "Unrecognised action 'poweroff'".

$ cat /var/log/messages
...
Aug  2 08:46:43 host-003 stonith-ng[30388]:  notice: Requesting peer fencing (reboot) of host-002
Aug  2 08:46:43 host-003 pengine[30476]:  notice: Watchdog will be used via SBD if fencing is required
Aug  2 08:46:43 host-003 pengine[30476]: warning: Cluster node host-002 will be fenced: peer is no longer part of the cluster
Aug  2 08:46:43 host-003 pengine[30476]: warning: Node host-002 is unclean
Aug  2 08:46:43 host-003 pengine[30476]: warning: Action dlm:0_stop_0 on host-002 is unrunnable (offline)
Aug  2 08:46:43 host-003 pengine[30476]: warning: Action dlm:0_stop_0 on host-002 is unrunnable (offline)
Aug  2 08:46:43 host-003 pengine[30476]: warning: Action clvmd:0_stop_0 on host-002 is unrunnable (offline)
Aug  2 08:46:43 host-003 pengine[30476]: warning: Action clvmd:0_stop_0 on host-002 is unrunnable (offline)
Aug  2 08:46:43 host-003 pengine[30476]: warning: Action fence-sbd_stop_0 on host-002 is unrunnable (offline)
Aug  2 08:46:43 host-003 pengine[30476]: warning: Scheduling Node host-002 for STONITH
Aug  2 08:46:43 host-003 pengine[30476]:  notice:  * Fence (poweroff) host-002 'peer is no longer part of the cluster'
Aug  2 08:46:43 host-003 pengine[30476]:  notice:  * Stop       dlm:0   ( host-002 )   due to node availability
Aug  2 08:46:43 host-003 pengine[30476]:  notice:  * Stop       clvmd:0     ( host-002 )   due to node availability
Aug  2 08:46:43 host-003 pengine[30476]:  notice:  * Move       fence-sbd   ( host-002 -> host-003 )
Aug  2 08:46:43 host-003 pengine[30476]: warning: Calculated transition 0 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-1.bz2
Aug  2 08:46:43 host-003 crmd[30477]:  notice: Requesting fencing (poweroff) of node host-002
Aug  2 08:46:43 host-003 crmd[30477]:  notice: Initiating start operation fence-sbd_start_0 locally on host-003
Aug  2 08:46:43 host-003 stonith-ng[30388]:  notice: Client crmd.30477.093895b9 wants to fence (poweroff) 'host-002' with device '(any)'
Aug  2 08:46:43 host-003 stonith-ng[30388]:  notice: Requesting peer fencing (poweroff) of host-002
Aug  2 08:46:43 host-003 stonith-ng[30388]:  notice: Watchdog will be used via SBD if fencing is required
Aug  2 08:46:43 host-003 lrmd[30367]:  notice: forwarder:4141:stderr [ Error when submitting alert: [Errno 111] Connection refused ]
Aug  2 08:46:44 host-003 crmd[30477]:  notice: Result of start operation for fence-sbd on host-003: 0 (ok)
Aug  2 08:46:44 host-003 crmd[30477]:  notice: Initiating monitor operation fence-sbd_monitor_60000 locally on host-003
Aug  2 08:46:44 host-003 stonith-ng[30388]:  notice: Watchdog will be used via SBD if fencing is required
Aug  2 08:46:44 host-003 stonith-ng[30388]:  notice: Watchdog will be used via SBD if fencing is required
Aug  2 08:46:44 host-003 lrmd[30367]:  notice: forwarder:4181:stderr [ Error when submitting alert: [Errno 111] Connection refused ]
Aug  2 08:46:44 host-003 stonith-ng[30388]:  notice: fence-sbd can fence (poweroff) host-002: dynamic-list
Aug  2 08:46:44 host-003 stonith-ng[30388]:  notice: fence-sbd can fence (reboot) host-002: dynamic-list
Aug  2 08:46:44 host-003 stonith-ng[30388]:  notice: Watchdog will be used via SBD if fencing is required
Aug  2 08:46:44 host-003 lrmd[30367]:  notice: forwarder:4201:stderr [ Error when submitting alert: [Errno 111] Connection refused ]
Aug  2 08:46:44 host-003 fence_sbd: Failed: Unrecognised action 'poweroff'
Aug  2 08:46:44 host-003 fence_sbd: Please use '-h' for usage
Aug  2 08:46:44 host-003 stonith-ng[30388]: warning: fence_sbd[4200] stderr: [ 2018-08-02 08:46:44,728 ERROR: Failed: Unrecognised action 'poweroff' ]
Aug  2 08:46:44 host-003 stonith-ng[30388]: warning: fence_sbd[4200] stderr: [  ]
Aug  2 08:46:44 host-003 stonith-ng[30388]: warning: fence_sbd[4200] stderr: [ 2018-08-02 08:46:44,729 ERROR: Please use '-h' for usage ]
Aug  2 08:46:44 host-003 stonith-ng[30388]: warning: fence_sbd[4200] stderr: [  ]
Aug  2 08:46:45 host-003 fence_sbd: Failed: Unrecognised action 'poweroff'
Aug  2 08:46:45 host-003 fence_sbd: Please use '-h' for usage
Aug  2 08:46:45 host-003 stonith-ng[30388]: warning: fence_sbd[4207] stderr: [ 2018-08-02 08:46:45,828 ERROR: Failed: Unrecognised action 'poweroff' ]
Aug  2 08:46:45 host-003 stonith-ng[30388]: warning: fence_sbd[4207] stderr: [  ]
Aug  2 08:46:45 host-003 stonith-ng[30388]: warning: fence_sbd[4207] stderr: [ 2018-08-02 08:46:45,828 ERROR: Please use '-h' for usage ]
Aug  2 08:46:45 host-003 stonith-ng[30388]: warning: fence_sbd[4207] stderr: [  ]
Aug  2 08:46:45 host-003 stonith-ng[30388]:   error: Operation 'poweroff' [4207] (call 2 from crmd.30477) for host 'host-002' with device 'fence-sbd' returned: -95 (Operation not supported)
Aug  2 08:46:45 host-003 stonith-ng[30388]:  notice: Couldn't find anyone to fence (poweroff) host-002 with any device


After the fix (pacemaker-1.1.19-5.el7.x86_64)
---------------------------------------------

The node is fenced, and a deprecation warning for 'poweroff' appears in the log.

$ cat /var/log/messages
...
Aug  2 07:32:32 host-003 pengine[19887]:  notice: Watchdog will be used via SBD if fencing is required
>>> Aug  2 07:32:32 host-003 pengine[19887]: warning: Support for stonith-action of 'poweroff' is deprecated and will be removed in a future release (use 'off' instead)
Aug  2 07:32:32 host-003 pengine[19887]: warning: Cluster node host-002 will be fenced: peer is no longer part of the cluster
Aug  2 07:32:32 host-003 pengine[19887]: warning: Node host-002 is unclean
Aug  2 07:32:32 host-003 pengine[19887]: warning: Action dlm:0_stop_0 on host-002 is unrunnable (offline)
Aug  2 07:32:32 host-003 pengine[19887]: warning: Action dlm:0_stop_0 on host-002 is unrunnable (offline)
Aug  2 07:32:32 host-003 pengine[19887]: warning: Action clvmd:0_stop_0 on host-002 is unrunnable (offline)
Aug  2 07:32:32 host-003 pengine[19887]: warning: Action clvmd:0_stop_0 on host-002 is unrunnable (offline)
Aug  2 07:32:32 host-003 pengine[19887]: warning: Action fence-sbd_stop_0 on host-002 is unrunnable (offline)
Aug  2 07:32:32 host-003 pengine[19887]: warning: Scheduling Node host-002 for STONITH
Aug  2 07:32:32 host-003 pengine[19887]:  notice:  * Fence (off) host-002 'peer is no longer part of the cluster'
Aug  2 07:32:32 host-003 pengine[19887]:  notice:  * Stop       dlm:0   ( host-002 )   due to node availability
Aug  2 07:32:32 host-003 pengine[19887]:  notice:  * Stop       clvmd:0     ( host-002 )   due to node availability
Aug  2 07:32:32 host-003 pengine[19887]:  notice:  * Move       fence-sbd   ( host-002 -> host-003 )
Aug  2 07:32:32 host-003 pengine[19887]: warning: Calculated transition 0 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-0.bz2
Aug  2 07:32:32 host-003 crmd[19888]:  notice: Requesting fencing (off) of node host-002
Aug  2 07:32:32 host-003 crmd[19888]:  notice: Initiating start operation fence-sbd_start_0 locally on host-003
Aug  2 07:32:32 host-003 stonith-ng[19884]:  notice: Client crmd.19888.59334e90 wants to fence (off) 'host-002' with device '(any)'
Aug  2 07:32:32 host-003 stonith-ng[19884]:  notice: Requesting peer fencing (off) of host-002
Aug  2 07:32:32 host-003 stonith-ng[19884]:  notice: Watchdog will be used via SBD if fencing is required
Aug  2 07:32:32 host-003 lrmd[19885]:  notice: forwarder:25880:stderr [ Error when submitting alert: [Errno 111] Connection refused ]
Aug  2 07:32:33 host-003 stonith-ng[19884]:  notice: fence-sbd can fence (off) host-002: dynamic-list
Aug  2 07:32:33 host-003 stonith-ng[19884]:  notice: fence-sbd:0 can fence (off) host-002: dynamic-list
Aug  2 07:32:33 host-003 crmd[19888]:  notice: Result of start operation for fence-sbd on host-003: 0 (ok)
Aug  2 07:32:33 host-003 crmd[19888]:  notice: Initiating monitor operation fence-sbd_monitor_60000 locally on host-003
Aug  2 07:32:33 host-003 stonith-ng[19884]:  notice: Watchdog will be used via SBD if fencing is required
Aug  2 07:32:33 host-003 stonith-ng[19884]:  notice: Watchdog will be used via SBD if fencing is required
Aug  2 07:32:33 host-003 stonith-ng[19884]:  notice: fence-sbd can fence (reboot) host-002: dynamic-list
Aug  2 07:32:33 host-003 stonith-ng[19884]:  notice: fence-sbd:0 can fence (reboot) host-002: dynamic-list
Aug  2 07:32:33 host-003 lrmd[19885]:  notice: forwarder:25951:stderr [ Error when submitting alert: [Errno 111] Connection refused ]
Aug  2 07:32:44 host-003 stonith-ng[19884]:  notice: Operation 'off' [25950] (call 2 from crmd.19888) for host 'host-002' with device 'fence-sbd' returned: 0 (OK)
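
The two log excerpts above differ in three markers: the deprecation warning, the agent's "Unrecognised action" error, and the final operation result. As a rough aid for repeating this verification, the Python sketch below scans log lines for those markers. The function name summarize_fencing_log and the regular expressions are assumptions derived only from the lines quoted in this comment; other Pacemaker versions may word these messages differently.

```python
import re

DEPRECATION = re.compile(r"Support for stonith-action of 'poweroff' is deprecated")
UNRECOGNISED = re.compile(r"Unrecognised action 'poweroff'")
# Matches e.g. "Operation 'off' [25950] (...) returned: 0 (OK)"
RESULT = re.compile(
    r"Operation '(?P<action>[^']+)' \[\d+\] .* "
    r"returned: (?P<rc>-?\d+) \((?P<text>[^)]*)\)"
)


def summarize_fencing_log(lines):
    """Collect the markers that distinguish the before/after excerpts."""
    summary = {
        "deprecation_warned": False,
        "agent_rejected_poweroff": False,
        "results": [],  # list of (action, return code, description)
    }
    for line in lines:
        if DEPRECATION.search(line):
            summary["deprecation_warned"] = True
        if UNRECOGNISED.search(line):
            summary["agent_rejected_poweroff"] = True
        match = RESULT.search(line)
        if match:
            summary["results"].append(
                (match.group("action"), int(match.group("rc")), match.group("text"))
            )
    return summary
```

It could be fed the contents of /var/log/messages, for instance via open("/var/log/messages").read().splitlines().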

----

(1) pcs status
[root@host-003 ~]# pcs status
Cluster name: STSRHTS10103
Stack: corosync
Current DC: host-003 (version 1.1.19-3.el7-c3c624ea3d) - partition with quorum
Last updated: Thu Aug  2 07:36:13 2018
Last change: Thu Aug  2 06:57:43 2018 by root via cibadmin on host-002

2 nodes configured
5 resources configured

Online: [ host-002 host-003 ]

Full list of resources:

 Clone Set: dlm-clone [dlm]
     Started: [ host-002 host-003 ]
 Clone Set: clvmd-clone [clvmd]
     clvmd	(ocf::heartbeat:clvm):	Starting host-002
     Started: [ host-003 ]
 fence-sbd	(stonith:fence_sbd):	Started host-003

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
  sbd: active/enabled

(2) pcs config
[root@host-003 ~]# pcs config
Cluster Name: STSRHTS10103
Corosync Nodes:
 host-002 host-003
Pacemaker Nodes:
 host-002 host-003

Resources:
 Clone: dlm-clone
  Meta Attrs: interleave=true ordered=true
  Resource: dlm (class=ocf provider=pacemaker type=controld)
   Operations: monitor interval=30s on-fail=fence (dlm-monitor-interval-30s)
               start interval=0s timeout=90 (dlm-start-interval-0s)
               stop interval=0s timeout=100 (dlm-stop-interval-0s)
 Clone: clvmd-clone
  Meta Attrs: interleave=true ordered=true
  Resource: clvmd (class=ocf provider=heartbeat type=clvm)
   Attributes: with_cmirrord=1
   Operations: monitor interval=30s on-fail=fence (clvmd-monitor-interval-30s)
               start interval=0s timeout=90s (clvmd-start-interval-0s)
               stop interval=0s timeout=90s (clvmd-stop-interval-0s)

Stonith Devices:
 Resource: fence-sbd (class=stonith type=fence_sbd)
  Attributes: devices=/dev/disk/by-id/scsi-3600140565f448b3e6d8447b80f5e4010
  Operations: monitor interval=60s (fence-sbd-monitor-interval-60s)
Fencing Levels:

Location Constraints:
Ordering Constraints:
  start dlm-clone then start clvmd-clone (kind:Mandatory)
Colocation Constraints:
  clvmd-clone with dlm-clone (score:INFINITY)
Ticket Constraints:

Alerts:
 Alert: forwarder (path=/usr/tests/sts-rhel7.6/pacemaker/alerts/alert_forwarder.py)
  Recipients:
   Recipient: forwarder-recipient (value=http://host-001.virt.lab.msp.redhat.com:40879/)

Resources Defaults:
 No defaults set
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: STSRHTS10103
 dc-version: 1.1.19-3.el7-c3c624ea3d
 have-watchdog: true
 no-quorum-policy: freeze
 stonith-action: poweroff

Quorum:
  Options:

(3) pcs property
[root@host-003 ~]# pcs property
Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: STSRHTS10103
 dc-version: 1.1.19-3.el7-c3c624ea3d
 have-watchdog: true
 no-quorum-policy: freeze
>stonith-action: poweroff

Comment 16 errata-xmlrpc 2018-10-30 07:57:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3055