Bug 1469255
| Summary: | stonith-action=poweroff leads to failure in fence-agent | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Klaus Wenninger <kwenning> |
| Component: | pacemaker | Assignee: | Ken Gaillot <kgaillot> |
| Status: | CLOSED ERRATA | QA Contact: | michal novacek <mnovacek> |
| Severity: | unspecified | Priority: | unspecified |
| Version: | 7.4 | CC: | abeekhof, cfeist, cluster-maint, jpokorny, kgaillot, mlisik, mnovacek |
| Target Milestone: | rc | Target Release: | 7.6 |
| Hardware: | Unspecified | OS: | Unspecified |
| Fixed In Version: | pacemaker-1.1.18-12.el7 | Doc Type: | Bug Fix |
| Last Closed: | 2018-10-30 07:57:39 UTC | Type: | Bug |

Doc Text:

Cause: Fence agents do not support the "poweroff" action, even though it is documented as an allowed value for the stonith-action cluster property.

Consequence: Setting the stonith-action cluster property to "poweroff" would cause fencing to fail.

Fix: The "poweroff" value is now deprecated. If a configuration contains a value of "poweroff", Pacemaker will automatically convert it to "off", and log a deprecation warning.

Result: Fencing works properly when stonith-action is set to "poweroff".
Description
-----------

Klaus Wenninger 2017-07-10 18:19:24 UTC

qa-ack+: setting stonith-action=poweroff must work for all the fence agents

---

Moved to rhel-7.6 due to effort constraints.

---

Really? It shouldn't be that hard to map 'poweroff' to 'off' inside the stonith library.

---

Pacemaker currently accepts the values "reboot", "off", or "poweroff" for stonith-action. LHA-style external/* agents (which are supported upstream, but not in RHEL) do support "poweroff", so remapping "poweroff" to "off" globally would break those.

I see two reasonable approaches:

1. Drop support for stonith-action=poweroff. Anyone who wants to use poweroff with LHA agents must set stonith-action=off and pcmk_off_action=poweroff (a command sketch follows this comment). This is the cleanest and easiest option development-wise, but involves some pain for LHA users.

2. Remap stonith-action=poweroff to stonith-action=off, and for LHA agents also assume pcmk_off_action=poweroff if not otherwise set. This is the easiest option for all users.

Opinions?
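For concreteness, the workaround in option 1 would look roughly like the following pcs commands (a sketch only; the device name "fence-lha" is hypothetical):

# Request the generic "off" action cluster-wide:
pcs property set stonith-action=off
# For the hypothetical LHA-style device "fence-lha", run the agent's
# "poweroff" action whenever Pacemaker requests "off":
pcs stonith update fence-lha pcmk_off_action=poweroff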
---

Suggestion for a 3rd option:

3. Let pcs do some checks during execution of the commands 'pcs stonith create' and 'pcs property set stonith-action=<value>'. Before a stonith resource is created, check whether the value of 'stonith-action' is supported by the actions of the fence agent. Before 'stonith-action' is changed by 'pcs property', check whether the given value is coherent with the actions of the fence agent currently in use.

---

(In reply to Miroslav Lisik from comment #6)
> Suggestion for a 3rd option:
>
> 3. Let pcs do some checks during execution of the commands 'pcs stonith create' and 'pcs property set stonith-action=<value>'. [...]

That's feasible, but a little more complicated than that. If stonith-action=off and a fence agent doesn't support "off", the agent can still be used as long as its pcmk_off_action is set to an action it does support (and similarly for reboot). Also, users should be allowed to add devices that aren't used by the cluster (yet), so --force should override the check. Even with such checks, we'd still need one of the other two options to handle both RH and LHA fence agents upstream. (A sketch of the kind of agent-capability check involved follows this comment.)
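As a rough illustration of the kind of capability check being discussed (an assumption about how a tool could do it, not what pcs implements): agents from the fence-agents package advertise their supported actions in the XML metadata printed by '-o metadata', so a check might look like:

# Ask fence_sbd for its metadata and test whether a desired
# stonith-action value appears among the advertised actions:
action=off
if fence_sbd -o metadata | grep -q "action name=\"$action\""; then
    echo "fence_sbd supports '$action'"
else
    echo "fence_sbd does not support '$action'" >&2
fi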
---

What is the assumed target timeframe for this? If pacemaker 2+, then unless we want to drop support for external stonith agents completely, I'd follow up on Ken's suggestion to handle this implicitly at validation schema upgrade, making the necessary configuration changes by means of an XSL transformation, per option 2 from [comment 5]. (Option 2 is [also] justified by the possible need to combine stonith implementation providers.)

---

Actually, thinking more about this in the pacemaker 2 context: the semantic aliases are exactly the burden we want to get rid of at the _user-facing level_ (so much easier not to have to explain these gory details to users!), so it might be more beneficial to ditch the "poweroff" choice there once and for all, even if it would still be applied under the hood. With this approach, 1:1 continuity for existing configurations can also be achieved at the XSL level, as already sketched by Ken elsewhere. From the standpoint of [comment 9] it should come out equal at the end of the day, but the overall simplicity would likely be better. The pain point of option 1 for LHA agents could then be mitigated by the high-level tools (crm, pcs). Just thinking aloud :)

---

After further investigation, the situation is simpler. LHA agents don't take poweroff either -- the fence_legacy wrapper accepts poweroff, and maps it to off when calling the agent. So Pacemaker can always map poweroff to off, as originally thought. We can log a deprecation warning, and no schema transform is required.

Fixed by upstream commit ebc8737f.

---

I have verified that stonith-action=poweroff is correctly recognised in pacemaker-1.1.19-3.el7.x86_64.

Common setup:
-------------

1) Configure the cluster with sbd fencing (1), (2) and stonith-action=poweroff (3); a command sketch follows this list.
2) Cause a kernel panic on one of the nodes to trigger fencing.
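The setup can be reproduced roughly as follows (a sketch; it assumes SBD is already enabled on the nodes, and the device path is the one shown in the pcs config under (2) below):

# 1) Create the SBD fence device and set the cluster property under test:
pcs stonith create fence-sbd fence_sbd \
    devices=/dev/disk/by-id/scsi-3600140565f448b3e6d8447b80f5e4010
pcs property set stonith-action=poweroff

# 2) Panic one node via magic SysRq so the surviving node must fence it:
echo c > /proc/sysrq-trigger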
Before the fix (pacemaker-1.1.18-11.el7.x86_64)
-----------------------------------------------

fence_sbd returns "Unrecognised action 'poweroff'".

$ cat /var/log/messages
...
Aug 2 08:46:43 host-003 stonith-ng[30388]: notice: Requesting peer fencing (reboot) of host-002
Aug 2 08:46:43 host-003 pengine[30476]: notice: Watchdog will be used via SBD if fencing is required
Aug 2 08:46:43 host-003 pengine[30476]: warning: Cluster node host-002 will be fenced: peer is no longer part of the cluster
Aug 2 08:46:43 host-003 pengine[30476]: warning: Node host-002 is unclean
Aug 2 08:46:43 host-003 pengine[30476]: warning: Action dlm:0_stop_0 on host-002 is unrunnable (offline)
Aug 2 08:46:43 host-003 pengine[30476]: warning: Action dlm:0_stop_0 on host-002 is unrunnable (offline)
Aug 2 08:46:43 host-003 pengine[30476]: warning: Action clvmd:0_stop_0 on host-002 is unrunnable (offline)
Aug 2 08:46:43 host-003 pengine[30476]: warning: Action clvmd:0_stop_0 on host-002 is unrunnable (offline)
Aug 2 08:46:43 host-003 pengine[30476]: warning: Action fence-sbd_stop_0 on host-002 is unrunnable (offline)
Aug 2 08:46:43 host-003 pengine[30476]: warning: Scheduling Node host-002 for STONITH
Aug 2 08:46:43 host-003 pengine[30476]: notice: * Fence (poweroff) host-002 'peer is no longer part of the cluster'
Aug 2 08:46:43 host-003 pengine[30476]: notice: * Stop dlm:0 ( host-002 ) due to node availability
Aug 2 08:46:43 host-003 pengine[30476]: notice: * Stop clvmd:0 ( host-002 ) due to node availability
Aug 2 08:46:43 host-003 pengine[30476]: notice: * Move fence-sbd ( host-002 -> host-003 )
Aug 2 08:46:43 host-003 pengine[30476]: warning: Calculated transition 0 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-1.bz2
Aug 2 08:46:43 host-003 crmd[30477]: notice: Requesting fencing (poweroff) of node host-002
Aug 2 08:46:43 host-003 crmd[30477]: notice: Initiating start operation fence-sbd_start_0 locally on host-003
Aug 2 08:46:43 host-003 stonith-ng[30388]: notice: Client crmd.30477.093895b9 wants to fence (poweroff) 'host-002' with device '(any)'
Aug 2 08:46:43 host-003 stonith-ng[30388]: notice: Requesting peer fencing (poweroff) of host-002
Aug 2 08:46:43 host-003 stonith-ng[30388]: notice: Watchdog will be used via SBD if fencing is required
Aug 2 08:46:43 host-003 lrmd[30367]: notice: forwarder:4141:stderr [ Error when submitting alert: [Errno 111] Connection refused ]
Aug 2 08:46:44 host-003 crmd[30477]: notice: Result of start operation for fence-sbd on host-003: 0 (ok)
Aug 2 08:46:44 host-003 crmd[30477]: notice: Initiating monitor operation fence-sbd_monitor_60000 locally on host-003
Aug 2 08:46:44 host-003 stonith-ng[30388]: notice: Watchdog will be used via SBD if fencing is required
Aug 2 08:46:44 host-003 stonith-ng[30388]: notice: Watchdog will be used via SBD if fencing is required
Aug 2 08:46:44 host-003 lrmd[30367]: notice: forwarder:4181:stderr [ Error when submitting alert: [Errno 111] Connection refused ]
Aug 2 08:46:44 host-003 stonith-ng[30388]: notice: fence-sbd can fence (poweroff) host-002: dynamic-list
Aug 2 08:46:44 host-003 stonith-ng[30388]: notice: fence-sbd can fence (reboot) host-002: dynamic-list
Aug 2 08:46:44 host-003 stonith-ng[30388]: notice: Watchdog will be used via SBD if fencing is required
Aug 2 08:46:44 host-003 lrmd[30367]: notice: forwarder:4201:stderr [ Error when submitting alert: [Errno 111] Connection refused ]
Aug 2 08:46:44 host-003 fence_sbd: Failed: Unrecognised action 'poweroff'
Aug 2 08:46:44 host-003 fence_sbd: Please use '-h' for usage
Aug 2 08:46:44 host-003 stonith-ng[30388]: warning: fence_sbd[4200] stderr: [ 2018-08-02 08:46:44,728 ERROR: Failed: Unrecognised action 'poweroff' ]
Aug 2 08:46:44 host-003 stonith-ng[30388]: warning: fence_sbd[4200] stderr: [ ]
Aug 2 08:46:44 host-003 stonith-ng[30388]: warning: fence_sbd[4200] stderr: [ 2018-08-02 08:46:44,729 ERROR: Please use '-h' for usage ]
Aug 2 08:46:44 host-003 stonith-ng[30388]: warning: fence_sbd[4200] stderr: [ ]
Aug 2 08:46:45 host-003 fence_sbd: Failed: Unrecognised action 'poweroff'
Aug 2 08:46:45 host-003 fence_sbd: Please use '-h' for usage
Aug 2 08:46:45 host-003 stonith-ng[30388]: warning: fence_sbd[4207] stderr: [ 2018-08-02 08:46:45,828 ERROR: Failed: Unrecognised action 'poweroff' ]
Aug 2 08:46:45 host-003 stonith-ng[30388]: warning: fence_sbd[4207] stderr: [ ]
Aug 2 08:46:45 host-003 stonith-ng[30388]: warning: fence_sbd[4207] stderr: [ 2018-08-02 08:46:45,828 ERROR: Please use '-h' for usage ]
Aug 2 08:46:45 host-003 stonith-ng[30388]: warning: fence_sbd[4207] stderr: [ ]
Aug 2 08:46:45 host-003 stonith-ng[30388]: error: Operation 'poweroff' [4207] (call 2 from crmd.30477) for host 'host-002' with device 'fence-sbd' returned: -95 (Operation not supported)
Aug 2 08:46:45 host-003 stonith-ng[30388]: notice: Couldn't find anyone to fence (poweroff) host-002 with any device
After the fix (pacemaker-1.1.19-5.el7.x86_64)
---------------------------------------------

The node is fenced, with a warning about the deprecation of 'poweroff' in the log files.

$ cat /var/log/messages
...
Aug 2 07:32:32 host-003 pengine[19887]: notice: Watchdog will be used via SBD if fencing is required
>>> Aug 2 07:32:32 host-003 pengine[19887]: warning: Support for stonith-action of 'poweroff' is deprecated and will be removed in a future release (use 'off' instead)
Aug 2 07:32:32 host-003 pengine[19887]: warning: Cluster node host-002 will be fenced: peer is no longer part of the cluster
Aug 2 07:32:32 host-003 pengine[19887]: warning: Node host-002 is unclean
Aug 2 07:32:32 host-003 pengine[19887]: warning: Action dlm:0_stop_0 on host-002 is unrunnable (offline)
Aug 2 07:32:32 host-003 pengine[19887]: warning: Action dlm:0_stop_0 on host-002 is unrunnable (offline)
Aug 2 07:32:32 host-003 pengine[19887]: warning: Action clvmd:0_stop_0 on host-002 is unrunnable (offline)
Aug 2 07:32:32 host-003 pengine[19887]: warning: Action clvmd:0_stop_0 on host-002 is unrunnable (offline)
Aug 2 07:32:32 host-003 pengine[19887]: warning: Action fence-sbd_stop_0 on host-002 is unrunnable (offline)
Aug 2 07:32:32 host-003 pengine[19887]: warning: Scheduling Node host-002 for STONITH
Aug 2 07:32:32 host-003 pengine[19887]: notice: * Fence (off) host-002 'peer is no longer part of the cluster'
Aug 2 07:32:32 host-003 pengine[19887]: notice: * Stop dlm:0 ( host-002 ) due to node availability
Aug 2 07:32:32 host-003 pengine[19887]: notice: * Stop clvmd:0 ( host-002 ) due to node availability
Aug 2 07:32:32 host-003 pengine[19887]: notice: * Move fence-sbd ( host-002 -> host-003 )
Aug 2 07:32:32 host-003 pengine[19887]: warning: Calculated transition 0 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-0.bz2
Aug 2 07:32:32 host-003 crmd[19888]: notice: Requesting fencing (off) of node host-002
Aug 2 07:32:32 host-003 crmd[19888]: notice: Initiating start operation fence-sbd_start_0 locally on host-003
Aug 2 07:32:32 host-003 stonith-ng[19884]: notice: Client crmd.19888.59334e90 wants to fence (off) 'host-002' with device '(any)'
Aug 2 07:32:32 host-003 stonith-ng[19884]: notice: Requesting peer fencing (off) of host-002
Aug 2 07:32:32 host-003 stonith-ng[19884]: notice: Watchdog will be used via SBD if fencing is required
Aug 2 07:32:32 host-003 lrmd[19885]: notice: forwarder:25880:stderr [ Error when submitting alert: [Errno 111] Connection refused ]
Aug 2 07:32:33 host-003 stonith-ng[19884]: notice: fence-sbd can fence (off) host-002: dynamic-list
Aug 2 07:32:33 host-003 stonith-ng[19884]: notice: fence-sbd:0 can fence (off) host-002: dynamic-list
Aug 2 07:32:33 host-003 crmd[19888]: notice: Result of start operation for fence-sbd on host-003: 0 (ok)
Aug 2 07:32:33 host-003 crmd[19888]: notice: Initiating monitor operation fence-sbd_monitor_60000 locally on host-003
Aug 2 07:32:33 host-003 stonith-ng[19884]: notice: Watchdog will be used via SBD if fencing is required
Aug 2 07:32:33 host-003 stonith-ng[19884]: notice: Watchdog will be used via SBD if fencing is required
Aug 2 07:32:33 host-003 stonith-ng[19884]: notice: fence-sbd can fence (reboot) host-002: dynamic-list
Aug 2 07:32:33 host-003 stonith-ng[19884]: notice: fence-sbd:0 can fence (reboot) host-002: dynamic-list
Aug 2 07:32:33 host-003 lrmd[19885]: notice: forwarder:25951:stderr [ Error when submitting alert: [Errno 111] Connection refused ]
Aug 2 07:32:44 host-003 stonith-ng[19884]: notice: Operation 'off' [25950] (call 2 from crmd.19888) for host 'host-002' with device 'fence-sbd' returned: 0 (OK)

----
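On a fixed build, the remap is quick to confirm by searching the log for the deprecation warning highlighted above:

grep "Support for stonith-action of 'poweroff' is deprecated" /var/log/messages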
(1) pcs status

[root@host-003 ~]# pcs status
Cluster name: STSRHTS10103
Stack: corosync
Current DC: host-003 (version 1.1.19-3.el7-c3c624ea3d) - partition with quorum
Last updated: Thu Aug 2 07:36:13 2018
Last change: Thu Aug 2 06:57:43 2018 by root via cibadmin on host-002

2 nodes configured
5 resources configured

Online: [ host-002 host-003 ]

Full list of resources:

 Clone Set: dlm-clone [dlm]
     Started: [ host-002 host-003 ]
 Clone Set: clvmd-clone [clvmd]
     clvmd (ocf::heartbeat:clvm): Starting host-002
     Started: [ host-003 ]
 fence-sbd (stonith:fence_sbd): Started host-003

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
  sbd: active/enabled

(2) pcs config

[root@host-003 ~]# pcs config
Cluster Name: STSRHTS10103
Corosync Nodes:
 host-002 host-003
Pacemaker Nodes:
 host-002 host-003

Resources:
 Clone: dlm-clone
  Meta Attrs: interleave=true ordered=true
  Resource: dlm (class=ocf provider=pacemaker type=controld)
   Operations: monitor interval=30s on-fail=fence (dlm-monitor-interval-30s)
               start interval=0s timeout=90 (dlm-start-interval-0s)
               stop interval=0s timeout=100 (dlm-stop-interval-0s)
 Clone: clvmd-clone
  Meta Attrs: interleave=true ordered=true
  Resource: clvmd (class=ocf provider=heartbeat type=clvm)
   Attributes: with_cmirrord=1
   Operations: monitor interval=30s on-fail=fence (clvmd-monitor-interval-30s)
               start interval=0s timeout=90s (clvmd-start-interval-0s)
               stop interval=0s timeout=90s (clvmd-stop-interval-0s)

Stonith Devices:
 Resource: fence-sbd (class=stonith type=fence_sbd)
  Attributes: devices=/dev/disk/by-id/scsi-3600140565f448b3e6d8447b80f5e4010
  Operations: monitor interval=60s (fence-sbd-monitor-interval-60s)
Fencing Levels:

Location Constraints:
Ordering Constraints:
  start dlm-clone then start clvmd-clone (kind:Mandatory)
Colocation Constraints:
  clvmd-clone with dlm-clone (score:INFINITY)
Ticket Constraints:

Alerts:
 Alert: forwarder (path=/usr/tests/sts-rhel7.6/pacemaker/alerts/alert_forwarder.py)
  Recipients:
   Recipient: forwarder-recipient (value=http://host-001.virt.lab.msp.redhat.com:40879/)

Resources Defaults:
 No defaults set
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: STSRHTS10103
 dc-version: 1.1.19-3.el7-c3c624ea3d
 have-watchdog: true
 no-quorum-policy: freeze
 stonith-action: poweroff

Quorum:
  Options:

(3) pcs property

[root@host-003 ~]# pcs property
Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: STSRHTS10103
 dc-version: 1.1.19-3.el7-c3c624ea3d
 have-watchdog: true
 no-quorum-policy: freeze
>>> stonith-action: poweroff

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3055