Bug 1954099

Summary: Prevent fence_sbd in combination with stonith-watchdog-timeout>0
Product: Red Hat Enterprise Linux 8
Reporter: Nina Hostakova <nhostako>
Component: pcs
Assignee: Tomas Jelinek <tojeline>
Status: CLOSED ERRATA
QA Contact: cluster-qe <cluster-qe>
Severity: urgent
Docs Contact: Steven J. Levine <slevine>
Priority: urgent
Version: 8.4
CC: cluster-maint, idevat, kmalyjur, kwenning, lichen, mlisik, mmazoure, mpospisi, omular, sbradley, tojeline
Target Milestone: beta
Keywords: Triaged
Target Release: 8.7
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: pcs-0.10.13-1.el8
Doc Type: Bug Fix
Doc Text:
.`pcs` now validates the value of `stonith-watchdog-timeout`
Previously, it was possible to set the `stonith-watchdog-timeout` property to a value that is incompatible with the SBD configuration. This could result in a fence loop, or could cause the cluster to consider a fencing action successful even though the action had not finished. With this fix, `pcs` validates the value of the `stonith-watchdog-timeout` property when you set it, to prevent incorrect configuration.
Story Points: ---
Clone Of:
: 2058246 (view as bug list)
Environment:
Last Closed: 2022-11-08 09:12:53 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Comment 2 Tomas Jelinek 2021-04-29 08:23:59 UTC
There are cluster properties managed automatically by pacemaker or pcs: cluster-infrastructure, cluster-name, dc-version, have-watchdog, last-lrm-refresh, stonith-watchdog-timeout. The 'pcs property' commands could check if users are trying to modify those, print an error message saying those are managed automatically, and require --force to proceed with changing them.
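
A minimal sketch of how such a check could look (hypothetical names, not actual pcs code):

# Hypothetical sketch of the proposed check; names do not reflect pcs internals.
AUTO_MANAGED_PROPERTIES = {
    "cluster-infrastructure", "cluster-name", "dc-version",
    "have-watchdog", "last-lrm-refresh", "stonith-watchdog-timeout",
}

def check_auto_managed(properties_to_set, force=False):
    """Return one error message per automatically managed property the user
    is trying to modify, unless --force was given."""
    if force:
        return []
    return [
        f"Error: property '{name}' is managed automatically, use --force to override"
        for name in properties_to_set
        if name in AUTO_MANAGED_PROPERTIES
    ]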

Comment 3 Nina Hostakova 2021-04-29 12:30:16 UTC
This approach would probably not work for watchdog-only sbd fencing where, on the contrary, the stonith-watchdog-timeout property needs to be set up manually so that watchdog fencing works properly.

Comment 4 Nina Hostakova 2021-04-30 12:18:16 UTC
Digging more into stonith-watchdog-timeout, we have also found issues when configuring the property with pcs for sbd watchdog fencing (no disks):


1. When setting the stonith-watchdog-timeout property, the value needs to exceed SBD_WATCHDOG_TIMEOUT. If it doesn't, the cluster will end up in a fencing loop on all nodes (if the cluster is enabled). 'pcs property set' should instead give an error and refuse to set it:

[root@virt-247 ~]# pcs stonith sbd config
SBD_DELAY_START=no
SBD_STARTMODE=always
SBD_WATCHDOG_TIMEOUT=5

Watchdogs:
  virt-247: /dev/watchdog
  virt-246: /dev/watchdog
  virt-248: /dev/watchdog

[root@virt-247 ~]# pcs property show --all | grep stonith-watchdog-timeout
 stonith-watchdog-timeout: 0

[root@virt-247 ~]# pcs property set stonith-watchdog-timeout=3

Broadcast message from systemd-journald.lab.eng.brq.redhat.com (Fri 2021-04-30 11:59:31 CEST):

pacemaker-controld[7343]:  emerg: Shutting down: stonith-watchdog-timeout (3) too short (must be >5000ms)


2. If the stonith-watchdog-timeout property is set to a negative number, pcs rejects the value with an error (--force needs to be used), even though it should be a supported configuration:

# man pacemaker-controld
...
If `stonith-watchdog-timeout` is set to a negative value, and
`SBD_WATCHDOG_TIMEOUT` is set, twice that value will be used.
...
[root@virt-023 ~]# pcs property set stonith-watchdog-timeout=-1
Error: invalid value of property: 'stonith-watchdog-timeout=-1', (use --force to override)
[root@virt-023 ~]# echo $?
1
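
For illustration only, the semantics quoted from the man page could be expressed as the following Python sketch (this is not pacemaker code):

# Illustration of the semantics quoted above from the pacemaker-controld man
# page; this is not pacemaker code.
def effective_stonith_watchdog_timeout(stonith_watchdog_timeout, sbd_watchdog_timeout=None):
    """Return the effective timeout in seconds: a negative
    stonith-watchdog-timeout means "use twice SBD_WATCHDOG_TIMEOUT" when
    SBD_WATCHDOG_TIMEOUT is set."""
    if stonith_watchdog_timeout < 0 and sbd_watchdog_timeout is not None:
        return 2 * sbd_watchdog_timeout
    return stonith_watchdog_timeout

# With SBD_WATCHDOG_TIMEOUT=5, stonith-watchdog-timeout=-1 gives an effective
# timeout of 10 seconds:
print(effective_stonith_watchdog_timeout(-1, 5))  # 10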


The scope of the original bz is only to prevent misconfiguration for sbd with disks. The question now is whether to use this bz for fixing the overall pcs validation of the stonith-watchdog-timeout property, or to create separate bzs for the individual issues. It depends on which approach works better for pcs.

Comment 6 Tomas Jelinek 2021-05-06 09:15:20 UTC
To summarize:
* When SBD is used without devices, stonith-watchdog-timeout must be set to a value greater than SBD_WATCHDOG_TIMEOUT. This cannot be done automatically by pcs, as the property must be set after the cluster is restarted. The restart is not done automatically by pcs and is left to the users so that it does not disrupt cluster operation.
* When SBD is used with devices, stonith-watchdog-timeout must not be set to a value greater than 0.

Action items:
When the stonith-watchdog-timeout property is being set by a user, check whether SBD is used with or without devices. If devices are used, prevent the property from being set if its value is not 0 or empty. If devices are not used, prevent the property from being set if its value is not greater than SBD_WATCHDOG_TIMEOUT. If SBD is not used at all, prevent the property from being set if its value is not 0 or empty.
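
A minimal Python sketch of these rules (hypothetical function name, not the actual upstream patch):

# Sketch of the validation rules described above; not the actual upstream patch.
def validate_stonith_watchdog_timeout(value, sbd_enabled, sbd_has_devices,
                                      sbd_watchdog_timeout):
    """Return an error string if setting `value` should be rejected, else None.

    `value` is the requested stonith-watchdog-timeout in seconds, or None for
    "unset"; `sbd_watchdog_timeout` is the SBD_WATCHDOG_TIMEOUT of the nodes.
    """
    if not sbd_enabled or sbd_has_devices:
        # SBD disabled, or enabled with shared devices: only unset or 0 allowed.
        if value in (None, 0):
            return None
        return "stonith-watchdog-timeout can only be unset or set to 0"
    # SBD enabled without devices: the value must exceed SBD_WATCHDOG_TIMEOUT.
    if value is not None and value <= sbd_watchdog_timeout:
        return (
            "stonith-watchdog-timeout must be greater than "
            f"SBD watchdog timeout '{sbd_watchdog_timeout}'"
        )
    return None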

Comment 7 Nina Hostakova 2021-05-06 11:39:58 UTC
Tomas, thanks for summing up, that is exactly what we think should be done.

Comment 12 Tomas Jelinek 2022-03-28 12:04:02 UTC
Upstream patch: https://github.com/ClusterLabs/pcs/commit/f3561eabe69cd3584673040780900c589f64f3b4

Test:
Using 'pcs property set stonith-watchdog-timeout=<value>', set stonith-watchdog-timeout to
* 0,
* a value greater than SBD_WATCHDOG_TIMEOUT,
* a value not greater than SBD_WATCHDOG_TIMEOUT,
* an empty value (unset).
Do it while:
* SBD is disabled,
* SBD is enabled with no devices,
* SBD is enabled with devices.
Verify that pcs returns an error and doesn't set the property in situations which would lead to fence loops or unreliable fencing.
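
A hypothetical helper for stepping through this matrix on a test cluster could look like this (it only checks the pcs exit code, it does not verify the resulting cluster behaviour):

# Hypothetical helper for the test matrix above; it only runs
# 'pcs property set' and checks the exit code.
import subprocess

def try_set(value):
    """Run 'pcs property set stonith-watchdog-timeout=<value>' and return
    True if pcs accepted it (exit code 0)."""
    result = subprocess.run(
        ["pcs", "property", "set", f"stonith-watchdog-timeout={value}"],
        capture_output=True, text=True,
    )
    return result.returncode == 0

# Example for "SBD enabled with no devices, SBD_WATCHDOG_TIMEOUT=5":
# try_set("3") and try_set("-1") are expected to return False, try_set("10") True.
# try_set("") corresponds to unsetting the property.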

Comment 14 Miroslav Lisik 2022-05-26 08:42:04 UTC
DevTestResults:

[root@r8-node-01 ~]# rpm -q pcs
pcs-0.10.13-1.el8.x86_64

1) enabled with devices

[root@r8-node-01 ~]# pcs stonith sbd config
SBD_DELAY_START=no
SBD_STARTMODE=always
SBD_WATCHDOG_TIMEOUT=5

Watchdogs:
  r8-node-01: /dev/watchdog
  r8-node-02: /dev/watchdog
  r8-node-03: /dev/watchdog

Devices:
  r8-node-01: "/dev/disk/by-id/scsi-3600140500e2fe60a3eb479bb39ca8d3d"
  r8-node-02: "/dev/disk/by-id/scsi-3600140500e2fe60a3eb479bb39ca8d3d"
  r8-node-03: "/dev/disk/by-id/scsi-3600140500e2fe60a3eb479bb39ca8d3d"

[root@r8-node-01 ~]# pcs property set stonith-watchdog-timeout=-1
Error: stonith-watchdog-timeout can only be unset or set to 0 while SBD is enabled with devices, use --force to override
[root@r8-node-01 ~]# pcs property set stonith-watchdog-timeout=0
[root@r8-node-01 ~]# pcs property | grep stonith-watchdog-timeout
 stonith-watchdog-timeout: 0
[root@r8-node-01 ~]# pcs property set stonith-watchdog-timeout=
[root@r8-node-01 ~]# pcs property | grep stonith-watchdog-timeout

2) enabled without devices

[root@r8-node-01 ~]# pcs stonith sbd config
SBD_DELAY_START=no
SBD_STARTMODE=always
SBD_WATCHDOG_TIMEOUT=5

Watchdogs:
  r8-node-01: /dev/watchdog
  r8-node-02: /dev/watchdog
  r8-node-03: /dev/watchdog

[root@r8-node-01 ~]# pcs property set stonith-watchdog-timeout=3
Error: The stonith-watchdog-timeout must be greater than SBD watchdog timeout '5', entered '3', use --force to override
[root@r8-node-01 ~]# pcs property set stonith-watchdog-timeout=-1
Error: The stonith-watchdog-timeout must be greater than SBD watchdog timeout '5', entered '-1', use --force to override
[root@r8-node-01 ~]# pcs property set stonith-watchdog-timeout=10
[root@r8-node-01 ~]# pcs property | grep stonith-watchdog-timeout
 stonith-watchdog-timeout: 10

3) disabled

[root@r8-node-01 ~]# pcs stonith sbd status
SBD STATUS
<node name>: <installed> | <enabled> | <running>
r8-node-03: YES |  NO |  NO
r8-node-01: YES |  NO |  NO
r8-node-02: YES |  NO |  NO

[root@r8-node-01 ~]# pcs property set stonith-watchdog-timeout=-1
Error: stonith-watchdog-timeout can only be unset or set to 0 while SBD is disabled
[root@r8-node-01 ~]# pcs property | grep stonith-watchdog-timeout
 stonith-watchdog-timeout: 0
[root@r8-node-01 ~]# pcs property set stonith-watchdog-timeout=
[root@r8-node-01 ~]# pcs property | grep stonith-watchdog-timeout

Comment 28 Klaus Wenninger 2022-11-02 09:51:53 UTC
*** Bug 1952140 has been marked as a duplicate of this bug. ***

Comment 30 errata-xmlrpc 2022-11-08 09:12:53 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: pcs security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7447

Comment 31 Red Hat Bugzilla 2023-09-18 00:26:08 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days