There are cluster properties managed automatically by pacemaker or pcs: cluster-infrastructure, cluster-name, dc-version, have-watchdog, last-lrm-refresh, stonith-watchdog-timeout. The 'pcs property' commands could check whether users are trying to modify those, print an error message saying they are managed automatically, and require --force to proceed with changing them.
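As a rough illustration of that idea, such a guard could look like the following sketch. This is not the actual pcs implementation; the function and constant names are made up, and the error wording is only an example.

# Illustrative sketch only -- not pcs code; names are hypothetical.
MANAGED_PROPERTIES = {
    "cluster-infrastructure",
    "cluster-name",
    "dc-version",
    "have-watchdog",
    "last-lrm-refresh",
    "stonith-watchdog-timeout",
}

def check_managed_property(name, force=False):
    """Refuse to change an automatically managed property unless forced."""
    if name in MANAGED_PROPERTIES and not force:
        raise SystemExit(
            f"Error: {name} is managed automatically by the cluster, "
            "use --force to override"
        )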
This approach would probably not work for watchdog-only sbd fencing where, on the contrary, the stonith-watchdog-timeout property needs to be set up manually so that watchdog fencing works properly.
Digging more into stonith-watchdog-timeout, we have also found issues when configuring the property with pcs for SBD watchdog fencing (no disks):

1. When setting the stonith-watchdog-timeout property, the value needs to exceed SBD_WATCHDOG_TIMEOUT. If it does not, the cluster ends up in a fencing loop on all nodes (if the cluster is enabled). 'pcs property set' should give an error and forbid setting such a value instead:

[root@virt-247 ~]# pcs stonith sbd config
SBD_DELAY_START=no
SBD_STARTMODE=always
SBD_WATCHDOG_TIMEOUT=5
Watchdogs:
  virt-247: /dev/watchdog
  virt-246: /dev/watchdog
  virt-248: /dev/watchdog
[root@virt-247 ~]# pcs property show --all | grep stonith-watchdog-timeout
stonith-watchdog-timeout: 0
[root@virt-247 ~]# pcs property set stonith-watchdog-timeout=3

Broadcast message from systemd-journald.lab.eng.brq.redhat.com (Fri 2021-04-30 11:59:31 CEST):
pacemaker-controld[7343]: emerg: Shutting down: stonith-watchdog-timeout (3) too short (must be >5000ms)

2. If the stonith-watchdog-timeout property is set to a negative number, pcs rejects the value with an error (--force needs to be used), even though it should be a supported configuration (see the sketch below):

# man pacemaker-controld
...
If `stonith-watchdog-timeout` is set to a negative value, and `SBD_WATCHDOG_TIMEOUT` is set, twice that value will be used.
...

[root@virt-023 ~]# pcs property set stonith-watchdog-timeout=-1
Error: invalid value of property: 'stonith-watchdog-timeout=-1', (use --force to override)
[root@virt-023 ~]# echo $?
1

The scope of the original bz is only to prevent misconfiguration for SBD with disks. So the question now is whether to use this bz for fixing the overall pcs validation of the stonith-watchdog-timeout property, or to create separate bzs for the individual issues. It depends which approach works better for pcs.
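For clarity, the negative-value semantics quoted above from the pacemaker-controld man page can be expressed roughly as the following sketch. This is not pacemaker code, just an illustration of the documented behavior; the function name is hypothetical.

# Rough illustration of the behavior described in the pacemaker-controld man page.
def effective_stonith_watchdog_timeout(stonith_watchdog_timeout, sbd_watchdog_timeout):
    """Return the timeout pacemaker effectively uses, in seconds."""
    if stonith_watchdog_timeout < 0 and sbd_watchdog_timeout is not None:
        # A negative value means "use twice SBD_WATCHDOG_TIMEOUT".
        return 2 * sbd_watchdog_timeout
    return stonith_watchdog_timeout

# e.g. with SBD_WATCHDOG_TIMEOUT=5, setting the property to -1 yields 10 seconds
assert effective_stonith_watchdog_timeout(-1, 5) == 10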
To summarize:
* When SBD is used without devices, stonith-watchdog-timeout must be set to a value greater than SBD_WATCHDOG_TIMEOUT. This cannot be done automatically by pcs, as the property must be set after the cluster is restarted. The restart is not done automatically by pcs and is left to the users so that it does not disrupt cluster operation.
* When SBD is used with devices, stonith-watchdog-timeout must not be set to a value greater than 0.

Action items: When the stonith-watchdog-timeout property is being set by a user, check whether SBD is used with or without devices (a rough sketch of this check follows below).
* If devices are used, prevent the property from being set if its value is not 0 or empty.
* If devices are not used, prevent the property from being set if its value is not greater than SBD_WATCHDOG_TIMEOUT.
* If SBD is not used at all, prevent the property from being set if its value is not 0 or empty.
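A minimal sketch of the intended validation, assuming the caller already knows the SBD state; the function, parameters, and messages are hypothetical and do not match the real pcs code.

# Illustrative sketch of the validation described above.
def validate_stonith_watchdog_timeout(value, sbd_enabled, sbd_has_devices,
                                      sbd_watchdog_timeout):
    """Return an error message, or None when the new value looks acceptable.

    `value` is the string passed to 'pcs property set'; an empty string
    means the property is being unset.
    """
    if not sbd_enabled or sbd_has_devices:
        # SBD disabled, or SBD enabled with devices: only unset or 0 is allowed.
        if value not in ("", "0"):
            return ("stonith-watchdog-timeout can only be unset or set to 0 "
                    "in this SBD configuration")
        return None
    # SBD enabled without devices: the value must exceed SBD_WATCHDOG_TIMEOUT.
    if value == "" or int(value) <= sbd_watchdog_timeout:
        return ("stonith-watchdog-timeout must be greater than the SBD "
                "watchdog timeout '%s'" % sbd_watchdog_timeout)
    return None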
Tomas, thanks for summing up, that is exactly what we think should be done.
Upstream patch: https://github.com/ClusterLabs/pcs/commit/f3561eabe69cd3584673040780900c589f64f3b4

Test: Using 'pcs property set stonith-watchdog-timeout=<value>', set stonith-watchdog-timeout to
* 0,
* a value greater than SBD_WATCHDOG_TIMEOUT,
* a value not greater than SBD_WATCHDOG_TIMEOUT,
* an empty value (unset).

Do it while:
* SBD is disabled,
* SBD is enabled with no devices,
* SBD is enabled with devices.

Verify that pcs returns an error and does not set the property in situations that would lead to fence loops or unreliable fencing.
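One possible way to iterate over the value matrix above on a cluster node that is already in the desired SBD mode (switching SBD modes still has to be done out of band). This is only a convenience sketch, not part of the official test plan, and it assumes SBD_WATCHDOG_TIMEOUT=5.

# Hypothetical helper: run 'pcs property set' with each test value and show the result.
import subprocess

VALUES = ["0", "10", "3", ""]  # 0, greater than, not greater than, unset

for value in VALUES:
    result = subprocess.run(
        ["pcs", "property", "set", f"stonith-watchdog-timeout={value}"],
        capture_output=True, text=True,
    )
    print(f"value={value!r} rc={result.returncode} stderr={result.stderr.strip()}")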
DevTestResults:

[root@r8-node-01 ~]# rpm -q pcs
pcs-0.10.13-1.el8.x86_64

1) enabled with devices

[root@r8-node-01 ~]# pcs stonith sbd config
SBD_DELAY_START=no
SBD_STARTMODE=always
SBD_WATCHDOG_TIMEOUT=5
Watchdogs:
  r8-node-01: /dev/watchdog
  r8-node-02: /dev/watchdog
  r8-node-03: /dev/watchdog
Devices:
  r8-node-01: "/dev/disk/by-id/scsi-3600140500e2fe60a3eb479bb39ca8d3d"
  r8-node-02: "/dev/disk/by-id/scsi-3600140500e2fe60a3eb479bb39ca8d3d"
  r8-node-03: "/dev/disk/by-id/scsi-3600140500e2fe60a3eb479bb39ca8d3d"
[root@r8-node-01 ~]# pcs property set stonith-watchdog-timeout=-1
Error: stonith-watchdog-timeout can only be unset or set to 0 while SBD is enabled with devices, use --force to override
[root@r8-node-01 ~]# pcs property set stonith-watchdog-timeout=0
[root@r8-node-01 ~]# pcs property | grep stonith-watchdog-timeout
stonith-watchdog-timeout: 0
[root@r8-node-01 ~]# pcs property set stonith-watchdog-timeout=
[root@r8-node-01 ~]# pcs property | grep stonith-watchdog-timeout

2) enabled without devices

[root@r8-node-01 ~]# pcs stonith sbd config
SBD_DELAY_START=no
SBD_STARTMODE=always
SBD_WATCHDOG_TIMEOUT=5
Watchdogs:
  r8-node-01: /dev/watchdog
  r8-node-02: /dev/watchdog
  r8-node-03: /dev/watchdog
[root@r8-node-01 ~]# pcs property set stonith-watchdog-timeout=3
Error: The stonith-watchdog-timeout must be greater than SBD watchdog timeout '5', entered '3', use --force to override
[root@r8-node-01 ~]# pcs property set stonith-watchdog-timeout=-1
Error: The stonith-watchdog-timeout must be greater than SBD watchdog timeout '5', entered '-1', use --force to override
[root@r8-node-01 ~]# pcs property set stonith-watchdog-timeout=10
[root@r8-node-01 ~]# pcs property | grep stonith-watchdog-timeout
stonith-watchdog-timeout: 10

3) disabled

[root@r8-node-01 ~]# pcs stonith sbd status
SBD STATUS
<node name>: <installed> | <enabled> | <running>
r8-node-03: YES | NO | NO
r8-node-01: YES | NO | NO
r8-node-02: YES | NO | NO
[root@r8-node-01 ~]# pcs property set stonith-watchdog-timeout=-1
Error: stonith-watchdog-timeout can only be unset or set to 0 while SBD is disabled
[root@r8-node-01 ~]# pcs property | grep stonith-watchdog-timeout
stonith-watchdog-timeout: 0
[root@r8-node-01 ~]# pcs property set stonith-watchdog-timeout=
[root@r8-node-01 ~]# pcs property | grep stonith-watchdog-timeout
*** Bug 1952140 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: pcs security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7447