Bug 1954099 - Prevent fence_sbd in combination with stonith-watchdog-timeout>0
Summary: Prevent fence_sbd in combination with stonith-watchdog-timeout>0
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: pcs
Version: 8.4
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: beta
Target Release: 8.7
Assignee: Tomas Jelinek
QA Contact: cluster-qe@redhat.com
Docs Contact: Steven J. Levine
URL:
Whiteboard:
Duplicates: 1952140 (view as bug list)
Depends On:
Blocks:
 
Reported: 2021-04-27 15:16 UTC by Nina Hostakova
Modified: 2023-09-18 00:26 UTC
CC List: 11 users

Fixed In Version: pcs-0.10.13-1.el8
Doc Type: Bug Fix
Doc Text:
.`pcs` now validates the value of `stonith-watchdog-timeout`
Previously, it was possible to set the `stonith-watchdog-timeout` property to a value that is incompatible with the SBD configuration. This could result in a fence loop, or could cause the cluster to consider a fencing action successful even when the action had not finished. With this fix, `pcs` validates the value of the `stonith-watchdog-timeout` property when you set it, to prevent incorrect configuration.
Clone Of:
: 2058246 (view as bug list)
Environment:
Last Closed: 2022-11-08 09:12:53 UTC
Type: Bug
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Article) 2941601 0 None None None 2022-11-09 17:52:46 UTC
Red Hat Product Errata RHSA-2022:7447 0 None None None 2022-11-08 09:13:11 UTC

Comment 2 Tomas Jelinek 2021-04-29 08:23:59 UTC
There are cluster properties managed automatically by pacemaker or pcs: cluster-infrastructure, cluster-name, dc-version, have-watchdog, last-lrm-refresh, stonith-watchdog-timeout. The 'pcs property' commands could check whether users are trying to modify those, print an error message saying that they are managed automatically, and require --force to proceed with changing them.
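
A rough sketch of that idea in Python (the constant and function names below are hypothetical and only illustrate the proposal in this comment; they are not the actual pcs code):

# Hypothetical sketch of the proposed check; not the actual pcs implementation.
AUTO_MANAGED_PROPERTIES = frozenset([
    "cluster-infrastructure", "cluster-name", "dc-version",
    "have-watchdog", "last-lrm-refresh", "stonith-watchdog-timeout",
])

def check_auto_managed_properties(properties_to_set, force=False):
    # Reject changes to automatically managed properties unless forced.
    managed = sorted(AUTO_MANAGED_PROPERTIES.intersection(properties_to_set))
    if managed and not force:
        raise SystemExit(
            "Error: these properties are managed automatically: "
            + ", ".join(managed) + ", use --force to override")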

Comment 3 Nina Hostakova 2021-04-29 12:30:16 UTC
This approach would probably not work for watchdog-only SBD fencing where, on the contrary, the stonith-watchdog-timeout property needs to be set manually for watchdog fencing to work properly.

Comment 4 Nina Hostakova 2021-04-30 12:18:16 UTC
Digging deeper into stonith-watchdog-timeout, we have also found issues when configuring the property with pcs for SBD watchdog fencing (no disks):


1. When setting the stonith-watchdog-timeout property, the value needs to exceed SBD_WATCHDOG_TIMEOUT. If it doesn't, the cluster will end up in a fencing loop on all nodes (if the cluster is enabled). 'pcs property set' should instead give an error and refuse to set it.

[root@virt-247 ~]# pcs stonith sbd config
SBD_DELAY_START=no
SBD_STARTMODE=always
SBD_WATCHDOG_TIMEOUT=5

Watchdogs:
  virt-247: /dev/watchdog
  virt-246: /dev/watchdog
  virt-248: /dev/watchdog

[root@virt-247 ~]# pcs property show --all | grep stonith-watchdog-timeout
 stonith-watchdog-timeout: 0

[root@virt-247 ~]# pcs property set stonith-watchdog-timeout=3

Broadcast message from systemd-journald.lab.eng.brq.redhat.com (Fri 2021-04-30 11:59:31 CEST):

pacemaker-controld[7343]:  emerg: Shutting down: stonith-watchdog-timeout (3) too short (must be >5000ms)


2. If the stonith-watchdog-timeout property is set to a negative number, pcs rejects the value with an error (--force needs to be used), even though it should be a supported configuration:

# man pacemaker-controld
...
If `stonith-watchdog-timeout` is set to a negative value, and
`SBD_WATCHDOG_TIMEOUT` is set, twice that value will be used.
...
[root@virt-023 ~]# pcs property set stonith-watchdog-timeout=-1
Error: invalid value of property: 'stonith-watchdog-timeout=-1', (use --force to override)
[root@virt-023 ~]# echo $?
1
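
For illustration: with SBD_WATCHDOG_TIMEOUT=5 as in the configuration above, stonith-watchdog-timeout=-1 would, according to the pacemaker-controld excerpt, translate to an effective timeout of 2 * 5 = 10 seconds, so pcs is rejecting a value that pacemaker itself documents as supported.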


The scope of the original bz is only to prevent misconfiguration for SBD with disks. So now the question is whether to use this bz for fixing the overall pcs validation of the stonith-watchdog-timeout property, or to create separate bzs for the individual issues. It depends on which approach works better for pcs.

Comment 6 Tomas Jelinek 2021-05-06 09:15:20 UTC
To summarize:
* When SBD is used without devices, stonith-watchdog-timeout must be set to a value greater than SBD_WATCHDOG_TIMEOUT. This cannot be done automatically by pcs, because the property must be set after the cluster has been restarted, and the restart is not done automatically by pcs but is left to the users so that it does not disrupt cluster operation.
* When SBD is used with devices, stonith-watchdog-timeout must not be set to a value greater than 0.

Action items:
When the stonith-watchdog-timeout property is being set by a user, check whether SBD is used with or without devices. If devices are used, prevent the property from being set unless its value is 0 or empty. If devices are not used, prevent the property from being set unless its value is greater than SBD_WATCHDOG_TIMEOUT. If SBD is not used at all, prevent the property from being set unless its value is 0 or empty.
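
A minimal sketch of these action items in Python (the function and argument names, and the exact messages, are illustrative only and not the actual pcs code; the value is assumed to be an empty string or an integer number of seconds):

def check_stonith_watchdog_timeout(value, sbd_enabled, sbd_has_devices,
                                   sbd_watchdog_timeout):
    # Return an error message, or None when setting the value is allowed.
    unset_or_zero = value in ("", "0")
    if sbd_enabled and not sbd_has_devices:
        # Watchdog-only fencing: the value must exceed SBD_WATCHDOG_TIMEOUT,
        # otherwise the cluster can end up in a fence loop (see comment 4).
        if unset_or_zero or int(value) <= sbd_watchdog_timeout:
            return (
                "stonith-watchdog-timeout must be greater than "
                "SBD_WATCHDOG_TIMEOUT ({0}), got '{1}'".format(
                    sbd_watchdog_timeout, value))
        return None
    # SBD disabled, or SBD enabled with devices: only unset or 0 is safe,
    # otherwise fencing may be considered finished before it actually is.
    if not unset_or_zero:
        return "stonith-watchdog-timeout can only be unset or set to 0"
    return None

This mirrors the behaviour later verified in comment 14: with devices only 0 or an empty value is accepted, without devices only values greater than SBD_WATCHDOG_TIMEOUT are accepted, and with SBD disabled only 0 or an empty value is accepted.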

Comment 7 Nina Hostakova 2021-05-06 11:39:58 UTC
Tomas, thanks for summing up, that is exactly what we think should be done.

Comment 12 Tomas Jelinek 2022-03-28 12:04:02 UTC
Upstream patch: https://github.com/ClusterLabs/pcs/commit/f3561eabe69cd3584673040780900c589f64f3b4

Test:
Using 'pcs property set stonith-watchdog-timeout=<value>', set stonith-watchdog-timeout to
* 0,
* a value greater than SBD_WATCHDOG_TIMEOUT,
* a value not greater than SBD_WATCHDOG_TIMEOUT,
* an empty value (unset).
Do it while:
* SBD is disabled,
* SBD is enabled with no devices,
* SBD is enabled with devices.
Verify that pcs returns an error and does not set the property in situations which would lead to fence loops or unreliable fencing.
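
A small, purely illustrative driver for the value axis of that matrix (not part of the actual test suite; it only runs the 'pcs property set' command quoted above and reports the exit status, assuming SBD_WATCHDOG_TIMEOUT=5 so that 10 and 3 fall on either side of it):

import subprocess

VALUES = ["0", "10", "3", ""]  # 10 > 5 > 3; an empty value unsets the property

for value in VALUES:
    result = subprocess.run(
        ["pcs", "property", "set", "stonith-watchdog-timeout={0}".format(value)],
        capture_output=True, text=True,
    )
    status = "accepted" if result.returncode == 0 else "rejected"
    print("stonith-watchdog-timeout='{0}': {1}".format(value, status))
    if result.returncode != 0:
        print("  " + result.stderr.strip())

Re-run the loop once in each of the three SBD configurations and compare the results against the rules from comment 6.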

Comment 14 Miroslav Lisik 2022-05-26 08:42:04 UTC
DevTestResults:

[root@r8-node-01 ~]# rpm -q pcs
pcs-0.10.13-1.el8.x86_64

1) enabled with devices

[root@r8-node-01 ~]# pcs stonith sbd config
SBD_DELAY_START=no
SBD_STARTMODE=always
SBD_WATCHDOG_TIMEOUT=5

Watchdogs:
  r8-node-01: /dev/watchdog
  r8-node-02: /dev/watchdog
  r8-node-03: /dev/watchdog

Devices:
  r8-node-01: "/dev/disk/by-id/scsi-3600140500e2fe60a3eb479bb39ca8d3d"
  r8-node-02: "/dev/disk/by-id/scsi-3600140500e2fe60a3eb479bb39ca8d3d"
  r8-node-03: "/dev/disk/by-id/scsi-3600140500e2fe60a3eb479bb39ca8d3d"

[root@r8-node-01 ~]# pcs property set stonith-watchdog-timeout=-1
Error: stonith-watchdog-timeout can only be unset or set to 0 while SBD is enabled with devices, use --force to override
[root@r8-node-01 ~]# pcs property set stonith-watchdog-timeout=0
[root@r8-node-01 ~]# pcs property | grep stonith-watchdog-timeout
 stonith-watchdog-timeout: 0
[root@r8-node-01 ~]# pcs property set stonith-watchdog-timeout=
[root@r8-node-01 ~]# pcs property | grep stonith-watchdog-timeout

2) enabled without devices

[root@r8-node-01 ~]# pcs stonith sbd config
SBD_DELAY_START=no
SBD_STARTMODE=always
SBD_WATCHDOG_TIMEOUT=5

Watchdogs:
  r8-node-01: /dev/watchdog
  r8-node-02: /dev/watchdog
  r8-node-03: /dev/watchdog

[root@r8-node-01 ~]# pcs property set stonith-watchdog-timeout=3
Error: The stonith-watchdog-timeout must be greater than SBD watchdog timeout '5', entered '3', use --force to override
[root@r8-node-01 ~]# pcs property set stonith-watchdog-timeout=-1
Error: The stonith-watchdog-timeout must be greater than SBD watchdog timeout '5', entered '-1', use --force to override
[root@r8-node-01 ~]# pcs property set stonith-watchdog-timeout=10
[root@r8-node-01 ~]# pcs property | grep stonith-watchdog-timeout
 stonith-watchdog-timeout: 10

3) disabled

[root@r8-node-01 ~]# pcs stonith sbd status
SBD STATUS
<node name>: <installed> | <enabled> | <running>
r8-node-03: YES |  NO |  NO
r8-node-01: YES |  NO |  NO
r8-node-02: YES |  NO |  NO

[root@r8-node-01 ~]# pcs property set stonith-watchdog-timeout=-1
Error: stonith-watchdog-timeout can only be unset or set to 0 while SBD is disabled
[root@r8-node-01 ~]# pcs property | grep stonith-watchdog-timeout
 stonith-watchdog-timeout: 0
[root@r8-node-01 ~]# pcs property set stonith-watchdog-timeout=
[root@r8-node-01 ~]# pcs property | grep stonith-watchdog-timeout

Comment 28 Klaus Wenninger 2022-11-02 09:51:53 UTC
*** Bug 1952140 has been marked as a duplicate of this bug. ***

Comment 30 errata-xmlrpc 2022-11-08 09:12:53 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: pcs security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7447

Comment 31 Red Hat Bugzilla 2023-09-18 00:26:08 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

