Bug 2055935 - A stonith device added while stonith-enabled=false is not available to stonith_admin if it fails to start
Summary: A stonith device added while stonith-enabled=false is not available to stonith_admin if it fails to start
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: pacemaker
Version: 8.5
Hardware: All
OS: Linux
Priority: low
Severity: low
Target Milestone: rc
Target Release: 8.7
Assignee: Christine Caulfield
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-02-18 00:45 UTC by Reid Wahl
Modified: 2022-11-08 10:39 UTC
CC List: 5 users

Fixed In Version: pacemaker-2.1.4-1.el8
Doc Type: Bug Fix
Doc Text:
Cause: Pacemaker's fencer would not process fence device configuration changes while the stonith-enabled cluster property is false. Consequence: Manual fencing (executed via stonith_admin or the pcs stonith fence command), which is unaffected by stonith-enabled, could use an outdated fence device configuration. Fix: The fencer processes configuration changes even if stonith-enabled is false. Result: Manual fencing always uses the current device configuration.
Clone Of:
Environment:
Last Closed: 2022-11-08 09:42:25 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments: None


Links:
System                             ID               Last Updated
Red Hat Issue Tracker              RHELPLAN-112802  2022-02-18 00:46:27 UTC
Red Hat Knowledge Base (Solution)  6877901          2022-04-01 17:19:05 UTC
Red Hat Product Errata             RHBA-2022:7573   2022-11-08 09:42:42 UTC

Description Reid Wahl 2022-02-18 00:45:11 UTC
Description of problem:

This is a big-time edge case, and we don't even support clusters with stonith-enabled=false, so this would be more for upstream. This is interesting behavior though. I'm raising it as an RHBZ rather than CLBZ only because a customer encountered it while troubleshooting a stonith device.

If a stonith device is created while stonith-enabled=false, and that stonith device fails to start, then `pcs stonith fence`/stonith_admin doesn't find it when it looks for devices capable of fencing a node.

In contrast, stonith_admin **does** find it in the following scenarios:
  - the device starts successfully while stonith-enabled=false; or
  - the device is created while stonith-enabled=true, it immediately fails to start, and then stonith-enabled is set to false before calling stonith_admin.


Brief demo:

[root@fastvm-rhel-8-0-23 ~]# pcs stonith config
[root@fastvm-rhel-8-0-23 ~]# pcs property set stonith-enabled=false
[root@fastvm-rhel-8-0-23 ~]# pcs stonith create vmfence fence_vmware_soap ip=1.1.1.1 login=asdf passwd=asdf pcmk_host_map='node1:node1;node2:node2'

[root@fastvm-rhel-8-0-23 ~]# pcs stonith status
  * vmfence	(stonith:fence_vmware_soap):	 Stopped

[root@fastvm-rhel-8-0-23 ~]# crm_mon --one-shot -U all -I failures
Failed Resource Actions:
  * vmfence_start_0 on node1 'error' (1): call=57, status='complete', exitreason='', last-rc-change='2022-02-17 16:29:12 -08:00', queued=0ms, exec=1649ms

[root@fastvm-rhel-8-0-23 ~]# pcs stonith fence node2 & tail -f /var/log/messages
...
Feb 17 16:32:48 fastvm-rhel-8-0-23 pacemaker-fenced[1402]: notice: Client stonith_admin.409161 wants to fence (reboot) node2 using any device
Feb 17 16:32:48 fastvm-rhel-8-0-23 pacemaker-fenced[1402]: notice: Requesting peer fencing (reboot) targeting node2
Feb 17 16:32:48 fastvm-rhel-8-0-23 pacemaker-fenced[1402]: notice: Couldn't find anyone to fence (reboot) node2 using any device
Feb 17 16:32:48 fastvm-rhel-8-0-23 pacemaker-fenced[1402]: error: Operation 'reboot' targeting node2 by unknown node for stonith_admin.409161@node1: No such device
Feb 17 16:32:48 fastvm-rhel-8-0-23 pacemaker-controld[1406]: notice: Peer node2 was not terminated (reboot) by <anyone> on behalf of stonith_admin.409161: No such device


For brevity, I'm not including the demos of the two scenarios mentioned above where the device is found. I can demonstrate if needed.
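For reference, a rough sketch of the second "found" scenario (an untested outline that reuses the vmfence definition from the demo above; it is not the omitted demo itself):

pcs property set stonith-enabled=true
pcs stonith create vmfence fence_vmware_soap ip=1.1.1.1 login=asdf passwd=asdf \
    pcmk_host_map='node1:node1;node2:node2'
# vmfence fails its start immediately (bogus IP/credentials)
pcs property set stonith-enabled=false
# the fencer now finds vmfence as eligible for node2 (the reboot itself will
# still fail against the bogus IP)
pcs stonith fence node2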


There is a related issue that is arguably a bit worse: an old (possibly working) stonith device may appear in the list of available devices in this kind of scenario. In the demo below, the old working stonith device is xvm. I set stonith-enabled=false, deleted xvm, created vmfence, and ran pcs stonith fence. The fencer found xvm and used it to reboot node2.

[root@fastvm-rhel-8-0-23 ~]# pcs stonith config
[root@fastvm-rhel-8-0-23 ~]# pcs stonith create xvm fence_xvm pcmk_host_map='node1:fastvm-rhel-8.0-23;node2:fastvm-rhel-8.0-24'
[root@fastvm-rhel-8-0-23 ~]# pcs stonith status
  * xvm	(stonith:fence_xvm):	 Started node1

[root@fastvm-rhel-8-0-23 ~]# pcs property set stonith-enabled=false
[root@fastvm-rhel-8-0-23 ~]# pcs stonith delete xvm
Attempting to stop: xvm... Stopped

[root@fastvm-rhel-8-0-23 ~]# pcs stonith create vmfence fence_vmware_soap ip=1.1.1.1 login=asdf passwd=asdf pcmk_host_map='node1:node1;node2:node2'
[root@fastvm-rhel-8-0-23 ~]# pcs stonith status
  * vmfence	(stonith:fence_vmware_soap):	 Stopped
[root@fastvm-rhel-8-0-23 ~]# crm_mon --one-shot -U all -I failures
Failed Resource Actions:
  * vmfence_start_0 on node1 'error' (1): call=73, status='complete', exitreason='', last-rc-change='2022-02-17 16:36:34 -08:00', queued=0ms, exec=1599ms

[root@fastvm-rhel-8-0-23 ~]# pcs stonith fence node2 & tail -f /var/log/messages 
[1] 409343
...
Feb 17 16:36:45 fastvm-rhel-8-0-23 pacemaker-fenced[1402]: notice: Client stonith_admin.409345 wants to fence (reboot) node2 using any device
Feb 17 16:36:45 fastvm-rhel-8-0-23 pacemaker-fenced[1402]: notice: Requesting peer fencing (reboot) targeting node2
Feb 17 16:36:45 fastvm-rhel-8-0-23 pacemaker-fenced[1402]: notice: xvm is eligible to fence (reboot) node2 (aka. 'fastvm-rhel-8.0-24'): static-list
Feb 17 16:36:45 fastvm-rhel-8-0-23 pacemaker-fenced[1402]: notice: Requesting that node1 perform 'reboot' action targeting node2
Feb 17 16:36:45 fastvm-rhel-8-0-23 pacemaker-fenced[1402]: notice: xvm is eligible to fence (reboot) node2 (aka. 'fastvm-rhel-8.0-24'): static-list
Node: node2 fenced
Feb 17 16:36:47 fastvm-rhel-8-0-23 fence_xvm[409346]: Domain "fastvm-rhel-8.0-24" is ON
Feb 17 16:36:47 fastvm-rhel-8-0-23 pacemaker-fenced[1402]: notice: Operation 'reboot' [409346] (call 2 from stonith_admin.409345) targeting node2 using xvm returned 0 (OK)
Feb 17 16:36:47 fastvm-rhel-8-0-23 pacemaker-fenced[1402]: notice: Operation 'reboot' targeting node2 by node1 for stonith_admin.409345@node1: OK
Feb 17 16:36:47 fastvm-rhel-8-0-23 pacemaker-controld[1406]: notice: Peer node2 was terminated (reboot) by node1 on behalf of stonith_admin.409345: OK

-----

Version-Release number of selected component (if applicable):

pacemaker-2.1.0-8.el8

-----

How reproducible:

Always

-----

Steps to Reproduce:

1. Start with no stonith devices configured and stonith-enabled=true.
2. `pcs property set stonith-enabled=false`
3. Create a stonith device that's configured so that it will fail to start, and with pcmk_host_map or pcmk_host_list configured so that it's capable of fencing node2.
4. `pcs stonith fence node2`
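
As a concrete illustration of the steps above, the demo in the description boils down to roughly the following (the fence_vmware_soap arguments are the bogus values used there, chosen only so that the start in step 3 is guaranteed to fail):

# step 2
pcs property set stonith-enabled=false
# step 3: bogus IP/credentials make the start fail; pcmk_host_map still covers node2
pcs stonith create vmfence fence_vmware_soap ip=1.1.1.1 login=asdf passwd=asdf \
    pcmk_host_map='node1:node1;node2:node2'
pcs stonith status            # vmfence shows as Stopped after the failed start
# step 4: before the fix this fails with "No such device"
pcs stonith fence node2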

-----

Actual results:

The stonith device is not found by the fencer as capable of fencing node2.

-----

Expected results:

The stonith device is found by the fencer as capable of fencing node2.

-----

Additional info:

The customer who encountered this is on RHEL 7, but there's no reason to consider fixing it there.

Comment 1 Ken Gaillot 2022-03-04 17:45:48 UTC
I do think the second issue is the more important one, but both need to be taken care of. For the second issue, the fencer explicitly ignores CIB updates when stonith-enabled is false, which is probably not a good idea for this reason.

Comment 2 Ken Gaillot 2022-05-09 16:09:31 UTC
QE: The two scenarios here both involve manual fencing (via the stonith_admin command or the pcs stonith fence command).

1. If a stonith device is created while stonith-enabled=false, and that stonith device fails to start, then manual fencing doesn't find the device when it looks for devices capable of fencing a node.

2. Manual fencing might use a deleted stonith device if the device was deleted while stonith-enabled=false.
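
The first scenario can be checked with the reproduction steps in the description. For the second, a minimal sketch (assuming a working device named xvm, as in the description's second demo; stonith_admin --list-registered asks the fencer which devices it currently has registered):

# xvm is configured and started, then:
pcs property set stonith-enabled=false
pcs stonith delete xvm
stonith_admin --list-registered   # on a fixed build, xvm should no longer appear
pcs stonith fence node2           # must not be carried out by the deleted xvm device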

Comment 3 Ken Gaillot 2022-06-09 14:12:02 UTC
Fixed in upstream main branch as of commit c600ef4
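
For anyone who wants to look at the change itself, the commit can be inspected in the upstream repository (assuming the usual ClusterLabs GitHub location for Pacemaker):

git clone https://github.com/ClusterLabs/pacemaker.git
cd pacemaker
git show c600ef4   # abbreviated hash from the comment above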

Comment 8 Markéta Smazová 2022-07-18 13:49:15 UTC
Verified as SanityOnly in pacemaker-2.1.4-1.el8

Comment 12 errata-xmlrpc 2022-11-08 09:42:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:7573

