Bug 2055935

Summary: A stonith device added while stonith-enabled=false is not available to stonith_admin if it fails to start
Product: Red Hat Enterprise Linux 8
Reporter: Reid Wahl <nwahl>
Component: pacemaker
Assignee: Christine Caulfield <ccaulfie>
Status: CLOSED ERRATA
QA Contact: cluster-qe <cluster-qe>
Severity: low
Priority: low
Version: 8.5
CC: cluster-maint, kgaillot, msmazova, sbradley, slevine
Target Milestone: rc
Keywords: Triaged
Target Release: 8.7
Hardware: All
OS: Linux
Fixed In Version: pacemaker-2.1.4-1.el8
Doc Type: Bug Fix
Doc Text:
Cause: Pacemaker's fencer would not process fence device configuration changes while the stonith-enabled cluster property is false.
Consequence: Manual fencing (executed via stonith_admin or the pcs stonith fence command), which is unaffected by stonith-enabled, could use an outdated fence device configuration.
Fix: The fencer processes configuration changes even if stonith-enabled is false.
Result: Manual fencing always uses the current device configuration.
Last Closed: 2022-11-08 09:42:25 UTC
Type: Bug

Description Reid Wahl 2022-02-18 00:45:11 UTC
Description of problem:

This is a big-time edge case, and we don't even support clusters with stonith-enabled=false, so this would be more for upstream. This is interesting behavior though. I'm raising it as an RHBZ rather than CLBZ only because a customer encountered it while troubleshooting a stonith device.

If a stonith device is created while stonith-enabled=false, and that stonith device fails to start, then `pcs stonith fence`/stonith_admin doesn't find it when it looks for devices capable of fencing a node.

In contrast, stonith_admin **does** find it in the following scenarios:
  - the device starts successfully while stonith-enabled=false; or
  - the device is created while stonith-enabled=true, it immediately fails to start, and then stonith-enabled is set to false before calling stonith_admin.


Brief demo:

[root@fastvm-rhel-8-0-23 ~]# pcs stonith config
[root@fastvm-rhel-8-0-23 ~]# pcs property set stonith-enabled=false
[root@fastvm-rhel-8-0-23 ~]# pcs stonith create vmfence fence_vmware_soap ip=1.1.1.1 login=asdf passwd=asdf pcmk_host_map='node1:node1;node2:node2'

[root@fastvm-rhel-8-0-23 ~]# pcs stonith status
  * vmfence	(stonith:fence_vmware_soap):	 Stopped

[root@fastvm-rhel-8-0-23 ~]# crm_mon --one-shot -U all -I failures
Failed Resource Actions:
  * vmfence_start_0 on node1 'error' (1): call=57, status='complete', exitreason='', last-rc-change='2022-02-17 16:29:12 -08:00', queued=0ms, exec=1649ms

[root@fastvm-rhel-8-0-23 ~]# pcs stonith fence node2 & tail -f /var/log/messages
...
Feb 17 16:32:48 fastvm-rhel-8-0-23 pacemaker-fenced[1402]: notice: Client stonith_admin.409161 wants to fence (reboot) node2 using any device
Feb 17 16:32:48 fastvm-rhel-8-0-23 pacemaker-fenced[1402]: notice: Requesting peer fencing (reboot) targeting node2
Feb 17 16:32:48 fastvm-rhel-8-0-23 pacemaker-fenced[1402]: notice: Couldn't find anyone to fence (reboot) node2 using any device
Feb 17 16:32:48 fastvm-rhel-8-0-23 pacemaker-fenced[1402]: error: Operation 'reboot' targeting node2 by unknown node for stonith_admin.409161@node1: No such device
Feb 17 16:32:48 fastvm-rhel-8-0-23 pacemaker-controld[1406]: notice: Peer node2 was not terminated (reboot) by <anyone> on behalf of stonith_admin.409161: No such device


For brevity, I'm not including the demos of the two scenarios mentioned above where the device is found. I can demonstrate if needed.
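
For reference, those two working scenarios boil down to command sequences like the following (a sketch only, reusing the placeholder device parameters from the demos in this report; output omitted):

    # Scenario A: the device starts successfully while stonith-enabled=false
    pcs property set stonith-enabled=false
    pcs stonith create xvm fence_xvm pcmk_host_map='node1:fastvm-rhel-8.0-23;node2:fastvm-rhel-8.0-24'
    pcs stonith fence node2    # the fencer finds xvm

    # Scenario B: the device is created (and fails to start) while stonith-enabled=true,
    # and stonith-enabled is only set to false afterward
    pcs property set stonith-enabled=true
    pcs stonith create vmfence fence_vmware_soap ip=1.1.1.1 login=asdf passwd=asdf pcmk_host_map='node1:node1;node2:node2'
    pcs property set stonith-enabled=false
    pcs stonith fence node2    # the fencer finds vmfence despite the failed start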


There is a related issue that is arguably a bit worse: a stonith device that was deleted while stonith-enabled=false can still appear in the fencer's list of available devices. In the demo below, the old working stonith device is xvm. I set stonith-enabled=false, deleted xvm, created vmfence, and ran pcs stonith fence. The fencer found xvm and used it to reboot node2.

[root@fastvm-rhel-8-0-23 ~]# pcs stonith config
[root@fastvm-rhel-8-0-23 ~]# pcs stonith create xvm fence_xvm pcmk_host_map='node1:fastvm-rhel-8.0-23;node2:fastvm-rhel-8.0-24'
[root@fastvm-rhel-8-0-23 ~]# pcs stonith status
  * xvm	(stonith:fence_xvm):	 Started node1

[root@fastvm-rhel-8-0-23 ~]# pcs property set stonith-enabled=false
[root@fastvm-rhel-8-0-23 ~]# pcs stonith delete xvm
Attempting to stop: xvm... Stopped

[root@fastvm-rhel-8-0-23 ~]# pcs stonith create vmfence fence_vmware_soap ip=1.1.1.1 login=asdf passwd=asdf pcmk_host_map='node1:node1;node2:node2'
[root@fastvm-rhel-8-0-23 ~]# pcs stonith status
  * vmfence	(stonith:fence_vmware_soap):	 Stopped
[root@fastvm-rhel-8-0-23 ~]# crm_mon --one-shot -U all -I failures
Failed Resource Actions:
  * vmfence_start_0 on node1 'error' (1): call=73, status='complete', exitreason='', last-rc-change='2022-02-17 16:36:34 -08:00', queued=0ms, exec=1599ms

[root@fastvm-rhel-8-0-23 ~]# pcs stonith fence node2 & tail -f /var/log/messages 
[1] 409343
...
Feb 17 16:36:45 fastvm-rhel-8-0-23 pacemaker-fenced[1402]: notice: Client stonith_admin.409345 wants to fence (reboot) node2 using any device
Feb 17 16:36:45 fastvm-rhel-8-0-23 pacemaker-fenced[1402]: notice: Requesting peer fencing (reboot) targeting node2
Feb 17 16:36:45 fastvm-rhel-8-0-23 pacemaker-fenced[1402]: notice: xvm is eligible to fence (reboot) node2 (aka. 'fastvm-rhel-8.0-24'): static-list
Feb 17 16:36:45 fastvm-rhel-8-0-23 pacemaker-fenced[1402]: notice: Requesting that node1 perform 'reboot' action targeting node2
Feb 17 16:36:45 fastvm-rhel-8-0-23 pacemaker-fenced[1402]: notice: xvm is eligible to fence (reboot) node2 (aka. 'fastvm-rhel-8.0-24'): static-list
Node: node2 fenced
Feb 17 16:36:47 fastvm-rhel-8-0-23 fence_xvm[409346]: Domain "fastvm-rhel-8.0-24" is ON
Feb 17 16:36:47 fastvm-rhel-8-0-23 pacemaker-fenced[1402]: notice: Operation 'reboot' [409346] (call 2 from stonith_admin.409345) targeting node2 using xvm returned 0 (OK)
Feb 17 16:36:47 fastvm-rhel-8-0-23 pacemaker-fenced[1402]: notice: Operation 'reboot' targeting node2 by node1 for stonith_admin.409345@node1: OK
Feb 17 16:36:47 fastvm-rhel-8-0-23 pacemaker-controld[1406]: notice: Peer node2 was terminated (reboot) by node1 on behalf of stonith_admin.409345: OK

-----

Version-Release number of selected component (if applicable):

pacemaker-2.1.0-8.el8

-----

How reproducible:

Always

-----

Steps to Reproduce:

1. Start with no stonith devices configured and stonith-enabled=true.
2. `pcs property set stonith-enabled=false`
3. Create a stonith device configured so that it will fail to start, with pcmk_host_map or pcmk_host_list set so that it is capable of fencing node2 (see the command sketch after this list).
4. `pcs stonith fence node2`
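
As a concrete command sequence for steps 2-4 (a sketch; the fence_vmware_soap arguments are the same unreachable placeholders used in the demo above, chosen so that the start fails):

    pcs property set stonith-enabled=false
    pcs stonith create vmfence fence_vmware_soap ip=1.1.1.1 login=asdf passwd=asdf pcmk_host_map='node1:node1;node2:node2'
    pcs stonith fence node2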

-----

Actual results:

The stonith device is not found by the fencer as capable of fencing node2.

-----

Expected results:

The stonith device is found by the fencer as capable of fencing node2.

-----

Additional info:

The customer who encountered this is on RHEL 7, but there's no reason to consider fixing it there.

Comment 1 Ken Gaillot 2022-03-04 17:45:48 UTC
I do think the second issue is the more important one, but both need to be taken care of. For the second issue, the fencer explicitly ignores CIB updates when stonith-enabled is false, which is probably not a good idea for this reason.
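
One way to observe the stale device table directly (a sketch; assumes the cluster state from the second demo, with the commands run on the node that had been running xvm) is to ask the fencer itself which devices it has registered, independently of what the CIB now says:

    pcs property set stonith-enabled=false
    pcs stonith delete xvm
    stonith_admin --list-registered    # on an unfixed fencer, xvm should still be listed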

Comment 2 Ken Gaillot 2022-05-09 16:09:31 UTC
QE: The two scenarios here both involve manual fencing (via the stonith_admin command or the pcs stonith fence command).

1. If a stonith device is created while stonith-enabled=false, and that stonith device fails to start, then manual fencing doesn't find the device when it looks for devices capable of fencing a node.

2. Manual fencing might use a deleted stonith device if the device was deleted while stonith-enabled=false.
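
A minimal verification sketch for scenario 2 (commands only, reusing the fence_xvm parameters from the demos above; on a fixed build the manual fence must not select the deleted xvm):

    pcs stonith create xvm fence_xvm pcmk_host_map='node1:fastvm-rhel-8.0-23;node2:fastvm-rhel-8.0-24'
    pcs property set stonith-enabled=false
    pcs stonith delete xvm
    pcs stonith fence node2    # should now fail with 'No such device' instead of rebooting node2 via xvm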

Comment 3 Ken Gaillot 2022-06-09 14:12:02 UTC
Fixed in upstream main branch as of commit c600ef4

Comment 8 Markéta Smazová 2022-07-18 13:49:15 UTC
Verified as SanityOnly in pacemaker-2.1.4-1.el8

Comment 12 errata-xmlrpc 2022-11-08 09:42:25 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:7573