Bug 1978013

Summary: Pacemaker can select wrong fence device when pcmk_host_map and dynamic-list are combined
Product: Red Hat Enterprise Linux 9
Reporter: Ken Gaillot <kgaillot>
Component: pacemaker
Assignee: Ken Gaillot <kgaillot>
Status: CLOSED CURRENTRELEASE
QA Contact: cluster-qe <cluster-qe>
Severity: medium
Docs Contact:
Priority: high
Version: 9.0
CC: cluster-maint, cluster-qe, msmazova, phagara
Target Milestone: rc
Keywords: Triaged
Target Release: 9.0 Beta
Flags: pm-rhel: mirror+
Hardware: All
OS: All
Whiteboard:
Fixed In Version: pacemaker-2.1.0-6.el9
Doc Type: Bug Fix
Doc Text:
Cause: If a fence device configured with pcmk_host_check="dynamic-list" failed its list action, and also had a pcmk_host_map configured, Pacemaker would wrongly assume the device could fence all the nodes listed in the host map.
Consequence: Pacemaker might wrongly select the device to fence one of the nodes in the host map that it couldn't actually fence.
Fix: Pacemaker now does not assume a fence device that fails its list action can fence any hosts.
Result: The proper device will be chosen for a node that requires fencing.
Story Points: ---
Clone Of: 1978010
Environment:
Last Closed: 2021-12-07 21:57:54 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1978010
Bug Blocks:

Description Ken Gaillot 2021-06-30 22:42:29 UTC
+++ This bug was initially created as a clone of Bug #1978010 +++

Description of problem: If a fencing device is configured with pcmk_host_check set to "dynamic-list", and a pcmk_host_map option, then Pacemaker may wrongly select the device to fence a target in the host map if the device's list action fails.


Version-Release number of selected component (if applicable): all


How reproducible: See below


Steps to Reproduce:
1. Modify a fence agent so that its list, off, and reboot actions always fail (its status action should succeed).
2. Configure a cluster of at least 2 nodes.
3. Configure a standard fence device able to target one of the nodes (no topology). This simulates the scenario where this is the only device capable of fencing the node, so Pacemaker should select this device if the node needs fencing.
4. Remove any monitor operation for the standard fence device, and configure a location constraint preferring the target to run the device. This is a trick to make the device less preferred when more than one device is eligible (because there is no successful monitor, and it is available only from the target).
5. Configure a fence device using the modified fencing agent, pcmk_host_check="dynamic-list", and a pcmk_host_map with entries for all nodes (the alias names won't matter since the agent's list action will always fail). This simulates the scenario where pcmk_host_map includes at least one node the device can't fence (which is realistic, since the intent of dynamic-list is that the device may sometimes be able to fence a node and sometimes not). The idea is that if the list action did succeed, it would output only the alias of the node that doesn't use the standard fence device.
6. Cause fencing to be required for the node with the standard fence device.
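
As a rough illustration of step 1, the sketch below models only the action dispatch of such a modified agent. It assumes the usual fence agent convention of receiving options as key=value lines on standard input. The names FAILING_ACTIONS, parse_options, and run_action are hypothetical and come from no real agent; a usable agent would also need a metadata action and real device handling.

```python
import sys

# Hypothetical set of actions that this test agent always fails (step 1).
FAILING_ACTIONS = {"list", "off", "reboot"}


def parse_options(lines):
    """Parse key=value option lines, as a fence agent receives them on stdin."""
    options = {}
    for line in lines:
        line = line.strip()
        # Skip blank lines, comments, and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        options[key.strip()] = value.strip()
    return options


def run_action(options):
    """Return the exit status this sketch reports for the requested action."""
    action = options.get("action")
    if action in FAILING_ACTIONS:
        print(f"simulated failure of action '{action}'", file=sys.stderr)
        return 1
    # Everything else, including status, is treated as success here.
    return 0
```

A wrapper script could call sys.exit(run_action(parse_options(sys.stdin))) to turn this into an executable stand-in for the agent.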

Actual results: When the modified agent's list action fails, Pacemaker wrongly assumes the device can fence every node in pcmk_host_map, and selects it for fencing, which fails.

Expected results: Pacemaker always chooses the standard fencing device for the node that can only be fenced by that device.
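
The difference between the actual and expected results can be modeled in a few lines. This is a simplified sketch, not Pacemaker's implementation: the function fenceable_nodes and its parameters are hypothetical, the host map is represented as a plain dict, and nodes that are absent from the host map are ignored.

```python
def fenceable_nodes(host_map, list_output, assume_map_on_failure):
    """Return the nodes a dynamic-list device is treated as able to fence.

    host_map: dict of node name -> alias (modeled on pcmk_host_map)
    list_output: set of aliases printed by the list action, or None if it failed
    assume_map_on_failure: True models the reported bug, False models the fix
    """
    if list_output is None:
        # Reported bug: a failed list action fell back to every node in the map.
        # Fixed behavior: a failed list action means no node is assumed fenceable.
        return set(host_map) if assume_map_on_failure else set()
    # Successful list action: only nodes whose alias was listed are fenceable.
    return {node for node, alias in host_map.items() if alias in list_output}
```

With a two-node host map and a failed list action (list_output=None), the first mode returns both nodes and the second returns an empty set, so only the standard device remains eligible.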

--- Additional comment from Ken Gaillot on 2021-06-30 22:34:56 UTC ---

This was fixed in the upstream master branch by commit a29f88f

Comment 4 Patrik Hagara 2021-08-24 18:03:15 UTC
before fix
==========

See https://bugzilla.redhat.com/show_bug.cgi?id=1978010#c5


after fix
=========

> [root@virt-513 ~]# rpm -q pacemaker
> pacemaker-2.1.0-11.el9.x86_64


Starting with:
* a 2-node cluster
* dummy fence agent installed on both nodes as /usr/sbin/fence_bz1978013: https://github.com/ClusterLabs/fence-agents/blob/master/agents/dummy/fence_dummy.py
* per-node real fence device configured

> [root@virt-513 ~]# pcs status
> Cluster name: STSRHTS14461
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: virt-514 (version 2.1.0-11.el9-7c3f660707) - partition with quorum
>   * Last updated: Tue Aug 24 19:55:16 2021
>   * Last change:  Tue Aug 24 15:44:18 2021 by root via cibadmin on virt-513
>   * 2 nodes configured
>   * 2 resource instances configured
> 
> Node List:
>   * Online: [ virt-513 virt-514 ]
> 
> Full List of Resources:
>   * fence-virt-513	(stonith:fence_xvm):	 Started virt-513
>   * fence-virt-514	(stonith:fence_xvm):	 Started virt-514
> 
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled


Remove the monitor operation from the second node's real fence device:

> [root@virt-513 ~]# pcs cluster cib scope=resources cib.xml
> [root@virt-513 ~]# cp cib.xml cib-updated.xml
> [root@virt-513 ~]# vim cib-updated.xml
> [root@virt-513 ~]# diff cib.xml cib-updated.xml
> 19,21d18
> <     <operations>
> <       <op name="monitor" interval="60s" id="fence-virt-514-monitor-interval-60s"/>
> <     </operations>
> [root@virt-513 ~]# pcs cluster cib-push scope=resources cib-updated.xml
> CIB updated


Make the second node's real fence device prefer the second node:

> [root@virt-513 ~]# pcs constraint location fence-virt-514 prefers virt-514
> [root@virt-513 ~]# pcs constraint list --full
> Warning: This command is deprecated and will be removed. Please use 'pcs constraint config' instead.
> Location Constraints:
>   Resource: fence-virt-514
>     Enabled on:
>       Node: virt-514 (score:INFINITY) (id:location-fence-virt-514-virt-514-INFINITY)
> Ordering Constraints:
> Colocation Constraints:
> Ticket Constraints:


Create a dynamic fence device that always fails using the dummy fence agent:

> [root@virt-513 ~]# pcs stonith create bz1978013 fence_bz1978013 pcmk_host_check="dynamic-list" pcmk_host_map='virt-513:frist;virt-514:second' type=fail
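
For readers unfamiliar with the pcmk_host_map value used in the command above, the sketch below splits it into node-to-alias pairs. It is a simplification that assumes only the ";" and ":" separators seen in this report; the function name parse_host_map is hypothetical and is not part of Pacemaker or pcs.

```python
def parse_host_map(value):
    """Split a 'node:alias;node:alias' string into a dict of node -> alias."""
    mapping = {}
    for entry in value.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing or doubled ';'
        node, sep, alias = entry.partition(":")
        if not sep or not node or not alias:
            raise ValueError(f"malformed host map entry: {entry!r}")
        mapping[node] = alias
    return mapping
```

Applied to the value from the command, this yields one alias per cluster node. Because the agent's list action always fails in this test, the alias strings themselves are never matched against anything.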


Trigger fencing of the second node:

> [root@virt-513 ~]# pcs stonith fence virt-514
> Node: virt-514 fenced


Excerpt from the pacemaker-fenced log:

> Aug 24 19:58:11.801 virt-513 pacemaker-fenced    [54053] (handle_request) 	notice: Client stonith_admin.70108 wants to fence (reboot) virt-514 using any device
> Aug 24 19:58:11.802 virt-513 pacemaker-fenced    [54053] (initiate_remote_stonith_op) 	notice: Requesting peer fencing (reboot) targeting virt-514 | id=63dd8345 state=querying base_timeout=120
> Aug 24 19:58:11.807 virt-513 pacemaker-fenced    [54053] (can_fence_host_with_device) 	notice: fence-virt-514 is eligible to fence (reboot) virt-514 (aka. 'virt-514.cluster-qe.lab.eng.brq.redhat.com'): static-list
> Aug 24 19:58:11.807 virt-513 pacemaker-fenced    [54053] (can_fence_host_with_device) 	notice: fence-virt-513 is not eligible to fence (reboot) virt-514: static-list
> Aug 24 19:58:11.892 virt-513 pacemaker-fenced    [54053] (log_op_output) 	notice: fence_bz1978013_list_1[70109] error output [ 2021-08-24 19:58:11,881 ERROR: Failed: Unrecognised action 'list' ]
> Aug 24 19:58:11.892 virt-513 pacemaker-fenced    [54053] (log_op_output) 	notice: fence_bz1978013_list_1[70109] error output [  ]
> Aug 24 19:58:11.892 virt-513 pacemaker-fenced    [54053] (log_op_output) 	notice: fence_bz1978013_list_1[70109] error output [ 2021-08-24 19:58:11,883 ERROR: Please use '-h' for usage ]
> Aug 24 19:58:11.892 virt-513 pacemaker-fenced    [54053] (log_op_output) 	notice: fence_bz1978013_list_1[70109] error output [  ]
> Aug 24 19:58:11.892 virt-513 pacemaker-fenced    [54053] (log_action) 	warning: fence_bz1978013[70109] stderr: [ 2021-08-24 19:58:11,881 ERROR: Failed: Unrecognised action 'list' ]
> Aug 24 19:58:11.893 virt-513 pacemaker-fenced    [54053] (log_action) 	warning: fence_bz1978013[70109] stderr: [  ]
> Aug 24 19:58:11.893 virt-513 pacemaker-fenced    [54053] (log_action) 	warning: fence_bz1978013[70109] stderr: [ 2021-08-24 19:58:11,883 ERROR: Please use '-h' for usage ]
> Aug 24 19:58:11.893 virt-513 pacemaker-fenced    [54053] (log_action) 	warning: fence_bz1978013[70109] stderr: [  ]
> Aug 24 19:58:11.893 virt-513 pacemaker-fenced    [54053] (internal_stonith_action_execute) 	info: Attempt 2 to execute fence_bz1978013 (list). remaining timeout is 120
> Aug 24 19:58:12.969 virt-513 pacemaker-fenced    [54053] (process_remote_stonith_query) 	info: Query result 1 of 2 from virt-514 for virt-514/reboot (1 device) 63dd8345-31f3-48d6-ae70-753b3aadab96
> Aug 24 19:58:12.981 virt-513 pacemaker-fenced    [54053] (log_op_output) 	notice: fence_bz1978013_list_2[70112] error output [ 2021-08-24 19:58:12,966 ERROR: Failed: Unrecognised action 'list' ]
> Aug 24 19:58:12.981 virt-513 pacemaker-fenced    [54053] (log_op_output) 	notice: fence_bz1978013_list_2[70112] error output [  ]
> Aug 24 19:58:12.981 virt-513 pacemaker-fenced    [54053] (log_op_output) 	notice: fence_bz1978013_list_2[70112] error output [ 2021-08-24 19:58:12,967 ERROR: Please use '-h' for usage ]
> Aug 24 19:58:12.981 virt-513 pacemaker-fenced    [54053] (log_op_output) 	notice: fence_bz1978013_list_2[70112] error output [  ]
> Aug 24 19:58:12.981 virt-513 pacemaker-fenced    [54053] (log_action) 	warning: fence_bz1978013[70112] stderr: [ 2021-08-24 19:58:12,966 ERROR: Failed: Unrecognised action 'list' ]
> Aug 24 19:58:12.981 virt-513 pacemaker-fenced    [54053] (log_action) 	warning: fence_bz1978013[70112] stderr: [  ]
> Aug 24 19:58:12.982 virt-513 pacemaker-fenced    [54053] (log_action) 	warning: fence_bz1978013[70112] stderr: [ 2021-08-24 19:58:12,967 ERROR: Please use '-h' for usage ]
> Aug 24 19:58:12.982 virt-513 pacemaker-fenced    [54053] (log_action) 	warning: fence_bz1978013[70112] stderr: [  ]
> Aug 24 19:58:12.982 virt-513 pacemaker-fenced    [54053] (update_remaining_timeout) 	info: Attempted to execute agent fence_bz1978013 (list) the maximum number of times (2) allowed
> Aug 24 19:58:12.983 virt-513 pacemaker-fenced    [54053] (process_remote_stonith_query) 	info: Query result 2 of 2 from virt-513 for virt-514/reboot (1 device) 63dd8345-31f3-48d6-ae70-753b3aadab96
> Aug 24 19:58:12.983 virt-513 pacemaker-fenced    [54053] (process_remote_stonith_query) 	info: All query replies have arrived, continuing (2 expected/2 received) 
> Aug 24 19:58:12.983 virt-513 pacemaker-fenced    [54053] (call_remote_stonith) 	info: Total timeout set to 120 for peer's fencing targeting virt-514 for stonith_admin.70108|id=63dd8345
> Aug 24 19:58:12.983 virt-513 pacemaker-fenced    [54053] (call_remote_stonith) 	notice: Requesting that virt-513 perform 'reboot' action targeting virt-514 | for client stonith_admin.70108 (144s, 0s)
> Aug 24 19:58:12.984 virt-513 pacemaker-fenced    [54053] (can_fence_host_with_device) 	notice: fence-virt-514 is eligible to fence (reboot) virt-514 (aka. 'virt-514.cluster-qe.lab.eng.brq.redhat.com'): static-list
> Aug 24 19:58:12.984 virt-513 pacemaker-fenced    [54053] (can_fence_host_with_device) 	notice: fence-virt-513 is not eligible to fence (reboot) virt-514: static-list
> Aug 24 19:58:13.062 virt-513 pacemaker-fenced    [54053] (log_op_output) 	notice: fence_bz1978013_list_1[70113] error output [ 2021-08-24 19:58:13,055 ERROR: Failed: Unrecognised action 'list' ]
> Aug 24 19:58:13.062 virt-513 pacemaker-fenced    [54053] (log_op_output) 	notice: fence_bz1978013_list_1[70113] error output [  ]
> Aug 24 19:58:13.062 virt-513 pacemaker-fenced    [54053] (log_op_output) 	notice: fence_bz1978013_list_1[70113] error output [ 2021-08-24 19:58:13,055 ERROR: Please use '-h' for usage ]
> Aug 24 19:58:13.062 virt-513 pacemaker-fenced    [54053] (log_op_output) 	notice: fence_bz1978013_list_1[70113] error output [  ]
> Aug 24 19:58:13.062 virt-513 pacemaker-fenced    [54053] (log_action) 	warning: fence_bz1978013[70113] stderr: [ 2021-08-24 19:58:13,055 ERROR: Failed: Unrecognised action 'list' ]
> Aug 24 19:58:13.062 virt-513 pacemaker-fenced    [54053] (log_action) 	warning: fence_bz1978013[70113] stderr: [  ]
> Aug 24 19:58:13.062 virt-513 pacemaker-fenced    [54053] (log_action) 	warning: fence_bz1978013[70113] stderr: [ 2021-08-24 19:58:13,055 ERROR: Please use '-h' for usage ]
> Aug 24 19:58:13.063 virt-513 pacemaker-fenced    [54053] (log_action) 	warning: fence_bz1978013[70113] stderr: [  ]
> Aug 24 19:58:13.063 virt-513 pacemaker-fenced    [54053] (internal_stonith_action_execute) 	info: Attempt 2 to execute fence_bz1978013 (list). remaining timeout is 119
> Aug 24 19:58:14.144 virt-513 pacemaker-fenced    [54053] (log_op_output) 	notice: fence_bz1978013_list_2[70114] error output [ 2021-08-24 19:58:14,134 ERROR: Failed: Unrecognised action 'list' ]
> Aug 24 19:58:14.144 virt-513 pacemaker-fenced    [54053] (log_op_output) 	notice: fence_bz1978013_list_2[70114] error output [  ]
> Aug 24 19:58:14.144 virt-513 pacemaker-fenced    [54053] (log_op_output) 	notice: fence_bz1978013_list_2[70114] error output [ 2021-08-24 19:58:14,135 ERROR: Please use '-h' for usage ]
> Aug 24 19:58:14.144 virt-513 pacemaker-fenced    [54053] (log_op_output) 	notice: fence_bz1978013_list_2[70114] error output [  ]
> Aug 24 19:58:14.145 virt-513 pacemaker-fenced    [54053] (log_action) 	warning: fence_bz1978013[70114] stderr: [ 2021-08-24 19:58:14,134 ERROR: Failed: Unrecognised action 'list' ]
> Aug 24 19:58:14.145 virt-513 pacemaker-fenced    [54053] (log_action) 	warning: fence_bz1978013[70114] stderr: [  ]
> Aug 24 19:58:14.145 virt-513 pacemaker-fenced    [54053] (log_action) 	warning: fence_bz1978013[70114] stderr: [ 2021-08-24 19:58:14,135 ERROR: Please use '-h' for usage ]
> Aug 24 19:58:14.145 virt-513 pacemaker-fenced    [54053] (log_action) 	warning: fence_bz1978013[70114] stderr: [  ]
> Aug 24 19:58:14.145 virt-513 pacemaker-fenced    [54053] (update_remaining_timeout) 	info: Attempted to execute agent fence_bz1978013 (list) the maximum number of times (2) allowed
> Aug 24 19:58:14.145 virt-513 pacemaker-fenced    [54053] (stonith_fence_get_devices_cb) 	info: Found 1 matching device for target 'virt-514'
> Aug 24 19:58:16.581 virt-513 pacemaker-fenced    [54053] (log_operation) 	notice: Operation 'reboot' [70115] (call 2 from stonith_admin.70108) targeting virt-514 using fence-virt-514 returned 0 (OK)


Result: The dummy fence device is ignored because its list action fails; the fallback real fence device is selected and used without an unnecessary 2-minute delay.
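
A check like the one above can be scripted. The sketch below is a hypothetical helper, not a Pacemaker tool: the function summarize and both regular expressions are assumptions based only on the message wording in the log excerpt from this comment, which may differ in other Pacemaker versions.

```python
import re

# Patterns modeled on the pacemaker-fenced messages quoted in this comment.
ELIGIBLE = re.compile(r"notice: (\S+) is eligible to fence \((\w+)\) (\S+)")
RESULT = re.compile(
    r"Operation '(\w+)' \[\d+\] .* targeting (\S+) using (\S+) returned (-?\d+)"
)


def summarize(log_lines):
    """Return (set of eligible device names, (device used, return code) or None)."""
    eligible = set()
    used = None
    for line in log_lines:
        match = ELIGIBLE.search(line)
        if match:
            eligible.add(match.group(1))
        match = RESULT.search(line)
        if match:
            used = (match.group(3), int(match.group(4)))
    return eligible, used
```

Fed the excerpt above, the helper should report fence-virt-514 as the only eligible device and as the device that performed the reboot.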