Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets there.

Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September per pre-agreed dates. Bugs against the components "kernel", "kernel-rt", and "kpatch" are migrated only if still in "NEW" or "ASSIGNED".

If you cannot log in to RH Jira, please consult article #7032570. Failing that, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry; the e-mail creates a ServiceNow ticket with Red Hat.

Migrated Bugzilla bugs will be moved to status "CLOSED", resolution "MIGRATED", and have "MigratedToJIRA" added to "Keywords". The link to the successor Jira issue can be found under "Links", has a small "two-footprint" icon next to it, and leads to the "RHEL project" in Red Hat Jira (issue links are of the form "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). The same link is also shown in a blue banner at the top of the page informing you that the bug has been migrated.

Bug 2106170

Summary: Unfencing should fail immediately if one automatic unfencing device fails
Product: Red Hat Enterprise Linux 8
Reporter: Reid Wahl <nwahl>
Component: pacemaker
Assignee: Ken Gaillot <kgaillot>
Status: CLOSED MIGRATED
QA Contact: cluster-qe <cluster-qe>
Severity: low
Priority: low
Version: 8.6
CC: cluster-maint
Target Milestone: rc
Keywords: MigratedToJIRA, Triaged
Flags: pm-rhel: mirror+
Hardware: All
OS: Linux
Doc Type: If docs needed, set a value
Last Closed: 2023-09-22 19:44:36 UTC
Type: Enhancement

Description Reid Wahl 2022-07-11 22:40:41 UTC
Description of problem:

Suppose we have two fencing levels. One level contains a device that will fail its "on" action, and the other level contains a device that will pass its "on" action. Both are automatic. All devices with automatic unfencing must succeed in order for the unfencing action to succeed, so we should short-circuit if the first device/level fails. Instead, we run the second level. The second level succeeds and then we declare the unfencing action a failure.
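The rule at issue can be sketched as a toy model (plain Python, not Pacemaker's actual C implementation; the device tuples and the unfence() helper are hypothetical, for illustration only):

```python
# Toy model of the "short-circuit" rule proposed in this report.
# A device is (name, provides_unfencing, on_succeeds); a topology is a
# list of levels, each a list of devices, tried in order.

def unfence(levels):
    """Return True if unfencing succeeds under the rule that every
    device providing automatic unfencing must pass its "on" action.
    Stop at the first automatic device that fails, without running
    later levels (the behavior this report asks for)."""
    for level in levels:
        for name, provides_unfencing, on_ok in level:
            if provides_unfencing and not on_ok:
                # An automatic-unfencing device failed "on": the whole
                # operation must fail, so there is no point running
                # the remaining levels.
                return False
    return True

# The configuration from this report: badfence (level 1) fails its
# "on" action, goodfence (level 2) passes; both provide unfencing.
levels = [
    [("badfence", True, False)],
    [("goodfence", True, True)],
]
print(unfence(levels))  # → False, without attempting level 2
```

Under this model the operation fails as soon as badfence fails, instead of running goodfence and only then declaring the operation a failure.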

[root@fastvm-rhel-8-0-23 ~]# pcs stonith config
 Resource: goodfence (class=stonith type=fence_dummy_auto_unfence)
  Attributes: mode=pass monitor_mode=pass pcmk_host_list="node1 node2"
  Meta Attrs: provides=unfencing
  Operations: monitor interval=60s (goodfence-monitor-interval-60s)
 Resource: badfence (class=stonith type=fence_dummy_auto_unfence)
  Attributes: mode=fail monitor_mode=pass pcmk_host_list="node1 node2"
  Operations: monitor interval=60s (badfence-monitor-interval-60s)
 Target: node2
   Level 1 - badfence
   Level 2 - goodfence


[root@fastvm-rhel-8-0-23 ~]# pcs cluster start node2 & tail -f /var/log/pacemaker/pacemaker.log 
...
Jul 11 15:30:27 fastvm-rhel-8-0-23 pacemaker-controld  [1796707] (te_fence_node) 	notice: Requesting fencing (on) of node node2 | action=3 timeout=60000
Jul 11 15:30:27 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (handle_request) 	notice: Client pacemaker-controld.1796707 wants to fence (on) node2 using any device
Jul 11 15:30:27 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (initiate_remote_stonith_op) 	notice: Requesting peer fencing (on) targeting node2 | id=8e3f6d15 state=querying base_timeout=60
Jul 11 15:30:27 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (can_fence_host_with_device) 	notice: goodfence is eligible to fence (on) node2: static-list
Jul 11 15:30:27 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (can_fence_host_with_device) 	notice: badfence is eligible to fence (on) node2: static-list
Jul 11 15:30:27 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (process_remote_stonith_query) 	info: Query result 1 of 2 from node1 for node2/on (2 devices) 8e3f6d15-531b-4245-843a-148dd57f7282
Jul 11 15:30:27 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (request_peer_fencing) 	info: Total timeout set to 120 for peer's fencing targeting node2 for pacemaker-controld.1796707|id=8e3f6d15
Jul 11 15:30:27 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (request_peer_fencing) 	notice: Requesting that node1 perform 'on' action targeting node2 using badfence | for client pacemaker-controld.1796707 (72s)
Jul 11 15:30:27 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (make_args) 	info: Passing '2' as nodeid with fence action 'on' targeting node2
Jul 11 15:30:27 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (process_remote_stonith_query) 	info: Query result 2 of 2 from node2 for node2/on (0 devices) 8e3f6d15-531b-4245-843a-148dd57f7282
Jul 11 15:30:27 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (log_op_output) 	info: fence_dummy_auto_unfence_on_1[1797364] error output [ simulated on failure ]
Jul 11 15:30:27 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (log_action) 	warning: fence_dummy_auto_unfence[1797364] stderr: [ simulated on failure ]
Jul 11 15:30:27 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (internal_stonith_action_execute) 	info: Attempt 2 to execute fence_dummy_auto_unfence (on). remaining timeout is 60
Jul 11 15:30:28 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (log_op_output) 	info: fence_dummy_auto_unfence_on_2[1797367] error output [ simulated on failure ]
Jul 11 15:30:28 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (log_action) 	warning: fence_dummy_auto_unfence[1797367] stderr: [ simulated on failure ]
Jul 11 15:30:28 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (update_remaining_timeout) 	info: Attempted to execute agent fence_dummy_auto_unfence (on) the maximum number of times (2) allowed
Jul 11 15:30:28 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (log_async_result) 	error: Operation 'on' [1797367] targeting node2 using badfence returned 1 | call 24 from pacemaker-controld.1796707
Jul 11 15:30:28 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (fenced_process_fencing_reply) 	notice: Action 'on' targeting node2 using badfence on behalf of pacemaker-controld.1796707@node1: complete
Jul 11 15:30:28 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (request_peer_fencing) 	notice: Requesting that node1 perform 'on' action targeting node2 using goodfence | for client pacemaker-controld.1796707 (72s)
Jul 11 15:30:28 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (make_args) 	info: Passing '2' as nodeid with fence action 'on' targeting node2
Jul 11 15:30:28 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (log_async_result) 	notice: Operation 'on' [1797369] targeting node2 using goodfence returned 0 | call 24 from pacemaker-controld.1796707
Jul 11 15:30:28 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (fenced_process_fencing_reply) 	notice: Action 'on' targeting node2 using goodfence on behalf of pacemaker-controld.1796707@node1: complete
Jul 11 15:30:28 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (advance_topology_level) 	notice: All fencing options targeting node2 for client pacemaker-controld.1796707@node1 failed | id=8e3f6d15
Jul 11 15:30:28 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (stonith_choose_peer) 	notice: Couldn't find anyone to fence (on) node2 using badfence
Jul 11 15:30:28 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (request_peer_fencing) 	info: No peers (out of 2) are capable of fencing (on) node2 for client pacemaker-controld.1796707 | state=executing
Jul 11 15:30:28 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (finalize_op) 	error: Operation 'on' targeting node2 by unknown node for pacemaker-controld.1796707@node1: Error occurred (No fence device) | id=8e3f6d15
Jul 11 15:30:28 fastvm-rhel-8-0-23 pacemaker-controld  [1796707] (tengine_stonith_callback) 	warning: Fence operation 24 for node2 failed: No fence device (aborting transition and giving up for now)
Jul 11 15:30:28 fastvm-rhel-8-0-23 pacemaker-controld  [1796707] (abort_transition_graph) 	notice: Transition 42 aborted: Stonith failed | source=abort_for_stonith_failure:257 complete=false
Jul 11 15:30:28 fastvm-rhel-8-0-23 pacemaker-controld  [1796707] (handle_fence_notification) 	error: Unfencing of node2 by the cluster failed (No fence device) with exit status 1

---

Version-Release number of selected component (if applicable):

pacemaker-2.1.2-4.el8_6.2.x86_64

---

How reproducible:

Always

---

Steps to Reproduce:
1. Configure a fence_dummy_auto_unfence device with mode=fail (let's call it "badfence") as level 1 and another one with mode=pass (call it "goodfence") as level 2 for a particular node.
2. Join that node to the cluster.
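As a sketch, the topology in step 1 can be built with commands along these lines (resource names follow this report; fence_dummy_auto_unfence is assumed to be a locally modified fence_dummy whose metadata advertises automatic unfencing, and exact pcs syntax may vary by version):

```shell
# Create a device that fails its "on" action and one that passes it;
# both provide automatic unfencing.
pcs stonith create badfence fence_dummy_auto_unfence \
    mode=fail monitor_mode=pass pcmk_host_list="node1 node2" \
    meta provides=unfencing
pcs stonith create goodfence fence_dummy_auto_unfence \
    mode=pass monitor_mode=pass pcmk_host_list="node1 node2" \
    meta provides=unfencing

# Put the failing device at level 1 and the passing one at level 2
# for the target node.
pcs stonith level add 1 node2 badfence
pcs stonith level add 2 node2 goodfence
```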

---

Actual results:

badfence fails its "on" action. goodfence passes its "on" action. The unfencing action is declared a failure.

---

Expected results:

badfence fails its "on" action and the unfencing action is declared a failure. Alternatively, we could re-evaluate the design so that we proceed to level 2 and honor level 2's result, regardless of "automatic" unfencing.
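The alternative design mentioned above amounts to per-level rather than per-device semantics; a hypothetical sketch (again a toy model, not Pacemaker code, with made-up helper names):

```python
# Toy model of the alternative semantics: a level succeeds only if
# every device in it passes "on", and the operation succeeds as soon
# as any complete level succeeds, regardless of earlier failed levels.

def unfence_per_level(levels):
    """levels is a list of levels, each a list of (name, on_succeeds)
    tuples. Honor the first level whose devices all pass."""
    for level in levels:
        if all(on_ok for _name, on_ok in level):
            return True
    return False

# badfence (level 1) fails, goodfence (level 2) passes: under these
# semantics the unfencing operation would succeed via level 2.
print(unfence_per_level([[("badfence", False)], [("goodfence", True)]]))  # → True
```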

---

Additional info:

I had said privately today that the behavior was reproducible with non-automatic devices (e.g., fence_kdump). I can't reproduce that now, so I was probably mistaken.

Comment 1 Reid Wahl 2022-07-11 22:53:10 UTC
Perhaps more importantly, the behavior is inconsistent. Let's look at two more scenarios. All three scenarios exhibit different behavior. It seems to me that they should behave the same, regardless of what we define as the correct behavior.

1. Below, I've added a working "xvm" fence device to the front of level 1. Now, xvm passes its "on" action and badfence fails its "on" action, but we move on to level 2 and honor level 2's successful result.

[root@fastvm-rhel-8-0-23 ~]# pcs stonith config
 Resource: goodfence (class=stonith type=fence_dummy)
  Attributes: mode=pass monitor_mode=pass pcmk_host_list="node1 node2"
  Meta Attrs: provides=unfencing
  Operations: monitor interval=60s (goodfence-monitor-interval-60s)
 Resource: badfence (class=stonith type=fence_dummy)
  Attributes: mode=fail monitor_mode=pass pcmk_host_list="node1 node2"
  Operations: monitor interval=60s (badfence-monitor-interval-60s)
 Resource: xvm (class=stonith type=fence_xvm)
  Attributes: pcmk_host_map=node1:fastvm-rhel-8.0-23;node2:fastvm-rhel-8.0-24
  Operations: monitor interval=60s (xvm-monitor-interval-60s)
 Target: node2
   Level 1 - xvm,badfence
   Level 2 - goodfence

Jul 11 15:45:00 fastvm-rhel-8-0-23 pacemaker-controld  [1796707] (te_fence_node) 	notice: Requesting fencing (on) of node node2 | action=4 timeout=60000
Jul 11 15:45:00 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (handle_request) 	notice: Client pacemaker-controld.1796707 wants to fence (on) node2 using any device
Jul 11 15:45:00 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (initiate_remote_stonith_op) 	notice: Requesting peer fencing (on) targeting node2 | id=aaaea5b8 state=querying base_timeout=60
Jul 11 15:45:00 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (can_fence_host_with_device) 	notice: xvm is eligible to fence (on) node2 (aka. 'fastvm-rhel-8.0-24'): static-list
Jul 11 15:45:00 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (process_remote_stonith_query) 	info: Query result 1 of 2 from node1 for node2/on (1 device) aaaea5b8-d34a-457b-a821-73b2bd716038
Jul 11 15:45:00 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (process_remote_stonith_query) 	info: Query result 2 of 2 from node2 for node2/on (2 devices) aaaea5b8-d34a-457b-a821-73b2bd716038
Jul 11 15:45:00 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (request_peer_fencing) 	info: Total timeout set to 180 for peer's fencing targeting node2 for pacemaker-controld.1796707|id=aaaea5b8
Jul 11 15:45:00 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (request_peer_fencing) 	notice: Requesting that node1 perform 'on' action targeting node2 using xvm | for client pacemaker-controld.1796707 (72s)
Jul 11 15:45:00 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (log_async_result) 	notice: Operation 'on' [1797783] targeting node2 using xvm returned 0 | call 33 from pacemaker-controld.1796707
Jul 11 15:45:00 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (fenced_process_fencing_reply) 	notice: Action 'on' targeting node2 using xvm on behalf of pacemaker-controld.1796707@node1: complete
Jul 11 15:45:00 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (request_peer_fencing) 	notice: Requesting that node2 perform 'on' action targeting node2 using badfence | for client pacemaker-controld.1796707 (72s)
Jul 11 15:45:02 fastvm-rhel-8-0-24 pacemaker-fenced    [142828] (log_async_result)  error: Operation 'on' [142841] targeting node2 using badfence returned 1 | call 33 from pacemaker-controld.1796707
Jul 11 15:45:02 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (fenced_process_fencing_reply) 	notice: Action 'on' targeting node2 using badfence on behalf of pacemaker-controld.1796707@node1: complete
Jul 11 15:45:02 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (request_peer_fencing) 	notice: Requesting that node2 perform 'on' action targeting node2 using goodfence | for client pacemaker-controld.1796707 (72s)
Jul 11 15:45:02 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (fenced_process_fencing_reply) 	notice: Action 'on' targeting node2 using goodfence on behalf of pacemaker-controld.1796707@node1: complete
Jul 11 15:45:02 fastvm-rhel-8-0-24 pacemaker-fenced    [142828] (log_async_result)  notice: Operation 'on' [142843] targeting node2 using goodfence returned 0 | call 33 from pacemaker-controld.1796707
Jul 11 15:45:02 fastvm-rhel-8-0-24 pacemaker-fenced    [142828] (finalize_op)   notice: Operation 'on' targeting node2 by node2 for pacemaker-controld.1796707@node1: OK (complete) | id=aaaea5b8
Jul 11 15:45:02 fastvm-rhel-8-0-23 pacemaker-controld  [1796707] (tengine_stonith_callback) 	notice: Fence operation 33 for node2 passed
Jul 11 15:45:02 fastvm-rhel-8-0-23 pacemaker-controld  [1796707] (handle_fence_notification) 	notice: node2 was unfenced by node2 at the request of pacemaker-controld.1796707@node1

---

2. Finally, I've added to the beginning of **both** levels an xvm stonith device that will pass its "on" action. This time, we pass xvm in level 1, fail badfence in level 1, and then do not even attempt level 2. We short-circuit after level 1.

[root@fastvm-rhel-8-0-23 ~]# pcs stonith config
 Resource: goodfence (class=stonith type=fence_dummy)
  Attributes: mode=pass monitor_mode=pass pcmk_host_list="node1 node2"
  Meta Attrs: provides=unfencing
  Operations: monitor interval=60s (goodfence-monitor-interval-60s)
 Resource: badfence (class=stonith type=fence_dummy)
  Attributes: mode=fail monitor_mode=pass pcmk_host_list="node1 node2"
  Operations: monitor interval=60s (badfence-monitor-interval-60s)
 Resource: xvm (class=stonith type=fence_xvm)
  Attributes: pcmk_host_map=node1:fastvm-rhel-8.0-23;node2:fastvm-rhel-8.0-24
  Operations: monitor interval=60s (xvm-monitor-interval-60s)
 Target: node2
   Level 1 - xvm,badfence
   Level 2 - xvm,goodfence


Jul 11 15:51:38 fastvm-rhel-8-0-23 pacemaker-controld  [1796707] (te_fence_node) 	notice: Requesting fencing (on) of node node2 | action=4 timeout=60000
Jul 11 15:51:38 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (handle_request) 	notice: Client pacemaker-controld.1796707 wants to fence (on) node2 using any device
Jul 11 15:51:38 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (initiate_remote_stonith_op) 	notice: Requesting peer fencing (on) targeting node2 | id=e735ed02 state=querying base_timeout=60
Jul 11 15:51:38 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (can_fence_host_with_device) 	notice: xvm is eligible to fence (on) node2 (aka. 'fastvm-rhel-8.0-24'): static-list
Jul 11 15:51:38 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (process_remote_stonith_query) 	info: Query result 1 of 2 from node1 for node2/on (1 device) e735ed02-0faf-4eed-b993-7afb1cc27d12
Jul 11 15:51:38 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (process_remote_stonith_query) 	info: Query result 2 of 2 from node2 for node2/on (2 devices) e735ed02-0faf-4eed-b993-7afb1cc27d12
Jul 11 15:51:38 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (request_peer_fencing) 	info: Total timeout set to 240 for peer's fencing targeting node2 for pacemaker-controld.1796707|id=e735ed02
Jul 11 15:51:38 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (request_peer_fencing) 	notice: Requesting that node1 perform 'on' action targeting node2 using xvm | for client pacemaker-controld.1796707 (72s)
Jul 11 15:51:38 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (log_async_result) 	notice: Operation 'on' [1797939] targeting node2 using xvm returned 0 | call 35 from pacemaker-controld.1796707
Jul 11 15:51:38 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (fenced_process_fencing_reply) 	notice: Action 'on' targeting node2 using xvm on behalf of pacemaker-controld.1796707@node1: complete
Jul 11 15:51:38 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (request_peer_fencing) 	notice: Requesting that node2 perform 'on' action targeting node2 using badfence | for client pacemaker-controld.1796707 (72s)
Jul 11 15:51:39 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (fenced_process_fencing_reply) 	notice: Action 'on' targeting node2 using badfence on behalf of pacemaker-controld.1796707@node1: complete
Jul 11 15:51:39 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (advance_topology_level) 	notice: All fencing options targeting node2 for client pacemaker-controld.1796707@node1 failed | id=e735ed02
Jul 11 15:51:39 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (stonith_choose_peer) 	notice: Couldn't find anyone to fence (on) node2 using xvm
Jul 11 15:51:39 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (request_peer_fencing) 	info: No peers (out of 2) are capable of fencing (on) node2 for client pacemaker-controld.1796707 | state=executing
Jul 11 15:51:39 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (finalize_op) 	error: Operation 'on' targeting node2 by unknown node for pacemaker-controld.1796707@node1: Error occurred (No fence device) | id=e735ed02
Jul 11 15:51:39 fastvm-rhel-8-0-23 pacemaker-controld  [1796707] (tengine_stonith_callback) 	warning: Fence operation 35 for node2 failed: No fence device (aborting transition and giving up for now)
Jul 11 15:51:39 fastvm-rhel-8-0-23 pacemaker-controld  [1796707] (abort_transition_graph) 	notice: Transition 78 aborted: Stonith failed | source=abort_for_stonith_failure:257 complete=false
Jul 11 15:51:39 fastvm-rhel-8-0-23 pacemaker-controld  [1796707] (handle_fence_notification) 	error: Unfencing of node2 by the cluster failed (No fence device) with exit status 1

Comment 2 Reid Wahl 2022-07-12 04:03:24 UTC
(In reply to Reid Wahl from comment #1)
I didn't realize that I had switched the stonith devices back from fence_dummy_auto_unfence to fence_dummy. The issue in comment 0 is still valid. Comments on the others below.

> 1. Below, I've added a working "xvm" fence device to the front of level 1.
> Now, xvm passes its "on" action and badfence fails its "on" action, but we
> move on to level 2 and honor level 2's successful result.

This does not apply to auto_unfence, so it's behaving as designed.

 
> 2. Finally, I've added to the beginning of **both** levels an xvm stonith
> device that will pass its "on" action. This time, we pass xvm in level 1,
> fail badfence in level 1, and then do not even attempt level 2. We
> short-circuit after level 1.

I opened BZ2106182 for this. It happens without auto unfencing and IIRC also with auto unfencing.

Comment 4 RHEL Program Management 2023-09-22 19:39:33 UTC
Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.

Comment 5 RHEL Program Management 2023-09-22 19:44:36 UTC
This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there.

Due to differences in account names between systems, some fields were not replicated.  Be sure to add yourself to Jira issue's "Watchers" field to continue receiving updates and add others to the "Need Info From" field to continue requesting information.

To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of 2 footprints next to it, and begin with "RHEL-" followed by an integer.  You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like:

"Bugzilla Bug" = 1234567

In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues. You can also visit https://access.redhat.com/articles/7032570 for general account information.