Description of problem:
Suppose we have two fencing levels. One level contains a device that will fail its "on" action, and the other level contains a device that will pass its "on" action. Both are automatic. All devices with automatic unfencing must succeed in order for the unfencing action to succeed, so we should short-circuit if the first device/level fails. Instead, we run the second level. The second level succeeds and then we declare the unfencing action a failure.

[root@fastvm-rhel-8-0-23 ~]# pcs stonith config
 Resource: goodfence (class=stonith type=fence_dummy_auto_unfence)
  Attributes: mode=pass monitor_mode=pass pcmk_host_list="node1 node2"
  Meta Attrs: provides=unfencing
  Operations: monitor interval=60s (goodfence-monitor-interval-60s)
 Resource: badfence (class=stonith type=fence_dummy_auto_unfence)
  Attributes: mode=fail monitor_mode=pass pcmk_host_list="node1 node2"
  Operations: monitor interval=60s (badfence-monitor-interval-60s)
 Target: node2
   Level 1 - badfence
   Level 2 - goodfence

[root@fastvm-rhel-8-0-23 ~]# pcs cluster start node2 & tail -f /var/log/pacemaker/pacemaker.log
...
Jul 11 15:30:27 fastvm-rhel-8-0-23 pacemaker-controld [1796707] (te_fence_node) notice: Requesting fencing (on) of node node2 | action=3 timeout=60000
Jul 11 15:30:27 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (handle_request) notice: Client pacemaker-controld.1796707 wants to fence (on) node2 using any device
Jul 11 15:30:27 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (initiate_remote_stonith_op) notice: Requesting peer fencing (on) targeting node2 | id=8e3f6d15 state=querying base_timeout=60
Jul 11 15:30:27 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (can_fence_host_with_device) notice: goodfence is eligible to fence (on) node2: static-list
Jul 11 15:30:27 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (can_fence_host_with_device) notice: badfence is eligible to fence (on) node2: static-list
Jul 11 15:30:27 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (process_remote_stonith_query) info: Query result 1 of 2 from node1 for node2/on (2 devices) 8e3f6d15-531b-4245-843a-148dd57f7282
Jul 11 15:30:27 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (request_peer_fencing) info: Total timeout set to 120 for peer's fencing targeting node2 for pacemaker-controld.1796707|id=8e3f6d15
Jul 11 15:30:27 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (request_peer_fencing) notice: Requesting that node1 perform 'on' action targeting node2 using badfence | for client pacemaker-controld.1796707 (72s)
Jul 11 15:30:27 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (make_args) info: Passing '2' as nodeid with fence action 'on' targeting node2
Jul 11 15:30:27 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (process_remote_stonith_query) info: Query result 2 of 2 from node2 for node2/on (0 devices) 8e3f6d15-531b-4245-843a-148dd57f7282
Jul 11 15:30:27 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (log_op_output) info: fence_dummy_auto_unfence_on_1[1797364] error output [ simulated on failure ]
Jul 11 15:30:27 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (log_action) warning: fence_dummy_auto_unfence[1797364] stderr: [ simulated on failure ]
Jul 11 15:30:27 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (internal_stonith_action_execute) info: Attempt 2 to execute fence_dummy_auto_unfence (on). remaining timeout is 60
Jul 11 15:30:28 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (log_op_output) info: fence_dummy_auto_unfence_on_2[1797367] error output [ simulated on failure ]
Jul 11 15:30:28 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (log_action) warning: fence_dummy_auto_unfence[1797367] stderr: [ simulated on failure ]
Jul 11 15:30:28 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (update_remaining_timeout) info: Attempted to execute agent fence_dummy_auto_unfence (on) the maximum number of times (2) allowed
Jul 11 15:30:28 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (log_async_result) error: Operation 'on' [1797367] targeting node2 using badfence returned 1 | call 24 from pacemaker-controld.1796707
Jul 11 15:30:28 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (fenced_process_fencing_reply) notice: Action 'on' targeting node2 using badfence on behalf of pacemaker-controld.1796707@node1: complete
Jul 11 15:30:28 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (request_peer_fencing) notice: Requesting that node1 perform 'on' action targeting node2 using goodfence | for client pacemaker-controld.1796707 (72s)
Jul 11 15:30:28 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (make_args) info: Passing '2' as nodeid with fence action 'on' targeting node2
Jul 11 15:30:28 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (log_async_result) notice: Operation 'on' [1797369] targeting node2 using goodfence returned 0 | call 24 from pacemaker-controld.1796707
Jul 11 15:30:28 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (fenced_process_fencing_reply) notice: Action 'on' targeting node2 using goodfence on behalf of pacemaker-controld.1796707@node1: complete
Jul 11 15:30:28 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (advance_topology_level) notice: All fencing options targeting node2 for client pacemaker-controld.1796707@node1 failed | id=8e3f6d15
Jul 11 15:30:28 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (stonith_choose_peer) notice: Couldn't find anyone to fence (on) node2 using badfence
Jul 11 15:30:28 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (request_peer_fencing) info: No peers (out of 2) are capable of fencing (on) node2 for client pacemaker-controld.1796707 | state=executing
Jul 11 15:30:28 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (finalize_op) error: Operation 'on' targeting node2 by unknown node for pacemaker-controld.1796707@node1: Error occurred (No fence device) | id=8e3f6d15
Jul 11 15:30:28 fastvm-rhel-8-0-23 pacemaker-controld [1796707] (tengine_stonith_callback) warning: Fence operation 24 for node2 failed: No fence device (aborting transition and giving up for now)
Jul 11 15:30:28 fastvm-rhel-8-0-23 pacemaker-controld [1796707] (abort_transition_graph) notice: Transition 42 aborted: Stonith failed | source=abort_for_stonith_failure:257 complete=false
Jul 11 15:30:28 fastvm-rhel-8-0-23 pacemaker-controld [1796707] (handle_fence_notification) error: Unfencing of node2 by the cluster failed (No fence device) with exit status 1

---

Version-Release number of selected component (if applicable):
pacemaker-2.1.2-4.el8_6.2.x86_64

---

How reproducible:
Always

---

Steps to Reproduce:
1. Configure a fence_dummy_auto_unfence device with mode=fail (let's call it "badfence") as level 1 and another one with mode=pass (call it "goodfence") as level 2 for a particular node.
2. Join that node to the cluster.

---

Actual results:
badfence fails its "on" action. goodfence passes its "on" action. The unfencing action is declared a failure.
---

Expected results:
badfence fails its "on" action, and the unfencing action is declared a failure without goodfence (level 2) being attempted. Alternatively, we could re-evaluate the design so that we proceed to level 2 and honor level 2's result, regardless of "automatic" unfencing.

---

Additional info:
I had said privately today that the behavior was reproducible with non-automatic devices (e.g., fence_kdump). I can't reproduce that now, so I was probably mistaken.
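For anyone recreating the comment 0 setup, here is a rough sketch of the pcs commands. It assumes fence_dummy_auto_unfence is a locally installed copy of Pacemaker's fence_dummy test agent whose metadata advertises automatic unfencing, and the exact pcs argument ordering may vary slightly between pcs versions.

# Sketch only: device options mirror the "pcs stonith config" output above.
pcs stonith create badfence fence_dummy_auto_unfence \
    mode=fail monitor_mode=pass pcmk_host_list="node1 node2" \
    op monitor interval=60s
pcs stonith create goodfence fence_dummy_auto_unfence \
    mode=pass monitor_mode=pass pcmk_host_list="node1 node2" \
    op monitor interval=60s meta provides=unfencing

# Put badfence alone at level 1 and goodfence alone at level 2 for node2.
pcs stonith level add 1 node2 badfence
pcs stonith level add 2 node2 goodfence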
Perhaps more importantly, the behavior is inconsistent. Let's look at two more scenarios. All three scenarios exhibit different behavior. It seems to me that they should behave the same, regardless of what we define as the correct behavior.

1. Below, I've added a working "xvm" fence device to the front of level 1. Now, xvm passes its "on" action and badfence fails its "on" action, but we move on to level 2 and honor level 2's successful result.

[root@fastvm-rhel-8-0-23 ~]# pcs stonith config
 Resource: goodfence (class=stonith type=fence_dummy)
  Attributes: mode=pass monitor_mode=pass pcmk_host_list="node1 node2"
  Meta Attrs: provides=unfencing
  Operations: monitor interval=60s (goodfence-monitor-interval-60s)
 Resource: badfence (class=stonith type=fence_dummy)
  Attributes: mode=fail monitor_mode=pass pcmk_host_list="node1 node2"
  Operations: monitor interval=60s (badfence-monitor-interval-60s)
 Resource: xvm (class=stonith type=fence_xvm)
  Attributes: pcmk_host_map=node1:fastvm-rhel-8.0-23;node2:fastvm-rhel-8.0-24
  Operations: monitor interval=60s (xvm-monitor-interval-60s)
 Target: node2
   Level 1 - xvm,badfence
   Level 2 - goodfence

Jul 11 15:45:00 fastvm-rhel-8-0-23 pacemaker-controld [1796707] (te_fence_node) notice: Requesting fencing (on) of node node2 | action=4 timeout=60000
Jul 11 15:45:00 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (handle_request) notice: Client pacemaker-controld.1796707 wants to fence (on) node2 using any device
Jul 11 15:45:00 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (initiate_remote_stonith_op) notice: Requesting peer fencing (on) targeting node2 | id=aaaea5b8 state=querying base_timeout=60
Jul 11 15:45:00 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (can_fence_host_with_device) notice: xvm is eligible to fence (on) node2 (aka. 'fastvm-rhel-8.0-24'): static-list
Jul 11 15:45:00 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (process_remote_stonith_query) info: Query result 1 of 2 from node1 for node2/on (1 device) aaaea5b8-d34a-457b-a821-73b2bd716038
Jul 11 15:45:00 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (process_remote_stonith_query) info: Query result 2 of 2 from node2 for node2/on (2 devices) aaaea5b8-d34a-457b-a821-73b2bd716038
Jul 11 15:45:00 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (request_peer_fencing) info: Total timeout set to 180 for peer's fencing targeting node2 for pacemaker-controld.1796707|id=aaaea5b8
Jul 11 15:45:00 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (request_peer_fencing) notice: Requesting that node1 perform 'on' action targeting node2 using xvm | for client pacemaker-controld.1796707 (72s)
Jul 11 15:45:00 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (log_async_result) notice: Operation 'on' [1797783] targeting node2 using xvm returned 0 | call 33 from pacemaker-controld.1796707
Jul 11 15:45:00 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (fenced_process_fencing_reply) notice: Action 'on' targeting node2 using xvm on behalf of pacemaker-controld.1796707@node1: complete
Jul 11 15:45:00 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (request_peer_fencing) notice: Requesting that node2 perform 'on' action targeting node2 using badfence | for client pacemaker-controld.1796707 (72s)
Jul 11 15:45:02 fastvm-rhel-8-0-24 pacemaker-fenced [142828] (log_async_result) error: Operation 'on' [142841] targeting node2 using badfence returned 1 | call 33 from pacemaker-controld.1796707
Jul 11 15:45:02 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (fenced_process_fencing_reply) notice: Action 'on' targeting node2 using badfence on behalf of pacemaker-controld.1796707@node1: complete
Jul 11 15:45:02 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (request_peer_fencing) notice: Requesting that node2 perform 'on' action targeting node2 using goodfence | for client pacemaker-controld.1796707 (72s)
Jul 11 15:45:02 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (fenced_process_fencing_reply) notice: Action 'on' targeting node2 using goodfence on behalf of pacemaker-controld.1796707@node1: complete
Jul 11 15:45:02 fastvm-rhel-8-0-24 pacemaker-fenced [142828] (log_async_result) notice: Operation 'on' [142843] targeting node2 using goodfence returned 0 | call 33 from pacemaker-controld.1796707
Jul 11 15:45:02 fastvm-rhel-8-0-24 pacemaker-fenced [142828] (finalize_op) notice: Operation 'on' targeting node2 by node2 for pacemaker-controld.1796707@node1: OK (complete) | id=aaaea5b8
Jul 11 15:45:02 fastvm-rhel-8-0-23 pacemaker-controld [1796707] (tengine_stonith_callback) notice: Fence operation 33 for node2 passed
Jul 11 15:45:02 fastvm-rhel-8-0-23 pacemaker-controld [1796707] (handle_fence_notification) notice: node2 was unfenced by node2 at the request of pacemaker-controld.1796707@node1

---

2. Finally, I've added to the beginning of **both** levels an xvm stonith device that will pass its "on" action. This time, we pass xvm in level 1, fail badfence in level 1, and then do not even attempt level 2. We short-circuit after level 1.
[root@fastvm-rhel-8-0-23 ~]# pcs stonith config
 Resource: goodfence (class=stonith type=fence_dummy)
  Attributes: mode=pass monitor_mode=pass pcmk_host_list="node1 node2"
  Meta Attrs: provides=unfencing
  Operations: monitor interval=60s (goodfence-monitor-interval-60s)
 Resource: badfence (class=stonith type=fence_dummy)
  Attributes: mode=fail monitor_mode=pass pcmk_host_list="node1 node2"
  Operations: monitor interval=60s (badfence-monitor-interval-60s)
 Resource: xvm (class=stonith type=fence_xvm)
  Attributes: pcmk_host_map=node1:fastvm-rhel-8.0-23;node2:fastvm-rhel-8.0-24
  Operations: monitor interval=60s (xvm-monitor-interval-60s)
 Target: node2
   Level 1 - xvm,badfence
   Level 2 - xvm,goodfence

Jul 11 15:51:38 fastvm-rhel-8-0-23 pacemaker-controld [1796707] (te_fence_node) notice: Requesting fencing (on) of node node2 | action=4 timeout=60000
Jul 11 15:51:38 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (handle_request) notice: Client pacemaker-controld.1796707 wants to fence (on) node2 using any device
Jul 11 15:51:38 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (initiate_remote_stonith_op) notice: Requesting peer fencing (on) targeting node2 | id=e735ed02 state=querying base_timeout=60
Jul 11 15:51:38 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (can_fence_host_with_device) notice: xvm is eligible to fence (on) node2 (aka. 'fastvm-rhel-8.0-24'): static-list
Jul 11 15:51:38 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (process_remote_stonith_query) info: Query result 1 of 2 from node1 for node2/on (1 device) e735ed02-0faf-4eed-b993-7afb1cc27d12
Jul 11 15:51:38 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (process_remote_stonith_query) info: Query result 2 of 2 from node2 for node2/on (2 devices) e735ed02-0faf-4eed-b993-7afb1cc27d12
Jul 11 15:51:38 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (request_peer_fencing) info: Total timeout set to 240 for peer's fencing targeting node2 for pacemaker-controld.1796707|id=e735ed02
Jul 11 15:51:38 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (request_peer_fencing) notice: Requesting that node1 perform 'on' action targeting node2 using xvm | for client pacemaker-controld.1796707 (72s)
Jul 11 15:51:38 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (log_async_result) notice: Operation 'on' [1797939] targeting node2 using xvm returned 0 | call 35 from pacemaker-controld.1796707
Jul 11 15:51:38 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (fenced_process_fencing_reply) notice: Action 'on' targeting node2 using xvm on behalf of pacemaker-controld.1796707@node1: complete
Jul 11 15:51:38 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (request_peer_fencing) notice: Requesting that node2 perform 'on' action targeting node2 using badfence | for client pacemaker-controld.1796707 (72s)
Jul 11 15:51:39 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (fenced_process_fencing_reply) notice: Action 'on' targeting node2 using badfence on behalf of pacemaker-controld.1796707@node1: complete
Jul 11 15:51:39 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (advance_topology_level) notice: All fencing options targeting node2 for client pacemaker-controld.1796707@node1 failed | id=e735ed02
Jul 11 15:51:39 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (stonith_choose_peer) notice: Couldn't find anyone to fence (on) node2 using xvm
Jul 11 15:51:39 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (request_peer_fencing) info: No peers (out of 2) are capable of fencing (on) node2 for client pacemaker-controld.1796707 | state=executing
Jul 11 15:51:39 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (finalize_op) error: Operation 'on' targeting node2 by unknown node for pacemaker-controld.1796707@node1: Error occurred (No fence device) | id=e735ed02
Jul 11 15:51:39 fastvm-rhel-8-0-23 pacemaker-controld [1796707] (tengine_stonith_callback) warning: Fence operation 35 for node2 failed: No fence device (aborting transition and giving up for now)
Jul 11 15:51:39 fastvm-rhel-8-0-23 pacemaker-controld [1796707] (abort_transition_graph) notice: Transition 78 aborted: Stonith failed | source=abort_for_stonith_failure:257 complete=false
Jul 11 15:51:39 fastvm-rhel-8-0-23 pacemaker-controld [1796707] (handle_fence_notification) error: Unfencing of node2 by the cluster failed (No fence device) with exit status 1
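For comparison, the only difference between the three scenarios is the topology level membership. A rough sketch of the corresponding pcs stonith level definitions follows (untested; older pcs versions may expect the device list comma-separated rather than space-separated, and levels would need to be cleared between runs, e.g. with pcs stonith level clear):

# Comment 0: badfence alone at level 1, goodfence alone at level 2.
pcs stonith level add 1 node2 badfence
pcs stonith level add 2 node2 goodfence

# Scenario 1 above: a working xvm device prepended to level 1 only.
pcs stonith level add 1 node2 xvm badfence
pcs stonith level add 2 node2 goodfence

# Scenario 2 above: the same xvm device prepended to both levels.
pcs stonith level add 1 node2 xvm badfence
pcs stonith level add 2 node2 xvm goodfence

With the first topology, both levels run and the operation is declared a failure; with the second, level 2 runs and its success is honored; with the third, the operation fails after level 1 without level 2 being attempted.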
(In reply to Reid Wahl from comment #1)

I didn't realize that I had switched the stonith devices back from fence_dummy_auto_unfence to fence_dummy. The issue in comment 0 is still valid. Comments on the others below.

> 1. Below, I've added a working "xvm" fence device to the front of level 1.
> Now, xvm passes its "on" action and badfence fails its "on" action, but we
> move on to level 2 and honor level 2's successful result.

This does not apply to auto_unfence, so it's behaving as designed.

> 2. Finally, I've added to the beginning of **both** levels an xvm stonith
> device that will pass its "on" action. This time, we pass xvm in level 1,
> fail badfence in level 1, and then do not even attempt level 2. We
> short-circuit after level 1.

I opened BZ2106182 for this. It happens without auto unfencing and IIRC also with auto unfencing.