Bug 2106170 - Unfencing should fail immediately if one automatic unfencing device fails
Summary: Unfencing should fail immediately if one automatic unfencing device fails
Keywords:
Status: NEW
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: pacemaker
Version: 8.6
Hardware: All
OS: Linux
Priority: low
Severity: low
Target Milestone: rc
Assignee: Ken Gaillot
QA Contact: cluster-qe
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-07-11 22:40 UTC by Reid Wahl
Modified: 2023-08-10 15:40 UTC
CC List: 1 user

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Enhancement
Target Upstream Version:
Embargoed:


Links:
Red Hat Issue Tracker RHELPLAN-127436 (last updated 2022-07-11 22:44:22 UTC)

Description Reid Wahl 2022-07-11 22:40:41 UTC
Description of problem:

Suppose we have two fencing levels. One level contains a device that will fail its "on" action, and the other contains a device that will pass its "on" action. Both devices use automatic unfencing. All devices with automatic unfencing must succeed in order for the unfencing action to succeed, so we should short-circuit as soon as the first device/level fails. Instead, we run the second level anyway; it succeeds, and only then do we declare the unfencing action a failure.

[root@fastvm-rhel-8-0-23 ~]# pcs stonith config
 Resource: goodfence (class=stonith type=fence_dummy_auto_unfence)
  Attributes: mode=pass monitor_mode=pass pcmk_host_list="node1 node2"
  Meta Attrs: provides=unfencing
  Operations: monitor interval=60s (goodfence-monitor-interval-60s)
 Resource: badfence (class=stonith type=fence_dummy_auto_unfence)
  Attributes: mode=fail monitor_mode=pass pcmk_host_list="node1 node2"
  Operations: monitor interval=60s (badfence-monitor-interval-60s)
 Target: node2
   Level 1 - badfence
   Level 2 - goodfence


[root@fastvm-rhel-8-0-23 ~]# pcs cluster start node2 & tail -f /var/log/pacemaker/pacemaker.log 
...
Jul 11 15:30:27 fastvm-rhel-8-0-23 pacemaker-controld  [1796707] (te_fence_node) 	notice: Requesting fencing (on) of node node2 | action=3 timeout=60000
Jul 11 15:30:27 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (handle_request) 	notice: Client pacemaker-controld.1796707 wants to fence (on) node2 using any device
Jul 11 15:30:27 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (initiate_remote_stonith_op) 	notice: Requesting peer fencing (on) targeting node2 | id=8e3f6d15 state=querying base_timeout=60
Jul 11 15:30:27 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (can_fence_host_with_device) 	notice: goodfence is eligible to fence (on) node2: static-list
Jul 11 15:30:27 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (can_fence_host_with_device) 	notice: badfence is eligible to fence (on) node2: static-list
Jul 11 15:30:27 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (process_remote_stonith_query) 	info: Query result 1 of 2 from node1 for node2/on (2 devices) 8e3f6d15-531b-4245-843a-148dd57f7282
Jul 11 15:30:27 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (request_peer_fencing) 	info: Total timeout set to 120 for peer's fencing targeting node2 for pacemaker-controld.1796707|id=8e3f6d15
Jul 11 15:30:27 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (request_peer_fencing) 	notice: Requesting that node1 perform 'on' action targeting node2 using badfence | for client pacemaker-controld.1796707 (72s)
Jul 11 15:30:27 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (make_args) 	info: Passing '2' as nodeid with fence action 'on' targeting node2
Jul 11 15:30:27 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (process_remote_stonith_query) 	info: Query result 2 of 2 from node2 for node2/on (0 devices) 8e3f6d15-531b-4245-843a-148dd57f7282
Jul 11 15:30:27 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (log_op_output) 	info: fence_dummy_auto_unfence_on_1[1797364] error output [ simulated on failure ]
Jul 11 15:30:27 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (log_action) 	warning: fence_dummy_auto_unfence[1797364] stderr: [ simulated on failure ]
Jul 11 15:30:27 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (internal_stonith_action_execute) 	info: Attempt 2 to execute fence_dummy_auto_unfence (on). remaining timeout is 60
Jul 11 15:30:28 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (log_op_output) 	info: fence_dummy_auto_unfence_on_2[1797367] error output [ simulated on failure ]
Jul 11 15:30:28 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (log_action) 	warning: fence_dummy_auto_unfence[1797367] stderr: [ simulated on failure ]
Jul 11 15:30:28 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (update_remaining_timeout) 	info: Attempted to execute agent fence_dummy_auto_unfence (on) the maximum number of times (2) allowed
Jul 11 15:30:28 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (log_async_result) 	error: Operation 'on' [1797367] targeting node2 using badfence returned 1 | call 24 from pacemaker-controld.1796707
Jul 11 15:30:28 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (fenced_process_fencing_reply) 	notice: Action 'on' targeting node2 using badfence on behalf of pacemaker-controld.1796707@node1: complete
Jul 11 15:30:28 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (request_peer_fencing) 	notice: Requesting that node1 perform 'on' action targeting node2 using goodfence | for client pacemaker-controld.1796707 (72s)
Jul 11 15:30:28 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (make_args) 	info: Passing '2' as nodeid with fence action 'on' targeting node2
Jul 11 15:30:28 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (log_async_result) 	notice: Operation 'on' [1797369] targeting node2 using goodfence returned 0 | call 24 from pacemaker-controld.1796707
Jul 11 15:30:28 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (fenced_process_fencing_reply) 	notice: Action 'on' targeting node2 using goodfence on behalf of pacemaker-controld.1796707@node1: complete
Jul 11 15:30:28 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (advance_topology_level) 	notice: All fencing options targeting node2 for client pacemaker-controld.1796707@node1 failed | id=8e3f6d15
Jul 11 15:30:28 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (stonith_choose_peer) 	notice: Couldn't find anyone to fence (on) node2 using badfence
Jul 11 15:30:28 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (request_peer_fencing) 	info: No peers (out of 2) are capable of fencing (on) node2 for client pacemaker-controld.1796707 | state=executing
Jul 11 15:30:28 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (finalize_op) 	error: Operation 'on' targeting node2 by unknown node for pacemaker-controld.1796707@node1: Error occurred (No fence device) | id=8e3f6d15
Jul 11 15:30:28 fastvm-rhel-8-0-23 pacemaker-controld  [1796707] (tengine_stonith_callback) 	warning: Fence operation 24 for node2 failed: No fence device (aborting transition and giving up for now)
Jul 11 15:30:28 fastvm-rhel-8-0-23 pacemaker-controld  [1796707] (abort_transition_graph) 	notice: Transition 42 aborted: Stonith failed | source=abort_for_stonith_failure:257 complete=false
Jul 11 15:30:28 fastvm-rhel-8-0-23 pacemaker-controld  [1796707] (handle_fence_notification) 	error: Unfencing of node2 by the cluster failed (No fence device) with exit status 1

---

Version-Release number of selected component (if applicable):

pacemaker-2.1.2-4.el8_6.2.x86_64

---

How reproducible:

Always

---

Steps to Reproduce:
1. Configure a fence_dummy_auto_unfence device with mode=fail (call it "badfence") at level 1 and another with mode=pass (call it "goodfence") at level 2 for a particular node (a possible pcs command sequence is sketched after these steps).
2. Join that node to the cluster.
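
For reference, a possible pcs command sequence for these steps. This is only a sketch of how the configuration shown in the description could be created, not taken verbatim from the test environment; fence_dummy_auto_unfence is a local test agent rather than a packaged one, so its name and options here are assumptions.

# Create the two dummy unfencing devices (mirroring the config in the description)
pcs stonith create badfence fence_dummy_auto_unfence \
    mode=fail monitor_mode=pass pcmk_host_list="node1 node2" \
    op monitor interval=60s
pcs stonith create goodfence fence_dummy_auto_unfence \
    mode=pass monitor_mode=pass pcmk_host_list="node1 node2" \
    op monitor interval=60s meta provides=unfencing

# Register the fencing topology for the target node:
# badfence at level 1, goodfence at level 2
pcs stonith level add 1 node2 badfence
pcs stonith level add 2 node2 goodfence

# Step 2: join the target node and watch the unfencing attempt
pcs cluster start node2 & tail -f /var/log/pacemaker/pacemaker.log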

---

Actual results:

badfence fails its "on" action. goodfence passes its "on" action. The unfencing action is declared a failure.

---

Expected results:

badfence fails its "on" action, and the unfencing action is declared a failure immediately, without attempting level 2. Alternatively, we could re-evaluate the design so that we proceed to level 2 and honor its result, regardless of "automatic" unfencing.

---

Additional info:

I had said privately today that the behavior was reproducible with non-automatic devices (e.g., fence_kdump). I can't reproduce that now, so I was probably mistaken.
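
Since fence_dummy_auto_unfence is not attached to this bug, here is a minimal sketch of what such a test agent might look like. It assumes the usual convention where pacemaker-fenced treats a device as automatic unfencing when the agent's metadata advertises the "on" action with automatic="1"; the real agent used above may differ.

#!/bin/sh
# Hypothetical sketch only -- not the actual fence_dummy_auto_unfence used in
# this report. Fence agents are passed key=value pairs on stdin by the fencer.
action="" mode="pass" monitor_mode="pass"
while IFS='=' read -r key value; do
    case "$key" in
        action)       action="$value" ;;
        mode)         mode="$value" ;;
        monitor_mode) monitor_mode="$value" ;;
    esac
done

case "$action" in
    metadata)
        # By convention (an assumption here), automatic="1" on the "on" action
        # is what makes pacemaker-fenced treat the device as automatic unfencing.
        cat <<'EOF'
<resource-agent name="fence_dummy_auto_unfence" shortdesc="dummy fencing agent with automatic unfencing">
  <parameters>
    <parameter name="mode"><content type="string" default="pass"/></parameter>
    <parameter name="monitor_mode"><content type="string" default="pass"/></parameter>
  </parameters>
  <actions>
    <action name="on" automatic="1"/>
    <action name="off"/>
    <action name="monitor"/>
    <action name="metadata"/>
  </actions>
</resource-agent>
EOF
        exit 0
        ;;
    on|off|reboot)
        # mode=pass succeeds; mode=fail simulates the failure seen in the logs
        [ "$mode" = "pass" ] && exit 0
        echo "simulated $action failure" >&2
        exit 1
        ;;
    monitor|status)
        [ "$monitor_mode" = "pass" ] && exit 0
        exit 1
        ;;
    *)
        exit 1
        ;;
esac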

Comment 1 Reid Wahl 2022-07-11 22:53:10 UTC
Perhaps more importantly, the behavior is inconsistent. Consider two more scenarios below. All three scenarios exhibit different behavior, and it seems to me that they should behave the same, regardless of which behavior we define as correct.

1. Below, I've added a working "xvm" fence device to the front of level 1. Now, xvm passes its "on" action and badfence fails its "on" action, but we move on to level 2 and honor level 2's successful result.

[root@fastvm-rhel-8-0-23 ~]# pcs stonith config
 Resource: goodfence (class=stonith type=fence_dummy)
  Attributes: mode=pass monitor_mode=pass pcmk_host_list="node1 node2"
  Meta Attrs: provides=unfencing
  Operations: monitor interval=60s (goodfence-monitor-interval-60s)
 Resource: badfence (class=stonith type=fence_dummy)
  Attributes: mode=fail monitor_mode=pass pcmk_host_list="node1 node2"
  Operations: monitor interval=60s (badfence-monitor-interval-60s)
 Resource: xvm (class=stonith type=fence_xvm)
  Attributes: pcmk_host_map=node1:fastvm-rhel-8.0-23;node2:fastvm-rhel-8.0-24
  Operations: monitor interval=60s (xvm-monitor-interval-60s)
 Target: node2
   Level 1 - xvm,badfence
   Level 2 - goodfence

Jul 11 15:45:00 fastvm-rhel-8-0-23 pacemaker-controld  [1796707] (te_fence_node) 	notice: Requesting fencing (on) of node node2 | action=4 timeout=60000
Jul 11 15:45:00 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (handle_request) 	notice: Client pacemaker-controld.1796707 wants to fence (on) node2 using any device
Jul 11 15:45:00 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (initiate_remote_stonith_op) 	notice: Requesting peer fencing (on) targeting node2 | id=aaaea5b8 state=querying base_timeout=60
Jul 11 15:45:00 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (can_fence_host_with_device) 	notice: xvm is eligible to fence (on) node2 (aka. 'fastvm-rhel-8.0-24'): static-list
Jul 11 15:45:00 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (process_remote_stonith_query) 	info: Query result 1 of 2 from node1 for node2/on (1 device) aaaea5b8-d34a-457b-a821-73b2bd716038
Jul 11 15:45:00 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (process_remote_stonith_query) 	info: Query result 2 of 2 from node2 for node2/on (2 devices) aaaea5b8-d34a-457b-a821-73b2bd716038
Jul 11 15:45:00 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (request_peer_fencing) 	info: Total timeout set to 180 for peer's fencing targeting node2 for pacemaker-controld.1796707|id=aaaea5b8
Jul 11 15:45:00 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (request_peer_fencing) 	notice: Requesting that node1 perform 'on' action targeting node2 using xvm | for client pacemaker-controld.1796707 (72s)
Jul 11 15:45:00 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (log_async_result) 	notice: Operation 'on' [1797783] targeting node2 using xvm returned 0 | call 33 from pacemaker-controld.1796707
Jul 11 15:45:00 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (fenced_process_fencing_reply) 	notice: Action 'on' targeting node2 using xvm on behalf of pacemaker-controld.1796707@node1: complete
Jul 11 15:45:00 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (request_peer_fencing) 	notice: Requesting that node2 perform 'on' action targeting node2 using badfence | for client pacemaker-controld.1796707 (72s)
Jul 11 15:45:02 fastvm-rhel-8-0-24 pacemaker-fenced    [142828] (log_async_result)  error: Operation 'on' [142841] targeting node2 using badfence returned 1 | call 33 from pacemaker-controld.1796707
Jul 11 15:45:02 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (fenced_process_fencing_reply) 	notice: Action 'on' targeting node2 using badfence on behalf of pacemaker-controld.1796707@node1: complete
Jul 11 15:45:02 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (request_peer_fencing) 	notice: Requesting that node2 perform 'on' action targeting node2 using goodfence | for client pacemaker-controld.1796707 (72s)
Jul 11 15:45:02 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (fenced_process_fencing_reply) 	notice: Action 'on' targeting node2 using goodfence on behalf of pacemaker-controld.1796707@node1: complete
Jul 11 15:45:02 fastvm-rhel-8-0-24 pacemaker-fenced    [142828] (log_async_result)  notice: Operation 'on' [142843] targeting node2 using goodfence returned 0 | call 33 from pacemaker-controld.1796707
Jul 11 15:45:02 fastvm-rhel-8-0-24 pacemaker-fenced    [142828] (finalize_op)   notice: Operation 'on' targeting node2 by node2 for pacemaker-controld.1796707@node1: OK (complete) | id=aaaea5b8
Jul 11 15:45:02 fastvm-rhel-8-0-23 pacemaker-controld  [1796707] (tengine_stonith_callback) 	notice: Fence operation 33 for node2 passed
Jul 11 15:45:02 fastvm-rhel-8-0-23 pacemaker-controld  [1796707] (handle_fence_notification) 	notice: node2 was unfenced by node2 at the request of pacemaker-controld.1796707@node1

---

2. Finally, I've added to the beginning of **both** levels an xvm stonith device that will pass its "on" action. This time, we pass xvm in level 1, fail badfence in level 1, and then do not even attempt level 2. We short-circuit after level 1.

[root@fastvm-rhel-8-0-23 ~]# pcs stonith config
 Resource: goodfence (class=stonith type=fence_dummy)
  Attributes: mode=pass monitor_mode=pass pcmk_host_list="node1 node2"
  Meta Attrs: provides=unfencing
  Operations: monitor interval=60s (goodfence-monitor-interval-60s)
 Resource: badfence (class=stonith type=fence_dummy)
  Attributes: mode=fail monitor_mode=pass pcmk_host_list="node1 node2"
  Operations: monitor interval=60s (badfence-monitor-interval-60s)
 Resource: xvm (class=stonith type=fence_xvm)
  Attributes: pcmk_host_map=node1:fastvm-rhel-8.0-23;node2:fastvm-rhel-8.0-24
  Operations: monitor interval=60s (xvm-monitor-interval-60s)
 Target: node2
   Level 1 - xvm,badfence
   Level 2 - xvm,goodfence


Jul 11 15:51:38 fastvm-rhel-8-0-23 pacemaker-controld  [1796707] (te_fence_node) 	notice: Requesting fencing (on) of node node2 | action=4 timeout=60000
Jul 11 15:51:38 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (handle_request) 	notice: Client pacemaker-controld.1796707 wants to fence (on) node2 using any device
Jul 11 15:51:38 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (initiate_remote_stonith_op) 	notice: Requesting peer fencing (on) targeting node2 | id=e735ed02 state=querying base_timeout=60
Jul 11 15:51:38 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (can_fence_host_with_device) 	notice: xvm is eligible to fence (on) node2 (aka. 'fastvm-rhel-8.0-24'): static-list
Jul 11 15:51:38 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (process_remote_stonith_query) 	info: Query result 1 of 2 from node1 for node2/on (1 device) e735ed02-0faf-4eed-b993-7afb1cc27d12
Jul 11 15:51:38 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (process_remote_stonith_query) 	info: Query result 2 of 2 from node2 for node2/on (2 devices) e735ed02-0faf-4eed-b993-7afb1cc27d12
Jul 11 15:51:38 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (request_peer_fencing) 	info: Total timeout set to 240 for peer's fencing targeting node2 for pacemaker-controld.1796707|id=e735ed02
Jul 11 15:51:38 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (request_peer_fencing) 	notice: Requesting that node1 perform 'on' action targeting node2 using xvm | for client pacemaker-controld.1796707 (72s)
Jul 11 15:51:38 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (log_async_result) 	notice: Operation 'on' [1797939] targeting node2 using xvm returned 0 | call 35 from pacemaker-controld.1796707
Jul 11 15:51:38 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (fenced_process_fencing_reply) 	notice: Action 'on' targeting node2 using xvm on behalf of pacemaker-controld.1796707@node1: complete
Jul 11 15:51:38 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (request_peer_fencing) 	notice: Requesting that node2 perform 'on' action targeting node2 using badfence | for client pacemaker-controld.1796707 (72s)
Jul 11 15:51:39 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (fenced_process_fencing_reply) 	notice: Action 'on' targeting node2 using badfence on behalf of pacemaker-controld.1796707@node1: complete
Jul 11 15:51:39 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (advance_topology_level) 	notice: All fencing options targeting node2 for client pacemaker-controld.1796707@node1 failed | id=e735ed02
Jul 11 15:51:39 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (stonith_choose_peer) 	notice: Couldn't find anyone to fence (on) node2 using xvm
Jul 11 15:51:39 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (request_peer_fencing) 	info: No peers (out of 2) are capable of fencing (on) node2 for client pacemaker-controld.1796707 | state=executing
Jul 11 15:51:39 fastvm-rhel-8-0-23 pacemaker-fenced    [1796703] (finalize_op) 	error: Operation 'on' targeting node2 by unknown node for pacemaker-controld.1796707@node1: Error occurred (No fence device) | id=e735ed02
Jul 11 15:51:39 fastvm-rhel-8-0-23 pacemaker-controld  [1796707] (tengine_stonith_callback) 	warning: Fence operation 35 for node2 failed: No fence device (aborting transition and giving up for now)
Jul 11 15:51:39 fastvm-rhel-8-0-23 pacemaker-controld  [1796707] (abort_transition_graph) 	notice: Transition 78 aborted: Stonith failed | source=abort_for_stonith_failure:257 complete=false
Jul 11 15:51:39 fastvm-rhel-8-0-23 pacemaker-controld  [1796707] (handle_fence_notification) 	error: Unfencing of node2 by the cluster failed (No fence device) with exit status 1

Comment 2 Reid Wahl 2022-07-12 04:03:24 UTC
(In reply to Reid Wahl from comment #1)
I didn't realize that I had switched the stonith devices back from fence_dummy_auto_unfence to fence_dummy. The issue in comment 0 is still valid. Comments on the other two scenarios are below.

> 1. Below, I've added a working "xvm" fence device to the front of level 1.
> Now, xvm passes its "on" action and badfence fails its "on" action, but we
> move on to level 2 and honor level 2's successful result.

Since these were plain fence_dummy devices, automatic unfencing does not apply here, so this is behaving as designed.
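
For anyone retracing this, one way to check whether a given agent's "on" action is flagged as automatic is to dump its metadata as the fencer sees it. The attribute names below assume the usual automatic="1" (or legacy required="1") convention.

# Compare the "on" action advertised by the two agents; an automatic="1"
# (or legacy required="1") attribute marks it as automatic unfencing,
# while plain fence_dummy or fence_kdump normally do not carry it.
stonith_admin --metadata --agent fence_dummy_auto_unfence | grep -A1 '<action name="on"'
stonith_admin --metadata --agent fence_dummy | grep -A1 '<action name="on"'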

 
> 2. Finally, I've added to the beginning of **both** levels an xvm stonith
> device that will pass its "on" action. This time, we pass xvm in level 1,
> fail badfence in level 1, and then do not even attempt level 2. We
> short-circuit after level 1.

I opened BZ2106182 for this. It happens without auto unfencing and IIRC also with auto unfencing.

