Bug 2106170
| Summary: | Unfencing should fail immediately if one automatic unfencing device fails | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Reid Wahl <nwahl> |
| Component: | pacemaker | Assignee: | Ken Gaillot <kgaillot> |
| Status: | NEW --- | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | low | Docs Contact: | |
| Priority: | low | | |
| Version: | 8.6 | CC: | cluster-maint |
| Target Milestone: | rc | Keywords: | Triaged |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | Enhancement |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Reid Wahl
2022-07-11 22:40:41 UTC
Perhaps more importantly, the behavior is inconsistent. Let's look at two more scenarios. All three scenarios exhibit different behavior. It seems to me that they should behave the same, regardless of what we define as the correct behavior.

1. Below, I've added a working "xvm" fence device to the front of level 1. Now, xvm passes its "on" action and badfence fails its "on" action, but we move on to level 2 and honor level 2's successful result.

[root@fastvm-rhel-8-0-23 ~]# pcs stonith config
 Resource: goodfence (class=stonith type=fence_dummy)
  Attributes: mode=pass monitor_mode=pass pcmk_host_list="node1 node2"
  Meta Attrs: provides=unfencing
  Operations: monitor interval=60s (goodfence-monitor-interval-60s)
 Resource: badfence (class=stonith type=fence_dummy)
  Attributes: mode=fail monitor_mode=pass pcmk_host_list="node1 node2"
  Operations: monitor interval=60s (badfence-monitor-interval-60s)
 Resource: xvm (class=stonith type=fence_xvm)
  Attributes: pcmk_host_map=node1:fastvm-rhel-8.0-23;node2:fastvm-rhel-8.0-24
  Operations: monitor interval=60s (xvm-monitor-interval-60s)
 Target: node2
   Level 1 - xvm,badfence
   Level 2 - goodfence

Jul 11 15:45:00 fastvm-rhel-8-0-23 pacemaker-controld [1796707] (te_fence_node) notice: Requesting fencing (on) of node node2 | action=4 timeout=60000
Jul 11 15:45:00 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (handle_request) notice: Client pacemaker-controld.1796707 wants to fence (on) node2 using any device
Jul 11 15:45:00 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (initiate_remote_stonith_op) notice: Requesting peer fencing (on) targeting node2 | id=aaaea5b8 state=querying base_timeout=60
Jul 11 15:45:00 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (can_fence_host_with_device) notice: xvm is eligible to fence (on) node2 (aka. 'fastvm-rhel-8.0-24'): static-list
Jul 11 15:45:00 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (process_remote_stonith_query) info: Query result 1 of 2 from node1 for node2/on (1 device) aaaea5b8-d34a-457b-a821-73b2bd716038
Jul 11 15:45:00 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (process_remote_stonith_query) info: Query result 2 of 2 from node2 for node2/on (2 devices) aaaea5b8-d34a-457b-a821-73b2bd716038
Jul 11 15:45:00 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (request_peer_fencing) info: Total timeout set to 180 for peer's fencing targeting node2 for pacemaker-controld.1796707|id=aaaea5b8
Jul 11 15:45:00 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (request_peer_fencing) notice: Requesting that node1 perform 'on' action targeting node2 using xvm | for client pacemaker-controld.1796707 (72s)
Jul 11 15:45:00 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (log_async_result) notice: Operation 'on' [1797783] targeting node2 using xvm returned 0 | call 33 from pacemaker-controld.1796707
Jul 11 15:45:00 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (fenced_process_fencing_reply) notice: Action 'on' targeting node2 using xvm on behalf of pacemaker-controld.1796707@node1: complete
Jul 11 15:45:00 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (request_peer_fencing) notice: Requesting that node2 perform 'on' action targeting node2 using badfence | for client pacemaker-controld.1796707 (72s)
Jul 11 15:45:02 fastvm-rhel-8-0-24 pacemaker-fenced [142828] (log_async_result) error: Operation 'on' [142841] targeting node2 using badfence returned 1 | call 33 from pacemaker-controld.1796707
Jul 11 15:45:02 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (fenced_process_fencing_reply) notice: Action 'on' targeting node2 using badfence on behalf of pacemaker-controld.1796707@node1: complete
Jul 11 15:45:02 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (request_peer_fencing) notice: Requesting that node2 perform 'on' action targeting node2 using goodfence | for client pacemaker-controld.1796707 (72s)
Jul 11 15:45:02 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (fenced_process_fencing_reply) notice: Action 'on' targeting node2 using goodfence on behalf of pacemaker-controld.1796707@node1: complete
Jul 11 15:45:02 fastvm-rhel-8-0-24 pacemaker-fenced [142828] (log_async_result) notice: Operation 'on' [142843] targeting node2 using goodfence returned 0 | call 33 from pacemaker-controld.1796707
Jul 11 15:45:02 fastvm-rhel-8-0-24 pacemaker-fenced [142828] (finalize_op) notice: Operation 'on' targeting node2 by node2 for pacemaker-controld.1796707@node1: OK (complete) | id=aaaea5b8
Jul 11 15:45:02 fastvm-rhel-8-0-23 pacemaker-controld [1796707] (tengine_stonith_callback) notice: Fence operation 33 for node2 passed
Jul 11 15:45:02 fastvm-rhel-8-0-23 pacemaker-controld [1796707] (handle_fence_notification) notice: node2 was unfenced by node2 at the request of pacemaker-controld.1796707@node1

---

2. Finally, I've added to the beginning of **both** levels an xvm stonith device that will pass its "on" action. This time, we pass xvm in level 1, fail badfence in level 1, and then do not even attempt level 2. We short-circuit after level 1.
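A rough sketch of pcs commands that could produce a topology like the one below (reconstructed from the config dumps rather than copied from a shell session; fence_dummy is only a test agent, and the exact pcs syntax may vary by version):

```sh
# Hypothetical reproduction sketch; device names and options mirror the
# "pcs stonith config" output in this comment.
pcs stonith create goodfence fence_dummy mode=pass monitor_mode=pass \
    pcmk_host_list="node1 node2" meta provides=unfencing
pcs stonith create badfence fence_dummy mode=fail monitor_mode=pass \
    pcmk_host_list="node1 node2"
pcs stonith create xvm fence_xvm \
    pcmk_host_map="node1:fastvm-rhel-8.0-23;node2:fastvm-rhel-8.0-24"

# Topology for this scenario: level 1 = xvm,badfence; level 2 = xvm,goodfence
pcs stonith level add 1 node2 xvm badfence
pcs stonith level add 2 node2 xvm goodfence
```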
[root@fastvm-rhel-8-0-23 ~]# pcs stonith config
 Resource: goodfence (class=stonith type=fence_dummy)
  Attributes: mode=pass monitor_mode=pass pcmk_host_list="node1 node2"
  Meta Attrs: provides=unfencing
  Operations: monitor interval=60s (goodfence-monitor-interval-60s)
 Resource: badfence (class=stonith type=fence_dummy)
  Attributes: mode=fail monitor_mode=pass pcmk_host_list="node1 node2"
  Operations: monitor interval=60s (badfence-monitor-interval-60s)
 Resource: xvm (class=stonith type=fence_xvm)
  Attributes: pcmk_host_map=node1:fastvm-rhel-8.0-23;node2:fastvm-rhel-8.0-24
  Operations: monitor interval=60s (xvm-monitor-interval-60s)
 Target: node2
   Level 1 - xvm,badfence
   Level 2 - xvm,goodfence

Jul 11 15:51:38 fastvm-rhel-8-0-23 pacemaker-controld [1796707] (te_fence_node) notice: Requesting fencing (on) of node node2 | action=4 timeout=60000
Jul 11 15:51:38 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (handle_request) notice: Client pacemaker-controld.1796707 wants to fence (on) node2 using any device
Jul 11 15:51:38 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (initiate_remote_stonith_op) notice: Requesting peer fencing (on) targeting node2 | id=e735ed02 state=querying base_timeout=60
Jul 11 15:51:38 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (can_fence_host_with_device) notice: xvm is eligible to fence (on) node2 (aka. 'fastvm-rhel-8.0-24'): static-list
Jul 11 15:51:38 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (process_remote_stonith_query) info: Query result 1 of 2 from node1 for node2/on (1 device) e735ed02-0faf-4eed-b993-7afb1cc27d12
Jul 11 15:51:38 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (process_remote_stonith_query) info: Query result 2 of 2 from node2 for node2/on (2 devices) e735ed02-0faf-4eed-b993-7afb1cc27d12
Jul 11 15:51:38 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (request_peer_fencing) info: Total timeout set to 240 for peer's fencing targeting node2 for pacemaker-controld.1796707|id=e735ed02
Jul 11 15:51:38 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (request_peer_fencing) notice: Requesting that node1 perform 'on' action targeting node2 using xvm | for client pacemaker-controld.1796707 (72s)
Jul 11 15:51:38 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (log_async_result) notice: Operation 'on' [1797939] targeting node2 using xvm returned 0 | call 35 from pacemaker-controld.1796707
Jul 11 15:51:38 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (fenced_process_fencing_reply) notice: Action 'on' targeting node2 using xvm on behalf of pacemaker-controld.1796707@node1: complete
Jul 11 15:51:38 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (request_peer_fencing) notice: Requesting that node2 perform 'on' action targeting node2 using badfence | for client pacemaker-controld.1796707 (72s)
Jul 11 15:51:39 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (fenced_process_fencing_reply) notice: Action 'on' targeting node2 using badfence on behalf of pacemaker-controld.1796707@node1: complete
Jul 11 15:51:39 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (advance_topology_level) notice: All fencing options targeting node2 for client pacemaker-controld.1796707@node1 failed | id=e735ed02
Jul 11 15:51:39 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (stonith_choose_peer) notice: Couldn't find anyone to fence (on) node2 using xvm
Jul 11 15:51:39 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (request_peer_fencing) info: No peers (out of 2) are capable of fencing (on) node2 for client pacemaker-controld.1796707 | state=executing
Jul 11 15:51:39 fastvm-rhel-8-0-23 pacemaker-fenced [1796703] (finalize_op) error: Operation 'on' targeting node2 by unknown node for pacemaker-controld.1796707@node1: Error occurred (No fence device) | id=e735ed02
Jul 11 15:51:39 fastvm-rhel-8-0-23 pacemaker-controld [1796707] (tengine_stonith_callback) warning: Fence operation 35 for node2 failed: No fence device (aborting transition and giving up for now)
Jul 11 15:51:39 fastvm-rhel-8-0-23 pacemaker-controld [1796707] (abort_transition_graph) notice: Transition 78 aborted: Stonith failed | source=abort_for_stonith_failure:257 complete=false
Jul 11 15:51:39 fastvm-rhel-8-0-23 pacemaker-controld [1796707] (handle_fence_notification) error: Unfencing of node2 by the cluster failed (No fence device) with exit status 1

---

(In reply to Reid Wahl from comment #1)

I didn't realize that I had switched the stonith devices back from fence_dummy_auto_unfence to fence_dummy. The issue in comment 0 is still valid. Comments on the others below.

> 1. Below, I've added a working "xvm" fence device to the front of level 1.
> Now, xvm passes its "on" action and badfence fails its "on" action, but we
> move on to level 2 and honor level 2's successful result.

This does not apply to auto_unfence, so it's behaving as designed.

> 2. Finally, I've added to the beginning of **both** levels an xvm stonith
> device that will pass its "on" action. This time, we pass xvm in level 1,
> fail badfence in level 1, and then do not even attempt level 2. We
> short-circuit after level 1.

I opened BZ2106182 for this. It happens without auto unfencing and IIRC also with auto unfencing.
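For reference when retesting, the fencer's per-device results (which "on" actions ran and which failed) can usually be pulled from the fencing history in addition to the system logs; availability and output format depend on the pacemaker/pcs versions in use:

```sh
# Fencing/unfencing history recorded by pacemaker-fenced
stonith_admin --history node2    # actions targeting node2
stonith_admin --history '*'      # actions for all targets

# One-shot cluster status snapshot, which includes failed fencing actions
crm_mon --one-shot
```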