Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2106182

Summary: Fenced node is left offline if level 1 of a topology fails after powering off the node
Product: Red Hat Enterprise Linux 8
Reporter: Reid Wahl <nwahl>
Component: pacemaker
Assignee: Ken Gaillot <kgaillot>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: cluster-qe <cluster-qe>
Severity: medium
Docs Contact:
Priority: medium
Version: 8.6
CC: cluster-maint
Target Milestone: rc
Keywords: Triaged
Target Release: ---
Flags: pm-rhel: mirror+
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-04-13 20:18:13 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Reid Wahl 2022-07-12 01:46:51 UTC
Description of problem:

Suppose we have the following stonith topology, such that xvm will succeed and badfence will fail. (Note: The behavior is identical if I put a separate fence_xvm device in level 2.) Similar behavior happens during unfencing, if that's required.

[root@fastvm-rhel-8-0-23 ~]# pcs stonith config
 Resource: xvm (class=stonith type=fence_xvm)
  Attributes: pcmk_host_map=node2:fastvm-rhel-8.0-24
  Operations: monitor interval=60s (xvm-monitor-interval-60s)
 Resource: badfence (class=stonith type=fence_dummy)
  Attributes: mode=fail monitor_mode=pass pcmk_host_list=node2
  Operations: monitor interval=60s (badfence-monitor-interval-60s)
 Target: node2
   Level 1 - xvm,badfence
   Level 2 - xvm
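
For reference, a topology like the one above can be built with pcs roughly as follows. This is a sketch using the device names and host values from this report; exact pcs syntax differs slightly between versions (e.g. comma- vs. space-separated device lists in `pcs stonith level add`):

```shell
# Working fence device for node2 (fence_xvm talks to the hypervisor).
pcs stonith create xvm fence_xvm \
    pcmk_host_map="node2:fastvm-rhel-8.0-24" \
    op monitor interval=60s

# Deliberately failing device: fence_dummy with mode=fail fails its
# off/reboot actions, while monitor_mode=pass keeps its monitor green.
pcs stonith create badfence fence_dummy \
    mode=fail monitor_mode=pass pcmk_host_list=node2 \
    op monitor interval=60s

# Level 1 requires BOTH devices to succeed; level 2 is xvm alone.
pcs stonith level add 1 node2 xvm,badfence
pcs stonith level add 2 node2 xvm
```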


When we then fence node2 using `pcs stonith fence`, the node is left offline. First xvm powered off the node. Then, while badfence was unsuccessfully retrying its "off" action, the controller submitted a new reboot action that got merged with the in-progress one. Shortly thereafter badfence gave up, and fencing moved on to level 2 (xvm again). The "off" action passed, and then the whole reboot action was declared a success without the "on" action ever being performed.

Jul 11 18:31:59 fastvm-rhel-8-0-23 pacemaker-fenced    [1867100] (handle_request) 	notice: Client stonith_admin.1867754 wants to fence (reboot) node2 using any device
Jul 11 18:31:59 fastvm-rhel-8-0-23 pacemaker-fenced    [1867100] (op_phase_off) 	info: Remapping multiple-device reboot targeting node2 to 'off' | id=14c5fe89
Jul 11 18:31:59 fastvm-rhel-8-0-23 pacemaker-fenced    [1867100] (initiate_remote_stonith_op) 	notice: Requesting peer fencing (off) targeting node2 | id=14c5fe89 state=querying base_timeout=120
Jul 11 18:31:59 fastvm-rhel-8-0-23 pacemaker-fenced    [1867100] (can_fence_host_with_device) 	notice: badfence is eligible to fence (reboot) node2: static-list
Jul 11 18:31:59 fastvm-rhel-8-0-23 pacemaker-fenced    [1867100] (can_fence_host_with_device) 	notice: xvm is eligible to fence (reboot) node2 (aka. 'fastvm-rhel-8.0-24'): static-list
Jul 11 18:31:59 fastvm-rhel-8-0-23 pacemaker-fenced    [1867100] (process_remote_stonith_query) 	info: Query result 1 of 2 from node1 for node2/off (2 devices) 14c5fe89-7ee5-49f9-a573-d50675c20d8b
Jul 11 18:31:59 fastvm-rhel-8-0-23 pacemaker-fenced    [1867100] (request_peer_fencing) 	info: Total timeout set to 360 for peer's fencing targeting node2 for stonith_admin.1867754|id=14c5fe89
Jul 11 18:31:59 fastvm-rhel-8-0-23 pacemaker-fenced    [1867100] (request_peer_fencing) 	notice: Requesting that node1 perform 'off' action targeting node2 using xvm | for client stonith_admin.1867754 (144s)
Jul 11 18:31:59 fastvm-rhel-8-0-23 pacemaker-fenced    [1867100] (process_remote_stonith_query) 	info: Query result 2 of 2 from node2 for node2/off (2 devices) 14c5fe89-7ee5-49f9-a573-d50675c20d8b
Jul 11 18:31:59 fastvm-rhel-8-0-23 pacemaker-fenced    [1867100] (log_async_result) 	notice: Operation 'off' [1867755] targeting node2 using xvm returned 0 | call 2 from stonith_admin.1867754
Jul 11 18:32:06 fastvm-rhel-8-0-23 pacemaker-based     [1867099] (node_left) 	info: Group cib event 4: node2 (node 2 pid 1971) left via cluster exit
Jul 11 18:32:06 fastvm-rhel-8-0-23 pacemaker-fenced    [1867100] (fenced_process_fencing_reply) 	notice: Action 'off' targeting node2 using xvm on behalf of stonith_admin.1867754@node1: complete
Jul 11 18:32:06 fastvm-rhel-8-0-23 pacemaker-fenced    [1867100] (request_peer_fencing) 	notice: Requesting that node1 perform 'off' action targeting node2 using badfence | for client stonith_admin.1867754 (144s)
Jul 11 18:32:06 fastvm-rhel-8-0-23 pacemaker-fenced    [1867100] (internal_stonith_action_execute) 	info: Attempt 2 to execute fence_dummy (off). remaining timeout is 120
...
Jul 11 18:32:07 fastvm-rhel-8-0-23 pacemaker-schedulerd[1867103] (pe_fence_node) 	warning: Cluster node node2 will be fenced: peer is no longer part of the cluster
Jul 11 18:32:07 fastvm-rhel-8-0-23 pacemaker-schedulerd[1867103] (determine_online_status) 	warning: Node node2 is unclean
Jul 11 18:32:07 fastvm-rhel-8-0-23 pacemaker-schedulerd[1867103] (stage6) 	warning: Scheduling Node node2 for STONITH
...
Jul 11 18:32:07 fastvm-rhel-8-0-23 pacemaker-controld  [1867104] (te_fence_node) 	notice: Requesting fencing (reboot) of node node2 | action=1 timeout=60000
Jul 11 18:32:07 fastvm-rhel-8-0-23 pacemaker-fenced    [1867100] (handle_request) 	notice: Client pacemaker-controld.1867104 wants to fence (reboot) node2 using any device
Jul 11 18:32:07 fastvm-rhel-8-0-23 pacemaker-fenced    [1867100] (merge_duplicates) 	notice: Merging fencing action 'reboot' targeting node2 originating from client pacemaker-controld.1867104 with identical request from stonith_admin.1867754@node1 | original=36862748 duplicate=14c5fe89 total_timeout=432s
Jul 11 18:32:07 fastvm-rhel-8-0-23 pacemaker-fenced    [1867100] (op_phase_off) 	info: Remapping multiple-device reboot targeting node2 to 'off' | id=36862748
Jul 11 18:32:07 fastvm-rhel-8-0-23 pacemaker-fenced    [1867100] (initiate_remote_stonith_op) 	info: Requesting peer fencing (off) targeting node2 (duplicate) | id=36862748
Jul 11 18:32:07 fastvm-rhel-8-0-23 pacemaker-fenced    [1867100] (log_op_output) 	info: fence_dummy_off_2[1867758] error output [ simulated off failure ]
Jul 11 18:32:07 fastvm-rhel-8-0-23 pacemaker-fenced    [1867100] (log_action) 	warning: fence_dummy[1867758] stderr: [ simulated off failure ]
Jul 11 18:32:07 fastvm-rhel-8-0-23 pacemaker-fenced    [1867100] (update_remaining_timeout) 	info: Attempted to execute agent fence_dummy (off) the maximum number of times (2) allowed
Jul 11 18:32:07 fastvm-rhel-8-0-23 pacemaker-fenced    [1867100] (log_async_result) 	error: Operation 'off' [1867758] targeting node2 using badfence returned 1 | call 2 from stonith_admin.1867754
Jul 11 18:32:07 fastvm-rhel-8-0-23 pacemaker-fenced    [1867100] (fenced_process_fencing_reply) 	notice: Action 'off' targeting node2 using badfence on behalf of stonith_admin.1867754@node1: complete
Jul 11 18:32:07 fastvm-rhel-8-0-23 pacemaker-fenced    [1867100] (undo_op_remap) 	info: Undoing remap of reboot targeting node2 for stonith_admin.1867754 | id=14c5fe89
Jul 11 18:32:07 fastvm-rhel-8-0-23 pacemaker-fenced    [1867100] (request_peer_fencing) 	notice: Requesting that node1 perform 'reboot' action targeting node2 using xvm | for client stonith_admin.1867754 (144s)
Jul 11 18:32:07 fastvm-rhel-8-0-23 pacemaker-fenced    [1867100] (log_async_result) 	notice: Operation 'reboot' [1867760] targeting node2 using xvm returned 0 | call 2 from stonith_admin.1867754
Jul 11 18:32:07 fastvm-rhel-8-0-23 pacemaker-fenced    [1867100] (fenced_process_fencing_reply) 	notice: Action 'reboot' targeting node2 using xvm on behalf of stonith_admin.1867754@node1: complete
Jul 11 18:32:07 fastvm-rhel-8-0-23 pacemaker-fenced    [1867100] (finalize_op) 	notice: Operation 'reboot' targeting node2 by node1 for stonith_admin.1867754@node1: OK (complete) | id=14c5fe89
Jul 11 18:32:07 fastvm-rhel-8-0-23 pacemaker-fenced    [1867100] (undo_op_remap) 	info: Undoing remap of reboot targeting node2 for pacemaker-controld.1867104 | id=36862748
Jul 11 18:32:07 fastvm-rhel-8-0-23 pacemaker-fenced    [1867100] (finalize_op) 	notice: Operation 'reboot' targeting node2 by node1 for pacemaker-controld.1867104@node1 (merged): OK (complete) | id=36862748
Jul 11 18:32:07 fastvm-rhel-8-0-23 pacemaker-controld  [1867104] (handle_fence_notification) 	notice: Peer node2 was terminated (reboot) by node1 on behalf of stonith_admin.1867754@node1: OK | event=14c5fe89-7ee5-49f9-a573-d50675c20d8b

-----

Version-Release number of selected component (if applicable):

pacemaker-2.1.2-4.el8_6.2.x86_64

-----

How reproducible:

Easily

-----

Steps to Reproduce:
1. Configure a stonith topology as shown in the description, and use `pcs stonith fence` or stonith_admin to fence node2.
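
With the topology in place, the fencing request can be issued either through pcs or directly with stonith_admin (the latter matches the `stonith_admin.1867754` client seen in the logs); these commands assume a running cluster:

```shell
# Trigger the reboot, which pacemaker-fenced remaps to off + on
# because level 1 contains multiple devices.
pcs stonith fence node2

# Equivalent low-level request:
stonith_admin --reboot node2

# Afterwards, check membership; with this bug the node stays
# powered off instead of coming back online.
pcs status nodes
```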

-----

Actual results:

node2 is left offline. The reboot is declared complete after the "off" action of level 2's xvm device. The "on" action is not run.

-----

Expected results:

The "off" action and "on" action both complete before the reboot action is declared complete.

Comment 1 Ken Gaillot 2022-08-01 22:07:22 UTC
(In reply to Reid Wahl from comment #0)
<snip>
> (log_async_result) 	error: Operation 'off' [1867758] targeting node2 using
> badfence returned 1 | call 2 from stonith_admin.1867754
> Jul 11 18:32:07 fastvm-rhel-8-0-23 pacemaker-fenced    [1867100]
> (fenced_process_fencing_reply) 	notice: Action 'off' targeting node2 using
> badfence on behalf of stonith_admin.1867754@node1: complete
> Jul 11 18:32:07 fastvm-rhel-8-0-23 pacemaker-fenced    [1867100]
> (undo_op_remap) 	info: Undoing remap of reboot targeting node2 for
> stonith_admin.1867754 | id=14c5fe89
> Jul 11 18:32:07 fastvm-rhel-8-0-23 pacemaker-fenced    [1867100]
> (request_peer_fencing) 	notice: Requesting that node1 perform 'reboot'
> action targeting node2 using xvm | for client stonith_admin.1867754 (144s)
> Jul 11 18:32:07 fastvm-rhel-8-0-23 pacemaker-fenced    [1867100]
> (log_async_result) 	notice: Operation 'reboot' [1867760] targeting node2
> using xvm returned 0 | call 2 from stonith_admin.1867754

It looks like the initiating node did the right thing: it asked node1 to reboot (not turn off) node2 using the xvm device. So my first thought is that either node1 somehow messed that up and executed off instead, or the xvm agent did. What do the logs on that node say?

Comment 2 Red Hat Bugzilla 2023-09-18 04:41:45 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days