Bug 2084626
| Summary: | fence_ipmilan should not leave resources in UNCLEAN state when status of failed node is known | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Matthew Secaur <msecaur> |
| Component: | fence-agents | Assignee: | Oyvind Albrigtsen <oalbrigt> |
| Status: | ASSIGNED | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | unspecified | Priority: | unspecified |
| Version: | 8.4 | CC: | broose, cluster-maint, kgaillot, nwahl, oalbrigt, sbradley |
| Target Milestone: | rc | Target Release: | --- |
| Hardware: | Unspecified | OS: | Unspecified |
| Flags: | msecaur: needinfo? (nwahl) | Type: | Bug |
Description (Matthew Secaur, 2022-05-12 14:31:29 UTC)
Here are the details in my reproducer environment:
# pcs status
Cluster name: lab
Cluster Summary:
* Stack: corosync
* Current DC: node1.matt.lab (version 2.1.2-4.el8-ada5c3b36e2) - partition with quorum
* Last updated: Thu May 12 09:27:01 2022
* Last change: Thu May 12 09:26:29 2022 by root via crm_resource on node1.matt.lab
* 3 nodes configured
* 6 resource instances configured
Node List:
* Online: [ node1.matt.lab node2.matt.lab node3.matt.lab ]
Full List of Resources:
* Resource Group: IP:
* IP-172.16.0.35 (ocf::heartbeat:IPaddr2): Started node2.matt.lab
* IP-172.16.0.36 (ocf::heartbeat:IPaddr2): Started node2.matt.lab
* IP-172.16.0.37 (ocf::heartbeat:IPaddr2): Started node2.matt.lab
* ipmi-node3 (stonith:fence_ipmilan): Started node1.matt.lab
* ipmi-node2 (stonith:fence_ipmilan): Started node3.matt.lab
* ipmi-node1 (stonith:fence_ipmilan): Started node3.matt.lab
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
# pcs stonith config
Resource: ipmi-node3 (class=stonith type=fence_ipmilan)
Attributes: ip=192.168.0.4 ipport=6236 lanplus=1 password=redhat pcmk_host_list=node3 username=admin
Operations: monitor interval=120s (ipmi-node3-monitor-interval-120s)
Resource: ipmi-node2 (class=stonith type=fence_ipmilan)
Attributes: ip=192.168.0.5 ipport=6236 lanplus=1 password=redhat pcmk_host_list=node2 username=admin
Operations: monitor interval=120s (ipmi-node2-monitor-interval-120s)
Resource: ipmi-node1 (class=stonith type=fence_ipmilan)
Attributes: ip=192.168.0.6 ipport=6236 lanplus=1 password=redhat pcmk_host_list=node1 username=admin
Operations: monitor interval=120s (ipmi-node1-monitor-interval-120s)
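(For reference, a stonith resource like the ones above could be created along the following lines. The exact creation commands are not shown in this case, so this is a reconstruction from the attributes in the config output:)
# pcs stonith create ipmi-node2 fence_ipmilan ip=192.168.0.5 ipport=6236 \
    lanplus=1 username=admin password=redhat pcmk_host_list=node2 \
    op monitor interval=120s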
(hypervisor)# mv node2.qcow2 node2.qcow2_broken <---- This step is to ensure that node2 will not boot when it is fenced
# fence_ipmilan -a 192.168.0.5 --username=admin --password=redhat --lanplus --ipport=6236 -o status
Status: ON
# pcs stonith fence node2
Error: unable to fence 'node2'
stonith_admin: Couldn't fence node2: Timer expired (Fence agent did not complete in time)
# fence_ipmilan -a 192.168.0.5 --username=admin --password=redhat --lanplus --ipport=6236 -o status
Status: OFF <----- We know the status of the failed node.
# pcs status
Cluster name: lab
Cluster Summary:
* Stack: corosync
* Current DC: node1.matt.lab (version 2.1.2-4.el8-ada5c3b36e2) - partition with quorum
* Last updated: Thu May 12 09:30:49 2022
* Last change: Thu May 12 09:26:29 2022 by root via crm_resource on node1.matt.lab
* 3 nodes configured
* 6 resource instances configured
Node List:
* Node node2.matt.lab: UNCLEAN (offline)
* Online: [ node1.matt.lab node3.matt.lab ]
Full List of Resources:
* Resource Group: IP:
* IP-172.16.0.35 (ocf::heartbeat:IPaddr2): Started node2.matt.lab (UNCLEAN)
* IP-172.16.0.36 (ocf::heartbeat:IPaddr2): Started node2.matt.lab (UNCLEAN)
* IP-172.16.0.37 (ocf::heartbeat:IPaddr2): Started node2.matt.lab (UNCLEAN)
* ipmi-node3 (stonith:fence_ipmilan): Started node1.matt.lab
* ipmi-node2 (stonith:fence_ipmilan): Started node3.matt.lab
* ipmi-node1 (stonith:fence_ipmilan): Started node3.matt.lab
Failed Fencing Actions:
* reboot of node2 failed (Fence agent did not complete in time): delegate=node3.matt.lab, client=stonith_admin.44518, origin=node1.matt.lab, last-failed='2022-05-12 09:30:39 -04:00'
* reboot of node2.matt.lab failed: delegate=, client=pacemaker-controld.24473, origin=node1.matt.lab, last-failed='2022-05-12 09:28:48 -04:00'
* reboot of node3 failed (Fence agent did not complete in time): delegate=node1.matt.lab, client=stonith_admin.43491, origin=node1.matt.lab, last-failed='2022-05-12 09:11:42 -04:00'
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
# pcs stonith confirm node2.matt.lab <---- Obviously, this works, but it shouldn't be necessary.
WARNING: If node node2.matt.lab is not powered off or it does have access to shared resources, data corruption and/or cluster failure may occur. Are you sure you want to continue? [y/N] y
Node: node2.matt.lab confirmed fenced
# pcs status
Cluster name: lab
Cluster Summary:
* Stack: corosync
* Current DC: node1.matt.lab (version 2.1.2-4.el8-ada5c3b36e2) - partition with quorum
* Last updated: Thu May 12 09:32:37 2022
* Last change: Thu May 12 09:26:29 2022 by root via crm_resource on node1.matt.lab
* 3 nodes configured
* 6 resource instances configured
Node List:
* Online: [ node1.matt.lab node3.matt.lab ]
* OFFLINE: [ node2.matt.lab ]
Full List of Resources:
* Resource Group: IP:
* IP-172.16.0.35 (ocf::heartbeat:IPaddr2): Started node1.matt.lab
* IP-172.16.0.36 (ocf::heartbeat:IPaddr2): Started node1.matt.lab
* IP-172.16.0.37 (ocf::heartbeat:IPaddr2): Started node1.matt.lab
* ipmi-node3 (stonith:fence_ipmilan): Started node1.matt.lab
* ipmi-node2 (stonith:fence_ipmilan): Started node3.matt.lab
* ipmi-node1 (stonith:fence_ipmilan): Started node3.matt.lab
Failed Fencing Actions:
* reboot of node2 failed (Fence agent did not complete in time): delegate=node3.matt.lab, client=stonith_admin.44518, origin=node1.matt.lab, last-failed='2022-05-12 09:30:39 -04:00'
* reboot of node2.matt.lab failed: delegate=, client=pacemaker-controld.24473, origin=node1.matt.lab, last-failed='2022-05-12 09:28:48 -04:00'
* reboot of node3 failed (Fence agent did not complete in time): delegate=node1.matt.lab, client=stonith_admin.43491, origin=node1.matt.lab, last-failed='2022-05-12 09:11:42 -04:00'
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
Ken Gaillot

Hi,

In this case the error was a timeout, so it's possible the command did succeed later (there might be a result reported in the log). In that case the fix would simply be to raise the timeout in the configuration (e.g. pcmk_reboot_timeout for the ipmi-* resources). Pacemaker has to respect the timeout; if the command takes longer than that, then even if it later succeeds, Pacemaker has to immediately begin handling it as a fencing failure.

If the timeout was due to the agent hanging after some unexpected sequence, then we can reassign this to fence-agents for further investigation.

Matthew Secaur (comment #3)

Hi, Ken,

Thanks for your prompt response. However, the customer issue was a mainboard failure, so the node was totally unavailable until it was repaired. Increasing the timeout (unless we increase it to several *days*) is therefore not the solution, especially when the resources are out of service that whole time. Sure, we can get them back in service with 'pcs stonith confirm ...', but this should not be required when we know the status of the node.

(N.B. if we did *not* know the status of the node, e.g. because the BMC itself failed, then UNCLEAN makes sense. But it does not make sense in this case, where the status of the node is knowable.)

If this BZ should be in another group, please let me know and I can reassign it.

Thanks again.

Ken Gaillot

(In reply to Matthew Secaur from comment #3)

Reassigning to fence-agents for further consideration.

If I follow you correctly, the IPMI off command will hang forever, even though the IPMI status command would immediately report that the node is already down. Really it sounds like this particular IPMI implementation is suboptimal (I'd expect off to return success if the node is already off).

As a workaround, fence_ipmilan could potentially run a status check before trying to turn the node off. However, I can see complications: if the status check delayed fencing significantly, it might not be worth helping this case at the expense of the common case, at least not unconditionally. It could be made a configurable option, or a new fence_ipmilan_status agent (one that returns success if status reports the host is already off, and fails otherwise) could be used in a fencing topology with fence_ipmilan as a fallback.

As an aside, if the BMC shares the same power as the mainboard, the best practice is to have a backup fencing method (such as an intelligent power switch) in a topology. Otherwise the cluster can't recover from power failures.
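For reference, the timeout increase suggested in the first comment would amount to something like the following on each ipmi-* resource; the 300s value is only an illustration, and the log path assumes RHEL 8 defaults:
# pcs stonith update ipmi-node2 pcmk_reboot_timeout=300s
# grep pacemaker-fenced /var/log/pacemaker/pacemaker.log    <---- check whether the action eventually succeeded
pcmk_reboot_timeout overrides the default stonith-timeout for that device's reboot action, so it only helps if the agent can actually complete within the new window.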
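The status pre-check idea can be sketched as a wrapper around the standalone agent, using the same options as in the reproducer above. This is only an illustration of the logic, not how the agent itself would implement it, and the script name is made up:
# cat > /usr/local/sbin/fence-node2-precheck <<'EOF'
#!/bin/sh
# If the node already reports OFF, treat fencing as already complete;
# otherwise fall through to a real power-off.
OPTS="-a 192.168.0.5 --username=admin --password=redhat --lanplus --ipport=6236"
if fence_ipmilan $OPTS -o status | grep -q 'Status: OFF'; then
    exit 0
fi
exec fence_ipmilan $OPTS -o off
EOF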
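And the topology fallback from the last comment might be wired up roughly as follows, assuming the hypothetical status-only agent were registered as a stonith resource named ipmi-node2-status and an intelligent power switch as pdu-node2 (both names are made up for illustration):
# pcs stonith level add 1 node2.matt.lab ipmi-node2-status
# pcs stonith level add 2 node2.matt.lab ipmi-node2
# pcs stonith level add 3 node2.matt.lab pdu-node2
Level 1 succeeds instantly when the host is already off, level 2 performs the actual IPMI power-off, and level 3 covers the case where the BMC itself has lost power.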