Bug 2084626
| Summary: | fence_ipmilan should not leave resources in UNCLEAN state when status of failed node is known | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Matthew Secaur <msecaur> |
| Component: | fence-agents | Assignee: | Oyvind Albrigtsen <oalbrigt> |
| Status: | ASSIGNED | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | unspecified | Priority: | unspecified |
| Version: | 8.4 | CC: | broose, cluster-maint, kgaillot, nwahl, oalbrigt, sbradley |
| Target Milestone: | rc | Target Release: | --- |
| Hardware: | Unspecified | OS: | Unspecified |
| Flags: | msecaur: needinfo? (nwahl) | Type: | Bug |
Description (Matthew Secaur, 2022-05-12 14:31:29 UTC)
Here are the details in my reproducer environment:
# pcs status
Cluster name: lab
Cluster Summary:
* Stack: corosync
* Current DC: node1.matt.lab (version 2.1.2-4.el8-ada5c3b36e2) - partition with quorum
* Last updated: Thu May 12 09:27:01 2022
* Last change: Thu May 12 09:26:29 2022 by root via crm_resource on node1.matt.lab
* 3 nodes configured
* 6 resource instances configured
Node List:
* Online: [ node1.matt.lab node2.matt.lab node3.matt.lab ]
Full List of Resources:
* Resource Group: IP:
* IP-172.16.0.35 (ocf::heartbeat:IPaddr2): Started node2.matt.lab
* IP-172.16.0.36 (ocf::heartbeat:IPaddr2): Started node2.matt.lab
* IP-172.16.0.37 (ocf::heartbeat:IPaddr2): Started node2.matt.lab
* ipmi-node3 (stonith:fence_ipmilan): Started node1.matt.lab
* ipmi-node2 (stonith:fence_ipmilan): Started node3.matt.lab
* ipmi-node1 (stonith:fence_ipmilan): Started node3.matt.lab
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
# pcs stonith config
Resource: ipmi-node3 (class=stonith type=fence_ipmilan)
Attributes: ip=192.168.0.4 ipport=6236 lanplus=1 password=redhat pcmk_host_list=node3 username=admin
Operations: monitor interval=120s (ipmi-node3-monitor-interval-120s)
Resource: ipmi-node2 (class=stonith type=fence_ipmilan)
Attributes: ip=192.168.0.5 ipport=6236 lanplus=1 password=redhat pcmk_host_list=node2 username=admin
Operations: monitor interval=120s (ipmi-node2-monitor-interval-120s)
Resource: ipmi-node1 (class=stonith type=fence_ipmilan)
Attributes: ip=192.168.0.6 ipport=6236 lanplus=1 password=redhat pcmk_host_list=node1 username=admin
Operations: monitor interval=120s (ipmi-node1-monitor-interval-120s)
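(For reference, a stonith resource like the ones above could be created along the following lines. The exact creation commands are not shown in this case, so this is a reconstruction from the attributes in the config output:)
# pcs stonith create ipmi-node2 fence_ipmilan ip=192.168.0.5 ipport=6236 \
    lanplus=1 username=admin password=redhat pcmk_host_list=node2 \
    op monitor interval=120s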
(hypervisor)# mv node2.qcow2 node2.qcow2_broken <---- This step is to ensure that node2 will not boot when it is fenced
# fence_ipmilan -a 192.168.0.5 --username=admin --password=redhat --lanplus --ipport=6236 -o status
Status: ON
# pcs stonith fence node2
Error: unable to fence 'node2'
stonith_admin: Couldn't fence node2: Timer expired (Fence agent did not complete in time)
# fence_ipmilan -a 192.168.0.5 --username=admin --password=redhat --lanplus --ipport=6236 -o status
Status: OFF <----- We know the status of the failed node.
# pcs status
Cluster name: lab
Cluster Summary:
* Stack: corosync
* Current DC: node1.matt.lab (version 2.1.2-4.el8-ada5c3b36e2) - partition with quorum
* Last updated: Thu May 12 09:30:49 2022
* Last change: Thu May 12 09:26:29 2022 by root via crm_resource on node1.matt.lab
* 3 nodes configured
* 6 resource instances configured
Node List:
* Node node2.matt.lab: UNCLEAN (offline)
* Online: [ node1.matt.lab node3.matt.lab ]
Full List of Resources:
* Resource Group: IP:
* IP-172.16.0.35 (ocf::heartbeat:IPaddr2): Started node2.matt.lab (UNCLEAN)
* IP-172.16.0.36 (ocf::heartbeat:IPaddr2): Started node2.matt.lab (UNCLEAN)
* IP-172.16.0.37 (ocf::heartbeat:IPaddr2): Started node2.matt.lab (UNCLEAN)
* ipmi-node3 (stonith:fence_ipmilan): Started node1.matt.lab
* ipmi-node2 (stonith:fence_ipmilan): Started node3.matt.lab
* ipmi-node1 (stonith:fence_ipmilan): Started node3.matt.lab
Failed Fencing Actions:
* reboot of node2 failed (Fence agent did not complete in time): delegate=node3.matt.lab, client=stonith_admin.44518, origin=node1.matt.lab, last-failed='2022-05-12 09:30:39 -04:00'
* reboot of node2.matt.lab failed: delegate=, client=pacemaker-controld.24473, origin=node1.matt.lab, last-failed='2022-05-12 09:28:48 -04:00'
* reboot of node3 failed (Fence agent did not complete in time): delegate=node1.matt.lab, client=stonith_admin.43491, origin=node1.matt.lab, last-failed='2022-05-12 09:11:42 -04:00'
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
# pcs stonith confirm node2.matt.lab <---- Obviously, this works, but it shouldn't be necessary.
WARNING: If node node2.matt.lab is not powered off or it does have access to shared resources, data corruption and/or cluster failure may occur. Are you sure you want to continue? [y/N] y
Node: node2.matt.lab confirmed fenced
# pcs status
Cluster name: lab
Cluster Summary:
* Stack: corosync
* Current DC: node1.matt.lab (version 2.1.2-4.el8-ada5c3b36e2) - partition with quorum
* Last updated: Thu May 12 09:32:37 2022
* Last change: Thu May 12 09:26:29 2022 by root via crm_resource on node1.matt.lab
* 3 nodes configured
* 6 resource instances configured
Node List:
* Online: [ node1.matt.lab node3.matt.lab ]
* OFFLINE: [ node2.matt.lab ]
Full List of Resources:
* Resource Group: IP:
* IP-172.16.0.35 (ocf::heartbeat:IPaddr2): Started node1.matt.lab
* IP-172.16.0.36 (ocf::heartbeat:IPaddr2): Started node1.matt.lab
* IP-172.16.0.37 (ocf::heartbeat:IPaddr2): Started node1.matt.lab
* ipmi-node3 (stonith:fence_ipmilan): Started node1.matt.lab
* ipmi-node2 (stonith:fence_ipmilan): Started node3.matt.lab
* ipmi-node1 (stonith:fence_ipmilan): Started node3.matt.lab
Failed Fencing Actions:
* reboot of node2 failed (Fence agent did not complete in time): delegate=node3.matt.lab, client=stonith_admin.44518, origin=node1.matt.lab, last-failed='2022-05-12 09:30:39 -04:00'
* reboot of node2.matt.lab failed: delegate=, client=pacemaker-controld.24473, origin=node1.matt.lab, last-failed='2022-05-12 09:28:48 -04:00'
* reboot of node3 failed (Fence agent did not complete in time): delegate=node1.matt.lab, client=stonith_admin.43491, origin=node1.matt.lab, last-failed='2022-05-12 09:11:42 -04:00'
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
Ken Gaillot

Hi,

In this case the error was a timeout, so it's possible the command did succeed later (there might be a result reported in the log). In that case the fix would simply be to raise the timeout in the configuration (e.g. pcmk_reboot_timeout for the ipmi-* resources). Pacemaker has to respect the timeout; if the command takes longer than that, then even if it later succeeds, Pacemaker has to immediately begin handling it as a fencing failure.

If the timeout was due to the agent hanging after some unexpected sequence, then we can reassign this to fence-agents for further investigation.

Matthew Secaur (comment #3)

Hi, Ken,

Thanks for your prompt response. However, the customer issue was a mainboard failure, so the node was totally unavailable until it was repaired. Increasing the timeout (unless we increase it to several *days*) is therefore not the solution, especially when the resources are out of service that whole time. Sure, we can get them back in service with 'pcs stonith confirm ...', but this should not be required when we know the status of the node.

(N.B. if we did *not* know the status of the node, e.g. because the BMC itself failed, then UNCLEAN makes sense. But it does not make sense in this case, where the status of the node is knowable.)

If this BZ should be in another group, please let me know and I can reassign it.

Thanks again.

Ken Gaillot

(In reply to Matthew Secaur from comment #3)

Reassigning to fence-agents for further consideration.

If I follow you correctly, the IPMI off command will hang forever, even though the IPMI status command would immediately report that the node is already down. Really it sounds like this particular IPMI implementation is suboptimal (I'd expect off to return success if the node is already off).

As a workaround, fence_ipmilan could potentially run a status check before trying to turn the node off. However, I can see complications: if the status check delayed fencing significantly, it might not be worth helping this case at the expense of the common case, at least not unconditionally. It could be made a configurable option, or a new fence_ipmilan_status agent (one that returns success if status reports the host is already off, and fails otherwise) could be used in a fencing topology with fence_ipmilan as a fallback.

As an aside, if the BMC shares the same power as the mainboard, the best practice is to have a backup fencing method (such as an intelligent power switch) in a topology. Otherwise the cluster can't recover from power failures.
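For reference, the timeout increase suggested in the first comment would amount to something like the following on each ipmi-* resource; the 300s value is only an illustration, and the log path assumes RHEL 8 defaults:
# pcs stonith update ipmi-node2 pcmk_reboot_timeout=300s
# grep pacemaker-fenced /var/log/pacemaker/pacemaker.log    <---- check whether the action eventually succeeded
pcmk_reboot_timeout overrides the default stonith-timeout for that device's reboot action, so it only helps if the agent can actually complete within the new window.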
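The status pre-check idea can be sketched as a wrapper around the standalone agent, using the same options as in the reproducer above. This is only an illustration of the logic, not how the agent itself would implement it, and the script name is made up:
# cat > /usr/local/sbin/fence-node2-precheck <<'EOF'
#!/bin/sh
# If the node already reports OFF, treat fencing as already complete;
# otherwise fall through to a real power-off.
OPTS="-a 192.168.0.5 --username=admin --password=redhat --lanplus --ipport=6236"
if fence_ipmilan $OPTS -o status | grep -q 'Status: OFF'; then
    exit 0
fi
exec fence_ipmilan $OPTS -o off
EOF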
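And the topology fallback from the last comment might be wired up roughly as follows, assuming the hypothetical status-only agent were registered as a stonith resource named ipmi-node2-status and an intelligent power switch as pdu-node2 (both names are made up for illustration):
# pcs stonith level add 1 node2.matt.lab ipmi-node2-status
# pcs stonith level add 2 node2.matt.lab ipmi-node2
# pcs stonith level add 3 node2.matt.lab pdu-node2
Level 1 succeeds instantly when the host is already off, level 2 performs the actual IPMI power-off, and level 3 covers the case where the BMC itself has lost power.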