Customer environment: Red Hat OpenStack Platform based on RHOSP 16.2 Z4 (RHEL 8.4 [1])

Description of problem:
One of the controller nodes had a serious hardware issue and shut itself down. Pacemaker tried to power it back on via its IPMI device, but the BMC refused the power-on command. At that point, all resources owned by the node transitioned to UNCLEAN and were left in that state, even though the node has SBD defined as a second-level fence device. We had to manually confirm the STONITH action before Pacemaker brought the affected resources back online. We're using a single virtual fence_watchdog resource attached to all three controllers, as directed by your support in a previous exchange.

The controller died on February 4th at 21:08:06 CET, and we performed a manual stonith confirm on February 6th at 09:11:55.

What is the business impact? Please also provide time frame information.
Pre-production environment, loss of API availability.

Could you help us understand why the cluster wasn't able to recover from this hardware issue? Is any fix feasible to prevent this issue in the future?

[1]
# cat os-release
NAME="Red Hat Enterprise Linux"
VERSION="8.4 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.4"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.4 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8.4:GA"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/red_hat_enterprise_linux/8/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.4
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.4"
[root@rbruzzon etc]#

# grep pacemaker installed-rpms
ansible-pacemaker-1.0.4-2.20210623224811.666f706.el8ost.noarch    Wed Jul 27 09:20:05 2022
pacemaker-2.0.5-9.el8_4.5.x86_64                                  Wed Jul 27 09:17:56 2022
pacemaker-cli-2.0.5-9.el8_4.5.x86_64                              Wed Jul 27 09:17:55 2022
pacemaker-cluster-libs-2.0.5-9.el8_4.5.x86_64                     Wed Jul 27 09:17:56 2022
pacemaker-libs-2.0.5-9.el8_4.5.x86_64                             Wed Jul 27 09:17:55 2022
pacemaker-remote-2.0.5-9.el8_4.5.x86_64                           Wed Jul 27 09:18:54 2022
pacemaker-schemas-2.0.5-9.el8_4.5.noarch                          Wed Jul 27 09:17:55 2022
puppet-pacemaker-1.4.1-2.20220422005549.e0a869c.el8ost.noarch     Tue Jan 24 15:39:11 2023
The "watchdog is not eligible to fence <node>" part is reproducible as of pacemaker-2.1.4-5.el9_1.2.x86_64. The following code appears to add only the local node (and no other nodes) to device->targets, so that we can use the watchdog device only to fence ourselves and not to fence any other node. - https://github.com/ClusterLabs/pacemaker/blame/Pacemaker-2.1.4/daemons/fenced/fenced_commands.c#L1344-L1361 If that's intended behavior, then I'm not sure why. With that being said, on my reproducer cluster (using newer packages), so far there's no real issue: yes, we're unable to use `watchdog` to fence the node, but the node is declared "fenced" anyway after stonith-watchdog-timeout expires. Then resources are recovered. Maybe it's related to the particular timeouts in place, as proposed in chat, or perhaps it's an issue in an older version.
(In reply to Reid Wahl from comment #5)
> If that's intended behavior, then I'm not sure why.

Behavior is intended - sort of. If the node to be watchdog-fenced is alive, it will advertise the capability to self-fence and - assuming there is no other reason to self-fence here, such as being unclean or without quorum - get the job to do so.

The problem I'm seeing is how the timeout for a topology is evaluated. If there are two levels, the overall timeout is derived as 2x the standard timeout. So if the fence action on the first level times out, we still have one standard timeout for watchdog fencing to take place. This is what is observed when the node to be watchdog-fenced is available.

If that node isn't available, the timeout for the full topology is derived as just 1x the standard timeout, because watchdog fencing - as pointed out above - isn't reported as available by the node. In this scenario the whole timeout is used up by the first level timing out, and thus we don't have enough time left for the watchdog timeout. I haven't checked, but this optimized timeout evaluation might have been introduced after playing with this scenario initially - or the first level timing out was never actually tested, as we always had other errors; I don't remember.

The least intrusive way to cope with this would probably be to make the topology overall-timeout calculation watchdog-aware, in the sense that it would still add one timeout for watchdog fencing even if it is reported as not available. That alone would correct the behavior. If we entirely suppressed the unavailable message for watchdog fencing, we would lose information, so I'd recommend tweaking it in a way that makes it more obvious what is going on. These special cases for watchdog fencing are kind of ugly, so alternative ideas are welcome.
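To make the timeout arithmetic above concrete, here is a minimal C sketch (hypothetical names and an assumed 60-second per-level timeout; this is not Pacemaker code) contrasting the current overall-timeout derivation with the proposed watchdog-aware one for a two-level topology (IPMI, then watchdog):

/* Illustrative sketch of the topology timeout arithmetic - NOT Pacemaker code. */
#include <stdio.h>
#include <stdbool.h>

#define STANDARD_TIMEOUT 60 /* assumed per-level fencing timeout, in seconds */

/* Current behavior (as described above): the watchdog level only contributes
 * to the overall timeout if the target node reports it as available.
 */
static int overall_timeout_current(bool watchdog_reported_by_target)
{
    int total = STANDARD_TIMEOUT;            /* level 1: IPMI */
    if (watchdog_reported_by_target) {
        total += STANDARD_TIMEOUT;           /* level 2: watchdog */
    }
    return total;
}

/* Proposed watchdog-aware behavior: always budget one extra timeout for
 * watchdog fencing, even when the (dead) node cannot report it as available.
 */
static int overall_timeout_watchdog_aware(void)
{
    return STANDARD_TIMEOUT + STANDARD_TIMEOUT;
}

int main(void)
{
    /* Node alive: it advertises watchdog self-fencing, so both levels count. */
    printf("node alive, current:         %ds\n", overall_timeout_current(true));

    /* Node dead: watchdog not reported, so the IPMI level's timeout consumes
     * the entire budget and nothing is left for watchdog fencing.
     */
    printf("node dead,  current:         %ds\n", overall_timeout_current(false));
    printf("node dead,  watchdog-aware:  %ds\n", overall_timeout_watchdog_aware());
    return 0;
}

With the assumed 60s standard timeout, the current derivation yields 120s when the node is alive but only 60s when it is dead, while the watchdog-aware variant keeps 120s in both cases, leaving a full standard timeout for watchdog fencing after the first level times out.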
https://github.com/ClusterLabs/pacemaker/pull/3026

@nwahl: Do you think we still need the additional logs?
(In reply to Klaus Wenninger from comment #9)
> https://github.com/ClusterLabs/pacemaker/pull/3026
>
> @nwahl: Do you think we still need the additional logs?

It seems reproducible, so probably not, but it's good to know that we have them now :)