Bug 2168633
| Field | Value |
|---|---|
| Summary: | [BDI] Pacemaker resources left UNCLEAN after controller node failure |
| Product: | Red Hat Enterprise Linux 8 |
| Component: | pacemaker |
| Version: | 8.4 |
| Hardware: | x86_64 |
| OS: | All |
| Status: | CLOSED ERRATA |
| Severity: | urgent |
| Priority: | high |
| Reporter: | Riccardo Bruzzone <rbruzzon> |
| Assignee: | Klaus Wenninger <kwenning> |
| QA Contact: | cluster-qe <cluster-qe> |
| Docs Contact: | Steven J. Levine <slevine> |
| CC: | cfeist, cluster-maint, jrehova, kgaillot, kwenning, lmiccini, matteo.panella, mjuricek, nwahl, sbradley, slevine |
| Keywords: | Triaged, ZStream |
| Flags: | pm-rhel: mirror+ |
| Target Milestone: | rc |
| Target Release: | 8.9 |
| Target Upstream Version: | 2.1.6 |
| Fixed In Version: | pacemaker-2.1.6-1.el8 |
| Doc Type: | Bug Fix |
| Doc Text: | .A fence watchdog configured as a second fencing device now fences a node when the first device times out. Previously, when a watchdog fencing device was configured as the second device in a fencing topology, the watchdog timeout was not considered when calculating the timeout for the fencing operation. As a result, if the first device timed out, the fencing operation would time out even though the watchdog would fence the node. With this fix, the watchdog timeout is included in the fencing operation timeout, and the fencing operation succeeds if the first device times out. |
| Bug Blocks: | 2182482, 2187419, 2187421, 2187422, 2187423 |
| Type: | Bug |
| Last Closed: | 2023-11-14 15:32:36 UTC |
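For context, the scenario in the doc text above, a regular fence device at level 1 with watchdog fencing as the fallback level, might be configured roughly as follows. This is only a hedged sketch: the node name, fence agent, and credentials are placeholders, referencing the implicit `watchdog` device in a topology level assumes sbd is running in watchdog-only mode, and the exact pcs invocation on a given release may differ.

```sh
# Illustrative sketch of a two-level fencing topology with watchdog fallback.
# Device name, node name, and credentials are placeholders.

# Level-1 fence device (example: IPMI)
pcs stonith create fence_ipmi_node1 fence_ipmilan \
    ip=192.0.2.10 username=admin password=secret \
    pcmk_host_list=node1

# Enable watchdog fencing (requires sbd in watchdog-only mode)
pcs property set stonith-watchdog-timeout=30

# Topology: try the IPMI device first, fall back to watchdog fencing
pcs stonith level add 1 node1 fence_ipmi_node1
pcs stonith level add 2 node1 watchdog
```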
Description
Riccardo Bruzzone
2023-02-09 15:38:21 UTC
The "watchdog is not eligible to fence <node>" part is reproducible as of pacemaker-2.1.4-5.el9_1.2.x86_64. The following code appears to add only the local node (and no other nodes) to device->targets, so that we can use the watchdog device only to fence ourselves and not to fence any other node. - https://github.com/ClusterLabs/pacemaker/blame/Pacemaker-2.1.4/daemons/fenced/fenced_commands.c#L1344-L1361 If that's intended behavior, then I'm not sure why. With that being said, on my reproducer cluster (using newer packages), so far there's no real issue: yes, we're unable to use `watchdog` to fence the node, but the node is declared "fenced" anyway after stonith-watchdog-timeout expires. Then resources are recovered. Maybe it's related to the particular timeouts in place, as proposed in chat, or perhaps it's an issue in an older version. (In reply to Reid Wahl from comment #5) > > If that's intended behavior, then I'm not sure why. > Behavior is intended - sort of. If the node to be watchdog-fenced is alive it will then advertise capability to self-fence and - assuming there is no other reason here to self fence like being unclean or without quorum - get the job to do so. Problem I'm seeing is how timeout for a topology is evaluated. If there are 2 levels the timeout will be derived to 2x the standard timeout. So if the fence-action on the 1st level times out we still have a standard-timeout for watchdog-fencing to take place. This is what is observed when the node to be watchdog-fenced is available. In case this node isn't available, timeout for the full topology is derived to just 1x standard timeout as watchdog-fencing - as pointed out above - isn't reported as available by the node. In this scenario all timeout is used up by the first level timing out and thus we haven't got enough time left for the watchdog-timeout. Haven't checked but this optimized timeout evaluation might have been introduced after playing with this scenario initially - or first level timing out was actually never tested as we always had other errors - don't remember ... The least intrusive way to cope with this would probably be to make topology overall-timeout calculation watchdog-aware in a sense that it would still add one timeout for watchdog-fencing - even if reported not available. This would already correct the behavior. When we entirely suppress the unavailable-message for watchdog-fencing we would loose information. Thus I'd recommend tweaking it in a way that it becomes more obvious what is going on. These special cases for watchdog-fencing are kind of ugly and thus alternative ideas are welcome. https://github.com/ClusterLabs/pacemaker/pull/3026 @nwahl: Do you think we still need the additional logs? (In reply to Klaus Wenninger from comment #9) > https://github.com/ClusterLabs/pacemaker/pull/3026 > > @nwahl: Do you think we still need the additional logs? It seems reproducible so probably not, but it's good to know that we have them now :) Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2023:6970 |