Customer environment: Red Hat OpenStack Platform based on RHOSP 16.2 Z4 (RHEL 8.4 [1])

Description of problem:
One of the controller nodes had a serious hardware issue and shut itself down. Pacemaker tried to power it back on via its IPMI device, but the BMC refused the power-on command. At that point, all resources owned by the node transitioned to UNCLEAN and were left in that state, even though the node has SBD defined as a second-level fence device. We had to manually confirm the STONITH action before Pacemaker brought the affected resources back online. We're using a single virtual fence_watchdog resource attached to all three controllers, as directed by your support in a previous exchange.

The controller died on February 4th at 21:08:06 CET, and we performed a manual stonith confirm on February 6th at 09:11:55.

What is the business impact? Please also provide time frame information.
Pre-production environment, loss of API availability.

Could you help us understand why the cluster wasn't able to recover from this hardware issue? Is any fix feasible to prevent this issue in the future?

[1]
# cat os-release
NAME="Red Hat Enterprise Linux"
VERSION="8.4 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.4"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.4 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8.4:GA"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/red_hat_enterprise_linux/8/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.4
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.4"
[root@rbruzzon etc]#

# grep pacemaker installed-rpms
ansible-pacemaker-1.0.4-2.20210623224811.666f706.el8ost.noarch    Wed Jul 27 09:20:05 2022
pacemaker-2.0.5-9.el8_4.5.x86_64                                  Wed Jul 27 09:17:56 2022
pacemaker-cli-2.0.5-9.el8_4.5.x86_64                              Wed Jul 27 09:17:55 2022
pacemaker-cluster-libs-2.0.5-9.el8_4.5.x86_64                     Wed Jul 27 09:17:56 2022
pacemaker-libs-2.0.5-9.el8_4.5.x86_64                             Wed Jul 27 09:17:55 2022
pacemaker-remote-2.0.5-9.el8_4.5.x86_64                           Wed Jul 27 09:18:54 2022
pacemaker-schemas-2.0.5-9.el8_4.5.noarch                          Wed Jul 27 09:17:55 2022
puppet-pacemaker-1.4.1-2.20220422005549.e0a869c.el8ost.noarch     Tue Jan 24 15:39:11 2023
The "watchdog is not eligible to fence <node>" part is reproducible as of pacemaker-2.1.4-5.el9_1.2.x86_64. The following code appears to add only the local node (and no other nodes) to device->targets, so that we can use the watchdog device only to fence ourselves and not to fence any other node. - https://github.com/ClusterLabs/pacemaker/blame/Pacemaker-2.1.4/daemons/fenced/fenced_commands.c#L1344-L1361 If that's intended behavior, then I'm not sure why. With that being said, on my reproducer cluster (using newer packages), so far there's no real issue: yes, we're unable to use `watchdog` to fence the node, but the node is declared "fenced" anyway after stonith-watchdog-timeout expires. Then resources are recovered. Maybe it's related to the particular timeouts in place, as proposed in chat, or perhaps it's an issue in an older version.
(In reply to Reid Wahl from comment #5)
> If that's intended behavior, then I'm not sure why.

Behavior is intended - sort of. If the node to be watchdog-fenced is alive, it will advertise the capability to self-fence and - assuming there is no other reason to self-fence here, such as being unclean or without quorum - get the job to do so.

The problem I'm seeing is how the timeout for a topology is evaluated. If there are two levels, the overall timeout is derived as 2x the standard timeout. So if the fence action on the first level times out, we still have one standard timeout for watchdog fencing to take place. This is what is observed when the node to be watchdog-fenced is available.

If that node isn't available, the timeout for the full topology is derived as just 1x the standard timeout, because watchdog fencing - as pointed out above - isn't reported as available by the node. In this scenario the whole timeout is used up by the first level timing out, and thus we don't have enough time left for the watchdog timeout. I haven't checked, but this optimized timeout evaluation might have been introduced after playing with this scenario initially - or the first level timing out was never actually tested, as we always had other errors; I don't remember.

The least intrusive way to cope with this would probably be to make the topology overall-timeout calculation watchdog-aware, in the sense that it would still add one timeout for watchdog fencing even if it is reported as not available. That alone would correct the behavior. If we entirely suppressed the unavailable message for watchdog fencing, we would lose information, so I'd recommend tweaking it in a way that makes it more obvious what is going on. These special cases for watchdog fencing are kind of ugly, so alternative ideas are welcome.
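To make the timeout arithmetic above concrete, here is a minimal C sketch (hypothetical names and an assumed 60-second per-level timeout; this is not Pacemaker code) contrasting the current overall-timeout derivation with the proposed watchdog-aware one for a two-level topology (IPMI, then watchdog):

/* Illustrative sketch of the topology timeout arithmetic - NOT Pacemaker code. */
#include <stdio.h>
#include <stdbool.h>

#define STANDARD_TIMEOUT 60 /* assumed per-level fencing timeout, in seconds */

/* Current behavior (as described above): the watchdog level only contributes
 * to the overall timeout if the target node reports it as available.
 */
static int overall_timeout_current(bool watchdog_reported_by_target)
{
    int total = STANDARD_TIMEOUT;            /* level 1: IPMI */
    if (watchdog_reported_by_target) {
        total += STANDARD_TIMEOUT;           /* level 2: watchdog */
    }
    return total;
}

/* Proposed watchdog-aware behavior: always budget one extra timeout for
 * watchdog fencing, even when the (dead) node cannot report it as available.
 */
static int overall_timeout_watchdog_aware(void)
{
    return STANDARD_TIMEOUT + STANDARD_TIMEOUT;
}

int main(void)
{
    /* Node alive: it advertises watchdog self-fencing, so both levels count. */
    printf("node alive, current:         %ds\n", overall_timeout_current(true));

    /* Node dead: watchdog not reported, so the IPMI level's timeout consumes
     * the entire budget and nothing is left for watchdog fencing.
     */
    printf("node dead,  current:         %ds\n", overall_timeout_current(false));
    printf("node dead,  watchdog-aware:  %ds\n", overall_timeout_watchdog_aware());
    return 0;
}

With the assumed 60s standard timeout, the current derivation yields 120s when the node is alive but only 60s when it is dead, while the watchdog-aware variant keeps 120s in both cases, leaving a full standard timeout for watchdog fencing after the first level times out.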
https://github.com/ClusterLabs/pacemaker/pull/3026

@nwahl: Do you think we still need the additional logs?
(In reply to Klaus Wenninger from comment #9)
> https://github.com/ClusterLabs/pacemaker/pull/3026
>
> @nwahl: Do you think we still need the additional logs?

It seems reproducible, so probably not, but it's good to know that we have them now :)