Bug 1797579 - Pacemaker fencer should retry failed meta-data commands
Summary: Pacemaker fencer should retry failed meta-data commands
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: pacemaker
Version: 8.0
Hardware: x86_64
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Target Release: 8.5
Assignee: Oyvind Albrigtsen
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Duplicates: 1844500 (view as bug list)
Depends On:
Blocks: 1992014
 
Reported: 2020-02-03 12:47 UTC by Takashi Kajinami
Modified: 2024-03-06 16:09 UTC (History)
CC List: 11 users

Fixed In Version: pacemaker-2.1.0-1.el8
Doc Type: Bug Fix
Doc Text:
Cause: If a fence agent meta-data request fails, Pacemaker previously would consider that result final. While this would be the best response for a permanent issue such as a nonconforming agent, the failure could be transient, for example when Pacemaker is enabled at boot and heavy CPU and/or I/O load at boot causes the meta-data request to time out.
Consequence: Pacemaker would not know if the fence agent supports unfencing or other optional features, and would act as if it did not, which could cause unfencing failures and other issues.
Fix: If Pacemaker is unable to obtain fence agent meta-data on the first try, it will now periodically retry.
Result: Although problems could still occur during a short time window, the meta-data will be obtained when feasible if the failure was transient, and the issues should be recoverable.
Clone Of:
Clones: 1992014 (view as bug list)
Environment:
Last Closed: 2021-11-09 18:44:49 UTC
Type: Bug
Target Upstream Version: 2.1.0
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 6534651 0 None None None 2021-11-23 12:50:43 UTC
Red Hat Product Errata RHEA-2021:4267 0 None None None 2021-11-09 18:45:25 UTC

Description Takashi Kajinami 2020-02-03 12:47:45 UTC
Description of problem:

When Instance HA is enabled, we have a fence_compute resource deployed,
which is used to mark the nova-compute service on a failed compute node as down.

We expect the fence_compute resource to be used only for compute node failures,
but in fact the resource is also invoked to fence controller nodes.

In some cases, for example when the controller node holding the VIP fails,
we see that the fence_compute resource is invoked to fence that controller node,
and we see the following problems caused by this invocation:
 - disabling nova-compute failed
 - the stop operation on fence_compute timed out, which causes UNCLEAN nodes


How reproducible:
We observed the problem once, but it should always be reproducible.

Steps to Reproduce:
1. Deploy the overcloud with Instance HA enabled
2. Cause a failure on the controller node where the VIP is running

Actual results:
fence_compute is invoked for the failed controller nodes

Expected results:
fence_compute is not invoked for the failed controller nodes

Additional info:

Comment 18 Ken Gaillot 2020-03-09 17:53:41 UTC
When a fence device is registered with Pacemaker's fencer, which happens when the cluster starts, for devices already in the configuration, the fencer attempts to execute the fence agent's meta-data command. If this fails, the fencer is unable to know what the agent supports (such as the list or status commands for dynamic host lists), and so assumes no capabilities. This can cause problems (such as assuming the device can fence all hosts, rather than attempting to get a dynamic host list).

In this case, it appears the meta-data command timed out when pacemaker was started at host boot, and other processes were causing disk I/O slowness.

To be more resilient during such transient issues, a possible enhancement would be for the fencer to remember when a meta-data command failed, and retry it periodically.
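
A rough sketch of that approach (not Pacemaker's actual implementation), assuming a GLib main loop like the one the fencer already runs; run_metadata_command() below is a hypothetical stand-in for executing the agent's meta-data action and simply fakes two transient failures:

/* Hypothetical sketch of periodic meta-data retries, not the real fencer code.
 * Build: gcc retry.c $(pkg-config --cflags --libs glib-2.0) */
#include <glib.h>
#include <stdbool.h>
#include <stdio.h>

#define METADATA_RETRY_SECS 2

static int attempts = 0;
static GMainLoop *loop = NULL;

/* Stand-in for running the fence agent's meta-data action. */
static bool
run_metadata_command(const char *agent)
{
    attempts++;
    printf("%s meta-data attempt %d\n", agent, attempts);
    return attempts >= 3;   /* simulate two transient failures */
}

static gboolean
retry_metadata_cb(gpointer user_data)
{
    const char *agent = user_data;

    if (run_metadata_command(agent)) {
        printf("meta-data obtained; device capabilities now known\n");
        g_main_loop_quit(loop);
        return G_SOURCE_REMOVE;     /* success: stop the retry timer */
    }
    return G_SOURCE_CONTINUE;       /* still failing: keep retrying */
}

int
main(void)
{
    const char *agent = "fence_compute";

    if (!run_metadata_command(agent)) {
        /* First attempt failed: assume no optional capabilities for now,
         * but schedule periodic retries rather than treating it as final. */
        g_timeout_add_seconds(METADATA_RETRY_SECS, retry_metadata_cb,
                              (gpointer) agent);
        loop = g_main_loop_new(NULL, FALSE);
        g_main_loop_run(loop);
        g_main_loop_unref(loop);
    }
    return 0;
}

The key point is that the timer callback keeps returning G_SOURCE_CONTINUE until the meta-data command succeeds, at which point the agent's capabilities can be recorded and the timer removed.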

RHEL 7 is no longer receiving such enhancements, so I am reassigning this bz to RHEL 8.

Comment 22 Ken Gaillot 2020-06-18 20:31:05 UTC
*** Bug 1844500 has been marked as a duplicate of this bug. ***

Comment 25 Ken Gaillot 2021-03-02 16:56:37 UTC
Fixed upstream as of commit 1d33712

Comment 38 errata-xmlrpc 2021-11-09 18:44:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:4267

