Bug 2125344 - During a rolling upgrade, monitor operations are not being communicated between nodes as expected.
Summary: During a rolling upgrade, monitor operations are not being communicated between nodes as expected.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: pacemaker
Version: 9.1
Hardware: All
OS: All
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 9.2
Assignee: Ken Gaillot
QA Contact: cluster-qe@redhat.com
Docs Contact: Steven J. Levine
URL:
Whiteboard:
Depends On: 2121852
Blocks: 2128036 2128037
 
Reported: 2022-09-08 16:53 UTC by Ken Gaillot
Modified: 2023-05-09 07:38 UTC
CC List: 10 users

Fixed In Version: pacemaker-2.1.5-1.el9
Doc Type: Bug Fix
Doc Text:
.OCF resource agent metadata actions can now call `crm_node` without causing unexpected fencing
As of RHEL 8.5, OCF resource agent metadata actions blocked the controller, and `crm_node` queries performed controller requests. As a result, if an agent's metadata action called `crm_node`, it blocked the controller for 30 seconds until the action timed out. This could cause other actions to fail and the node to be fenced. With this fix, the controller now performs metadata actions asynchronously. An OCF resource agent metadata action can now call `crm_node` without issue.
Clone Of: 2121852
: 2128036 2128037
Environment:
Last Closed: 2023-05-09 07:18:17 UTC
Type: Bug
Target Upstream Version: 2.1.5
Embargoed:
pm-rhel: mirror+


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker CLUSTERQE-6229 0 None None None 2022-12-01 17:22:42 UTC
Red Hat Issue Tracker RHELPLAN-133568 0 None None None 2022-09-08 17:02:44 UTC
Red Hat Knowledge Base (Solution) 6973540 0 None None None 2022-09-08 18:30:10 UTC
Red Hat Product Errata RHBA-2023:2150 0 None None None 2023-05-09 07:18:40 UTC

Description Ken Gaillot 2022-09-08 16:53:15 UTC
+++ This bug was initially created as a clone of Bug #2121852 +++

Description of problem:

During a rolling upgrade, when the updated node rejoins, the non-updated node logs `Node nodeA did not send monitor result (via controller) within 120000ms (action timeout plus cluster-delay)`. This results in a monitor failure, then a stop failure, and finally fencing.


Aug 26 14:32:35 nodeB pacemaker-controld[50166]: error: Node nodeA did not send monitor result (via controller) within 120000ms (action timeout plus cluster-delay)
Aug 26 14:32:35 nodeB pacemaker-controld[50166]: error: [Action   55]: In-flight resource op lsnr_pruledbprd_monitor_0    on nodeA (priority: 0, waiting: (null))
Aug 26 14:32:35 nodeB pacemaker-controld[50166]: warning: rsc_op 55: lsnr_pruledbprd_monitor_0 on nodeA timed out



It seems that nodeB is not receiving monitor results from nodeA within the expected timeframe, which causes resource failures that lead to fence events. Based on the timing of the log messages, this appears to be a bug. Perhaps there is a difference in how monitor operations are scheduled between pacemaker-2.1.0-8.el8 and pacemaker-2.1.2-4.el8_6.2.

For example, at 14:30:35, node B initiates probes of all the resources on node A; that is when the timer starts counting down. But node A runs the probes serially, so during this 30-second window it is running only ONE of them:

Aug 26 14:30:35 nodeA pacemaker-controld[47747]: notice: Requesting local execution of probe operation for srdf_pruledb_PRD on nodeA
Aug 26 14:31:05 nodeA pacemaker-controld[47747]: error: Failed to retrieve meta-data for ocf:sky:emc_srdf

It fails to get metadata for several of these resources, and each failure takes 30 seconds to time out. So it never even tries to run the lsnr probe (the resource that ultimately caused the failure message on node B) within the 120 seconds allowed for node B to receive the result. 'Action timeout plus cluster-delay' means the 60-second action timeout plus an additional 60-second (by default) grace period to receive the result after the timeout expires, i.e. 120 seconds in total.
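
As a rough illustration (pcs syntax can vary slightly between versions, and lsnr_pruledbprd is simply the resource named in the log above), the two halves of that 120-second window can be inspected on a running cluster:

# cluster-delay is a cluster property; it defaults to 60s when not set explicitly
pcs property list --all | grep cluster-delay

# the other 60s is the operation timeout configured for the resource's probe/monitor
pcs resource config lsnr_pruledbprd | grep -i timeout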



Version-Release number of selected component (if applicable):

Node not receiving the monitor result is on pacemaker-2.1.0-8.el8

Updated node is on pacemaker-2.1.2-4.el8_6.2

<snip>

--- Additional comment from Ken Gaillot on 2022-09-07 19:40:09 UTC ---

No further information is needed from the customer.

Since upstream Pacemaker release 2.1.0 (RHEL 8.5 and 9.0 beta), crm_node -l gets the node list via the controller, rather than querying Corosync directly.

Since metadata calls are executed synchronously in the controller, they block all other activity by the controller. Thus, if the metadata action runs crm_node, crm_node will be unable to connect to the controller, and the metadata action will hang until it times out.
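
For illustration, here is a minimal sketch of the pattern that triggers the hang (the agent and its metadata XML are hypothetical; in the customer logs above, the affected metadata action belongs to ocf:sky:emc_srdf):

meta_data() {
    # Since Pacemaker 2.1.0, crm_node -l asks pacemaker-controld for the node
    # list. The controller is blocked waiting for this very meta-data action
    # to finish, so the query is never answered and the action hangs until
    # its 30-second timeout expires.
    crm_node -l
    cat <<END
<?xml version="1.0"?>
<resource-agent name="example">
  <!-- agent metadata XML would go here -->
</resource-agent>
END
}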

The long-term solution in Bug 1850506 will be better, but as a quicker fix I'm thinking we can make the metadata call asynchronous without the rest of that solution. The metadata call would still eat into the probe timeout, but because the controller could still handle the crm_node connection, the metadata action could proceed and should complete quickly.

--- Additional comment from Ken Gaillot on 2022-09-07 19:48:31 UTC ---

A workaround would be to avoid calling crm_node for the metadata action.
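
For example, a hypothetical agent excerpt (illustrative names, not the actual ocf:sky:emc_srdf code, and assuming meta_data() and run_action() are defined elsewhere in the agent) that keeps the node-list query out of the meta-data path:

# The requested action is passed to the agent as its first argument.
case "$1" in
    meta-data)
        meta_data                  # print the metadata XML only; no crm_node here
        exit 0                     # OCF_SUCCESS
        ;;
    start|stop|monitor)
        node_list=$(crm_node -l)   # safe here: the controller can answer
        run_action "$1" "$node_list"
        ;;
esac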

Comment 2 Ken Gaillot 2022-09-08 16:58:49 UTC
QA: Reproducer:

* Configure a cluster.
* Edit or copy /usr/lib/ocf/resource.d/pacemaker/Dummy and add "crm_node -l" to the metadata action.
* Create a resource using Dummy (or the copy).

Actual results: The cluster hangs for 30 seconds, and then the controller logs a metadata action timeout.

Expected results: The cluster proceeds normally.
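
A shell sketch of this reproducer (the sed one-liner assumes GNU sed and edits the packaged Dummy agent in place, much as the verification in comment 11 below does):

# Insert the crm_node call at the top of Dummy's meta_data() function.
sed -i '/^meta_data() {/a crm_node -l >/dev/null 2>&1' \
    /usr/lib/ocf/resource.d/pacemaker/Dummy

# Create a resource from the modified agent and watch how the cluster reacts.
pcs resource create rsc_dummy ocf:pacemaker:Dummy
grep rsc_dummy /var/log/messages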

Comment 5 Ken Gaillot 2022-09-20 19:51:38 UTC
Fixed in upstream main branch as of commit bc852fe3

Comment 11 jrehova 2022-12-05 15:43:30 UTC
* 2-node cluster

Version of pacemaker:

> [root@virt-261 ~]# rpm -q pacemaker
> pacemaker-2.1.5-2.el9.x86_64
 
Edited /usr/lib/ocf/resource.d/pacemaker/Dummy --> added "crm_node -l >/dev/null 2>&1" to meta_data():

> [root@virt-261 pacemaker]# vim Dummy 
> [root@virt-261 pacemaker]# grep -A1 '^meta_data()' /usr/lib/ocf/resource.d/pacemaker/Dummy
> meta_data() {
>     crm_node -l >/dev/null 2>&1

Created pcs resource:

> [root@virt-261 log]# date && pcs resource create rsc_dummy ocf:pacemaker:Dummy
> Mon Dec  5 04:33:04 PM CET 2022

Result --> the cluster proceeds normally (/var/log/messages):

> [root@virt-261 log]# grep -A2 'rsc_dummy' ./messages
> Dec  5 16:33:06 virt-261 pacemaker-controld[56224]: notice: Requesting local execution of probe operation for rsc_dummy on virt-261
> Dec  5 16:33:06 virt-261 pacemaker-controld[56224]: notice: Result of probe operation for rsc_dummy on virt-261: not running
> Dec  5 16:33:06 virt-261 pacemaker-controld[56224]: notice: Requesting local execution of start operation for rsc_dummy on virt-261
> Dec  5 16:33:06 virt-261 pacemaker-controld[56224]: notice: Result of start operation for rsc_dummy on virt-261: ok
> Dec  5 16:33:06 virt-261 pacemaker-controld[56224]: notice: Requesting local execution of monitor operation for rsc_dummy on virt-261
> Dec  5 16:33:06 virt-261 pacemaker-controld[56224]: notice: Result of monitor operation for rsc_dummy on virt-261: ok

> [root@virt-261 log]# pcs resource
>   * rsc_dummy	(ocf:pacemaker:Dummy):	 Started virt-261

Comment 14 errata-xmlrpc 2023-05-09 07:18:17 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:2150

