Bug 2125344

Summary: During a rolling upgrade, monitor operations are not being communicated between nodes as expected.
Product: Red Hat Enterprise Linux 9
Reporter: Ken Gaillot <kgaillot>
Component: pacemaker
Assignee: Ken Gaillot <kgaillot>
Status: CLOSED ERRATA
QA Contact: cluster-qe <cluster-qe>
Severity: urgent
Docs Contact: Steven J. Levine <slevine>
Priority: urgent
Version: 9.1
CC: cfeist, cluster-maint, cluster-qe, cnewsom, jobaker, jrehova, mjuricek, nwahl, sbradley, slevine
Target Milestone: rc
Keywords: Regression, Triaged, ZStream
Target Release: 9.2
Flags: pm-rhel: mirror+
Hardware: All
OS: All
Fixed In Version: pacemaker-2.1.5-1.el9
Doc Type: Bug Fix
Doc Text:
.OCF resource agent metadata actions can now call `crm_node` without causing unexpected fencing

As of RHEL 8.5, OCF resource agent metadata actions blocked the controller, and `crm_node` queries performed controller requests. As a result, if an agent's metadata action called `crm_node`, it blocked the controller for 30 seconds until the action timed out. This could cause other actions to fail and the node to be fenced. With this fix, the controller now performs metadata actions asynchronously. An OCF resource agent metadata action can now call `crm_node` without issue.
Story Points: ---
Clone Of: 2121852
Last Closed: 2023-05-09 07:18:17 UTC
Type: Bug
Target Upstream Version: 2.1.5
Bug Depends On: 2121852    
Bug Blocks: 2128036, 2128037    

Description Ken Gaillot 2022-09-08 16:53:15 UTC
+++ This bug was initially created as a clone of Bug #2121852 +++

Description of problem:

During a rolling upgrade, when the updated node rejoins, the non-updated node logs `Node nodeA did not send monitor result (via controller) within 120000ms (action timeout plus cluster-delay)`. This results in a monitor failure, then a stop failure, and finally fencing.


Aug 26 14:32:35 nodeB pacemaker-controld[50166]: error: Node nodeA did not send monitor result (via controller) within 120000ms (action timeout plus cluster-delay)
Aug 26 14:32:35 nodeB pacemaker-controld[50166]: error: [Action   55]: In-flight resource op lsnr_pruledbprd_monitor_0    on nodeA (priority: 0, waiting: (null))
Aug 26 14:32:35 nodeB pacemaker-controld[50166]: warning: rsc_op 55: lsnr_pruledbprd_monitor_0 on nodeA timed out



It seems that nodeB is not receiving monitor results from nodeA within the expected timeframe, which causes resource failures that lead to fence events. This appears to be a bug. Based on the timing of the log messages, there may be a difference in how monitor operations are scheduled between pacemaker-2.1.0-8.el8 and pacemaker-2.1.2-4.el8_6.2.

At 14:30:35, for example, node B initiates probes of all the resources on node A; that is when the timer starts counting down. But node A runs the probes serially. During the 30-second window below, it is running only ONE of them:

Aug 26 14:30:35 nodeA pacemaker-controld[47747]: notice: Requesting local execution of probe operation for srdf_pruledb_PRD on nodeA
Aug 26 14:31:05 nodeA pacemaker-controld[47747]: error: Failed to retrieve meta-data for ocf:sky:emc_srdf

It fails to get metadata for several of these resources, and each failure takes 30 seconds to time out. So node A never even tries to run the lsnr probe (the resource that ultimately caused the failure message on node B) within the 120 seconds allowed for node B to receive the result. "Action timeout plus cluster-delay" means the 60-second action timeout plus an additional 60-second (by default) grace period for receiving the result after the timeout expires, which adds up to 120 seconds.
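
For reference, this is the arithmetic behind the 120000ms in the log message, plus one way to check whether cluster-delay has been changed from its default (a sketch only; crm_attribute ships with the pacemaker CLI tools, and the query returns a value only if cluster-delay has been set explicitly, otherwise the 60-second default applies):

# 60s action timeout + 60s cluster-delay grace period
#   60000 ms + 60000 ms = 120000 ms
# Check whether cluster-delay has been set explicitly:
crm_attribute --type crm_config --name cluster-delay --query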



Version-Release number of selected component (if applicable):

The node not receiving the monitor result is on pacemaker-2.1.0-8.el8.

The updated node is on pacemaker-2.1.2-4.el8_6.2.

<snip>

--- Additional comment from Ken Gaillot on 2022-09-07 19:40:09 UTC ---

No further information is needed from the customer.

Since upstream Pacemaker release 2.1.0 (RHEL 8.5 and 9.0 beta), crm_node -l gets the node list via the controller, rather than querying Corosync directly.
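
For context, crm_node -l simply lists the cluster nodes known to the controller. A hedged example follows (node names and IDs are made up, and the exact output format can vary between versions):

# Run on any cluster node; on affected versions this requires a reply from
# the controller, which is why it hangs when run inside a metadata action.
crm_node -l
# 1 nodeA member
# 2 nodeB member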

Since metadata calls are executed synchronously in the controller, they block all other activity by the controller. Thus, if the metadata action runs crm_node, crm_node will be unable to connect to the controller, and the metadata action will hang until it times out.

The long-term solution in Bug 1850506 will be better, but for a quicker fix, I'm thinking we can make the metadata call asynchronous without the rest of that solution. The metadata call would still eat into the probe timeout, but since the controller could handle the crm_node connection, the metadata action can proceed and should complete quickly.

--- Additional comment from Ken Gaillot on 2022-09-07 19:48:31 UTC ---

A workaround would be to avoid calling crm_node for the metadata action.
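
For example, a custom agent could keep crm_node out of its meta-data path entirely. The skeleton below is only a sketch of that idea (the agent name, trimmed metadata XML, and placeholder exit code are illustrative, not from this report):

#!/bin/sh
# Hypothetical skeleton of a custom OCF agent illustrating the workaround:
# crm_node is never called from the meta-data action.

meta_data() {
    # Print metadata only and return immediately, so the action cannot
    # block the controller on releases without the fix.
    cat <<'END'
<?xml version="1.0"?>
<resource-agent name="example" version="1.0">
  <actions>
    <action name="meta-data" timeout="5s"/>
    <action name="monitor" timeout="20s" interval="10s"/>
  </actions>
</resource-agent>
END
}

case "$1" in
    meta-data)
        meta_data
        ;;
    *)
        # start/stop/monitor run through the executor, so calling crm_node
        # here does not block the controller.
        crm_node -l >/dev/null 2>&1
        exit 0   # placeholder; a real agent returns the proper OCF exit code
        ;;
esac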

Comment 2 Ken Gaillot 2022-09-08 16:58:49 UTC
QA: Reproducer:

* Configure a cluster.
* Edit or copy /usr/lib/ocf/resource.d/pacemaker/Dummy and add "crm_node -l" to the metadata action.
* Create a resource using Dummy (or the copy).

Actual results: The cluster hangs for 30 seconds, and then the controller logs a metadata action timeout.

Expected results: The cluster proceeds normally.
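
A scripted version of the reproducer above might look like this (a sketch only; the copied agent name DummyMeta and the resource name rsc_meta_test are arbitrary choices, not part of this report):

# Copy the stock Dummy agent and add "crm_node -l" to its metadata action
# (repeat on every cluster node where the resource could be probed).
cp /usr/lib/ocf/resource.d/pacemaker/Dummy /usr/lib/ocf/resource.d/pacemaker/DummyMeta
sed -i 's|^meta_data() {|meta_data() {\n    crm_node -l >/dev/null 2>\&1|' \
    /usr/lib/ocf/resource.d/pacemaker/DummyMeta

# Create a resource using the modified agent and watch the controller logs;
# on an unfixed controller, expect a ~30-second hang and a metadata timeout.
pcs resource create rsc_meta_test ocf:pacemaker:DummyMeta
journalctl -u pacemaker -f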

Comment 5 Ken Gaillot 2022-09-20 19:51:38 UTC
Fixed in upstream main branch as of commit bc852fe3

Comment 11 jrehova 2022-12-05 15:43:30 UTC
* 2-node cluster

Version of pacemaker:

> [root@virt-261 ~]# rpm -q pacemaker
> pacemaker-2.1.5-2.el9.x86_64
 
Edited /usr/lib/ocf/resource.d/pacemaker/Dummy --> add "crm_node -l >/dev/null 2>&1":

> [root@virt-261 pacemaker]# vim Dummy 
> [root@virt-261 pacemaker]# grep -A1 '^meta_data()' /usr/lib/ocf/resource.d/pacemaker/Dummy
> meta_data() {
>     crm_node -l >/dev/null 2>&1

Created pcs resource:

> [root@virt-261 log]# date && pcs resource create rsc_dummy ocf:pacemaker:Dummy
> Mon Dec  5 04:33:04 PM CET 2022

Result --> the cluster proceeds normally (/var/log/messages):

> [root@virt-261 log]# grep -A2 'rsc_dummy' ./messages
> Dec  5 16:33:06 virt-261 pacemaker-controld[56224]: notice: Requesting local execution of probe operation for rsc_dummy on virt-261
> Dec  5 16:33:06 virt-261 pacemaker-controld[56224]: notice: Result of probe operation for rsc_dummy on virt-261: not running
> Dec  5 16:33:06 virt-261 pacemaker-controld[56224]: notice: Requesting local execution of start operation for rsc_dummy on virt-261
> Dec  5 16:33:06 virt-261 pacemaker-controld[56224]: notice: Result of start operation for rsc_dummy on virt-261: ok
> Dec  5 16:33:06 virt-261 pacemaker-controld[56224]: notice: Requesting local execution of monitor operation for rsc_dummy on virt-261
> Dec  5 16:33:06 virt-261 pacemaker-controld[56224]: notice: Result of monitor operation for rsc_dummy on virt-261: ok

> [root@virt-261 log]# pcs resource
>   * rsc_dummy	(ocf:pacemaker:Dummy):	 Started virt-261

Comment 14 errata-xmlrpc 2023-05-09 07:18:17 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:2150