Bug 2125587

Summary: During a rolling upgrade, monitor operations are not being communicated between nodes as expected. [rhel-8.6.0.z]
Product: Red Hat Enterprise Linux 8
Reporter: RHEL Program Management Team <pgm-rhel-tools>
Component: pacemaker
Assignee: Ken Gaillot <kgaillot>
Status: CLOSED ERRATA
QA Contact: cluster-qe <cluster-qe>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 8.6
CC: cfeist, cluster-maint, jobaker, mjuricek, nwahl, sbradley
Target Milestone: rc
Keywords: Regression, Triaged, ZStream
Target Release: ---
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version: pacemaker-2.1.2-4.el8_6.5
Doc Type: Bug Fix
Doc Text:
Cause: OCF resource agent metadata actions block the controller, and crm_node queries now perform controller requests.
Consequence: If an agent's metadata action calls crm_node, the controller is completely blocked for 30 seconds until the action times out, which can cause other actions to fail and the node to be fenced.
Fix: The controller now performs metadata actions asynchronously.
Result: Agent metadata actions can call crm_node without problems.
Story Points: ---
Clone Of: 2121852
Environment:
Last Closed: 2022-12-06 09:54:28 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version: 2.1.5
Embargoed:
Bug Depends On: 2121852
Bug Blocks:    
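
The blocking behaviour described in the Doc Text can be illustrated with a small toy model. This is only a sketch and not Pacemaker code: `fake_crm_node`, `run_controller`, and the shortened `ACTION_TIMEOUT` (standing in for the 30-second timeout) are hypothetical names chosen for illustration. It models a controller whose main loop either waits for a metadata action to finish (synchronous) or keeps serving requests while the action runs (asynchronous).

```python
import queue
import threading
import time
from typing import Tuple

# Hypothetical stand-in for the 30-second metadata action timeout.
ACTION_TIMEOUT = 0.5


def fake_crm_node(requests: queue.Queue, timeout: float) -> bool:
    """Toy crm_node: send a request to the controller and wait for a reply."""
    reply: queue.Queue = queue.Queue()
    requests.put(reply)
    try:
        reply.get(timeout=timeout)
        return True
    except queue.Empty:
        return False  # the controller never answered in time


def run_controller(asynchronous: bool) -> Tuple[bool, float]:
    """Run one metadata action.

    Returns (action succeeded, seconds the main loop was blocked).
    """
    requests: queue.Queue = queue.Queue()
    result = {}

    def metadata_action() -> None:
        # The agent's metadata action calls crm_node, which needs the controller.
        result["ok"] = fake_crm_node(requests, ACTION_TIMEOUT)

    worker = threading.Thread(target=metadata_action)
    start = time.monotonic()
    worker.start()
    if not asynchronous:
        # Synchronous: the main loop waits for the action, so it cannot
        # answer the request that the action itself is waiting on.
        worker.join()
    blocked = time.monotonic() - start

    # The main loop serves controller requests (again).
    while worker.is_alive() or not requests.empty():
        try:
            requests.get(timeout=0.05).put("node list")
        except queue.Empty:
            pass
    worker.join()
    return result["ok"], blocked
```

In this model, `run_controller(asynchronous=False)` stalls for the full timeout and the action fails, while `run_controller(asynchronous=True)` returns almost immediately with a successful action, which mirrors the Cause and Fix entries above.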

Comment 1 Ken Gaillot 2022-09-20 19:51:25 UTC
Fixed in upstream main branch as of commit bc852fe3

Comment 2 Chris Lumens 2022-10-12 14:59:21 UTC
Steps to reproduce are in https://bugzilla.redhat.com/show_bug.cgi?id=2121852#c12. Make sure to modify the Dummy resource agent on the same node that it will execute on (so either use a one-node cluster or, to be safe, modify it on all nodes). Start the cluster and add the Dummy resource. The logs will show that it takes a while before the action times out and the error message is logged, but the resource will still be created. Stop the cluster and start it again with the new resource. Startup will take a long time, and "crm_node -l" will be in the process list.

Update to the new packages.  Start the cluster again and you'll see it start up normally.
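
The process-list check in the steps above could be scripted roughly as follows. This is a minimal sketch, not part of the original reproducer: `find_crm_node_queries` and `current_process_args` are hypothetical helper names, and the script assumes a `ps` that supports `-eo args` (as on typical Linux systems).

```python
import subprocess
from typing import List


def find_crm_node_queries(ps_output: str) -> List[str]:
    """Return the command lines that look like a running 'crm_node -l'."""
    hits = []
    for line in ps_output.splitlines():
        argv = line.split()
        # Match on the executable's base name so a full path also counts.
        if argv and argv[0].rsplit("/", 1)[-1] == "crm_node" and "-l" in argv[1:]:
            hits.append(line.strip())
    return hits


def current_process_args() -> str:
    """Capture the command line of every process on this node."""
    return subprocess.run(
        ["ps", "-eo", "args"], capture_output=True, text=True, check=True
    ).stdout


if __name__ == "__main__":
    stuck = find_crm_node_queries(current_process_args())
    if stuck:
        print("crm_node -l is still running:")
        for cmd in stuck:
            print("  " + cmd)
    else:
        print("no crm_node -l in the process list")
```

Run it on the node while the cluster is starting. With the affected packages a lingering `crm_node -l` would be expected to show up; with the updated packages it should not.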

Comment 14 errata-xmlrpc 2022-12-06 09:54:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:8808