Bug 2084327

Summary: pcs fails to bulk add nodes to pacemaker cluster
Product: Red Hat Enterprise Linux 9
Component: corosync
Version: 9.1
Status: CLOSED DUPLICATE
Severity: low
Priority: low
Hardware: x86_64
OS: Linux
Reporter: Heinz Mauelshagen <heinzm>
Assignee: Jan Friesse <jfriesse>
QA Contact: cluster-qe <cluster-qe>
CC: ccaulfie, cfeist, cluster-maint, idevat, jfriesse, mlisik, mpospisi, nwahl, omular, tojeline
Target Milestone: rc
Type: Bug
Last Closed: 2022-07-29 08:47:54 UTC

Description Heinz Mauelshagen 2022-05-11 21:14:59 UTC
Description of problem:
Adding a list of nodes to a cluster causes
"Warning: rhel-9.1-n1: Unable to reload corosync configuration: Unable to reload corosync configuration: Could not reload configuration. Error CS_ERR_TRY_AGAIN"
on member nodes, and
"Error: Unable to perform operation on any available node/host, therefore it is not possible to continue
Error: Errors have occurred, therefore pcs is unable to continue"
from the pcs command itself.
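
For reference, the reload that pcs requests can also be triggered by hand with corosync's own tooling, which helps to check whether the reload itself keeps returning CS_ERR_TRY_AGAIN (a minimal sketch; run on any cluster node):

# Ask all corosync instances in the cluster to reload corosync.conf.
corosync-cfgtool -R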

Version-Release number of selected component (if applicable):
pcs 0.11.1-10.el9, pacemaker 2.1.2-4.el9

How reproducible:
for n in n5 n6 n7 ...; do pcs cluster node add $n --enable --start; done

Steps to Reproduce:
1. Create a cluster of e.g. 3 nodes
2. Prepare additional (virtual) machines to add as nodes
3. Run the shell loop above (a fuller sketch follows)
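
Spelled out, the whole reproducer might look like this (a sketch; host names, cluster name, and password are placeholders, and freshly installed RHEL 9 nodes with pcsd running are assumed):

# Authenticate and create the initial 3-node cluster.
pcs host auth rhel-9.1-n1 rhel-9.1-n2 rhel-9.1-n3 -u hacluster -p PASSWORD
pcs cluster setup mycluster rhel-9.1-n1 rhel-9.1-n2 rhel-9.1-n3 --start --enable

# Authenticate the machines to be added, then bulk-add them in a loop.
pcs host auth rhel-9.1-n{4..15} -u hacluster -p PASSWORD
for n in rhel-9.1-n{4..15}; do pcs cluster node add $n --enable --start; done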

Actual results:
See above and in the attachments.
Also, nodes have to be rebooted to recover.

Expected results:
Bulk node additions succeed.

Additional info:
see attachments

Comment 2 Heinz Mauelshagen 2022-05-12 12:31:50 UTC
As a result of the node addition failing, and even after a fence reboot of the problematic nodes:

# pcs resource move xfs rhel-9.1-n7
Error: Unable to update cib
Call cib_apply_diff failed (-62): Timer expired

Comment 3 Heinz Mauelshagen 2022-05-12 12:39:57 UTC
Also, nodes hang on reboot.

Find all corosync logs attached.

Comment 4 Heinz Mauelshagen 2022-05-12 12:42:00 UTC
Created attachment 1878914 [details]
corosync logs

Comment 5 Heinz Mauelshagen 2022-05-12 14:55:16 UTC
I assume this is a race, as

for n in rhel-9.1-n{4..15}; do pcs cluster node add $n --enable --start; sleep 10; done

works fine for adding multiple nodes.

Afterwards, though, "pcs resource move $res $node" fails and triggers fencing until all nodes have been rebooted.

Comment 6 Tomas Jelinek 2022-05-24 08:52:48 UTC
I'm unable to reproduce this issue. The reproducer works perfectly fine for me with no hiccups.

There are a lot of "host: x has no active links", "PMTUD link change for host" and "Token has not been received in" messages in the attached logs. This may be a corosync issue or, more likely, a network issue.

Heinz:
Can you provide the corosync.conf and CIB of the three-node cluster, i.e. from before running the loop which adds nodes? Also, systemd and corosync logs for "Unable to reload corosync configuration: Could not reload configuration. Error CS_ERR_TRY_AGAIN" from the node where the reload failed may be helpful.
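
One way to collect the requested data (a sketch; the output file names are arbitrary, and corosync logging to the systemd journal is assumed):

# On an existing node, before running the add loop:
cp /etc/corosync/corosync.conf corosync.conf.before
pcs cluster cib > cib.before.xml

# On the node where the reload failed, after the failure:
journalctl -u corosync -u pcsd --since "1 hour ago" > reload-failure-journal.txt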

Comment 12 Tomas Jelinek 2022-06-28 12:51:32 UTC
I created a 3-node cluster with a fence_xvm device, an IP address, and a group consisting of LVM and filesystem resources; I colocated the group with the IP and set resource-stickiness to 1. Then I repeatedly added (and removed) 9 nodes in a loop as shown in comment 0. Even after several attempts, I was not able to trigger the issue.
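
Reconstructed as pcs commands, that configuration might look roughly like this (a sketch; resource names, the VG/device paths, the IP address, and the fence_xvm key file are placeholders, not the exact options used in the test):

pcs stonith create fence-xvm fence_xvm key_file=/etc/cluster/fence_xvm.key
pcs resource create ip ocf:heartbeat:IPaddr2 ip=192.168.122.200 cidr_netmask=24
pcs resource create lvm ocf:heartbeat:LVM-activate vgname=testvg vg_access_mode=system_id --group grp
pcs resource create fs ocf:heartbeat:Filesystem device=/dev/testvg/lv directory=/mnt/test fstype=xfs --group grp
pcs constraint colocation add grp with ip INFINITY
pcs resource defaults update resource-stickiness=1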

Because the errors come from corosync, I'm moving this to corosync for further investigation.

Comment 13 Jan Friesse 2022-06-29 08:22:55 UTC
@Heinz:
There are too many "Token has not been received in XXXX ms" messages - so this is probably the reason for the CS_ERR_TRY_AGAIN failure. Also, there is no "Corosync main process was not scheduled ...", which means the problem is probably network related.

Is there any rate limiting or packet size limiting in the network? Are you able to reproduce the bug reliably? If so, are you able to reproduce it with fewer nodes (let's say 8/12/...)? If your machine allows it, could you try to test the same scenario with 16 VMs running on one hypervisor machine?
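
Two quick checks for the questions above (a sketch; the peer host name is a placeholder):

# Path-MTU check between two nodes: 1472 bytes of ICMP payload plus 28
# bytes of headers makes a full 1500-byte frame; failures here suggest
# a packet size limit on the path.
ping -c 3 -M do -s 1472 rhel-9.1-n2

# Show the state of the corosync/knet links as seen from the local node.
corosync-cfgtool -s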

Comment 14 Jan Friesse 2022-07-29 08:47:54 UTC
After some time we've got more reports of similar behavior, and I'm pretty sure the behavior you've experienced is not caused by corosync itself but rather by knet. It is the same bug as bug 2111349 and support cases 03227186 and 03263874.

RHEL 9.1 should already include the fixed knet 1.24, so 9.1 should work now. For older RHELs, testing packages can be found here:

Scratch build repos:
- RHEL 8.6.0 - http://brew-task-repos.usersys.redhat.com/repos/scratch/jfriesse/kronosnet/1.22/1.el8_6.jf1/kronosnet-1.22-1.el8_6.jf1-scratch.repo
- RHEL 8.7.0 - http://brew-task-repos.usersys.redhat.com/repos/scratch/jfriesse/kronosnet/1.22/1.el8.jf1/kronosnet-1.22-1.el8.jf1-scratch.repo
- RHEL 9.0.0 - http://brew-task-repos.usersys.redhat.com/repos/scratch/jfriesse/kronosnet/1.22/3.el9_0.jf1/kronosnet-1.22-3.el9_0.jf1-scratch.repo

Alternatively (in case the scratch builds disappear for some reason), I've uploaded the packages to https://honzaf.fedorapeople.org/knet-bz2111349/ (also contains rhpkg git patches).
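
For anyone testing a scratch build, installation might look like this on RHEL 8.6 (a sketch; the repo URL is the one listed above and needs internal Red Hat network access, and libknet1 as the runtime package name is an assumption):

curl -o /etc/yum.repos.d/kronosnet-scratch.repo \
    http://brew-task-repos.usersys.redhat.com/repos/scratch/jfriesse/kronosnet/1.22/1.el8_6.jf1/kronosnet-1.22-1.el8_6.jf1-scratch.repo
dnf update -y libknet1
# Restart cluster services node by node so corosync picks up the new library.
pcs cluster stop && pcs cluster start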

So closing this bug as duplicate of 2111349.

*** This bug has been marked as a duplicate of bug 2111349 ***