Bug 2084327 - pcs fails to bulk add nodes to pacemaker cluster
Summary: pcs fails to bulk add nodes to pacemaker cluster
Keywords:
Status: CLOSED DUPLICATE of bug 2111349
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: corosync
Version: 9.1
Hardware: x86_64
OS: Linux
Priority: low
Severity: low
Target Milestone: rc
Target Release: ---
Assignee: Jan Friesse
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-05-11 21:14 UTC by Heinz Mauelshagen
Modified: 2022-07-29 08:48 UTC (History)
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-07-29 08:47:54 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHELPLAN-121779 0 None None None 2022-05-11 21:33:02 UTC

Description Heinz Mauelshagen 2022-05-11 21:14:59 UTC
Description of problem:
Adding a list of nodes to a cluster causes 
"Warning: rhel-9.1-n1: Unable to reload corosync configuration: Unable to reload corosync configuration: Could not reload configuration. Error CS_ERR_TRY_AGAIN"
for member nodes and
"Error: Unable to perform operation on any available node/host, therefore it is not possible to continue
Error: Errors have occurred, therefore pcs is unable to continue"

Version-Release number of selected component (if applicable):
pcs 0.11.1-10.el9, pacemaker 2.1.2-4.el9

How reproducible:
for n in n5 n6 n7 ...; do pcs cluster node add $n --enable --start; done

Steps to Reproduce:
1. create e.g. 3 node cluster
2. prepare additional (virtual) machines to add as nodes
3. run the shell loop shown above under "How reproducible" (see the sketch below)
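
A fuller sketch of steps 1-3 in pcs 0.11 syntax (node names, the cluster name, and the hacluster password are illustrative placeholders, not taken from this report):

  # 1. create the initial 3-node cluster
  pcs host auth rhel-9.1-n1 rhel-9.1-n2 rhel-9.1-n3 -u hacluster -p <password>
  pcs cluster setup testcluster rhel-9.1-n1 rhel-9.1-n2 rhel-9.1-n3 --start --enable

  # 2. authenticate the additional (virtual) machines that will be added as nodes
  pcs host auth rhel-9.1-n4 rhel-9.1-n5 rhel-9.1-n6 rhel-9.1-n7 -u hacluster -p <password>

  # 3. add them in a tight loop (the failing bulk addition)
  for n in rhel-9.1-n4 rhel-9.1-n5 rhel-9.1-n6 rhel-9.1-n7; do pcs cluster node add $n --enable --start; done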

Actual results:
See above and in the attachments.
Also, nodes have to be rebooted to recover.

Expected results:
Bulk node additions succeed.

Additional info:
See attachments.

Comment 2 Heinz Mauelshagen 2022-05-12 12:31:50 UTC
As a result of the node addition failing, and even after fence-rebooting the problematic nodes:

# pcs resource move xfs rhel-9.1-n7
Error: Unable to update cib
Call cib_apply_diff failed (-62): Timer expired

Comment 3 Heinz Mauelshagen 2022-05-12 12:39:57 UTC
Also, nodes hang on reboot.

Find all corosync logs attached.

Comment 4 Heinz Mauelshagen 2022-05-12 12:42:00 UTC
Created attachment 1878914 [details]
corosync logs

Comment 5 Heinz Mauelshagen 2022-05-12 14:55:16 UTC
I assume this to be a race, as

for n in rhel-9.1-n{4..15}; do pcs cluster node add $n --enable --start; sleep 10; done

works fine for adding multiple nodes.
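
As a sketch, assuming the fixed 10-second sleep is only giving the pending corosync reload time to finish, the same loop could instead retry the reload explicitly when it returns CS_ERR_TRY_AGAIN (retry count and delay are arbitrary):

  for n in rhel-9.1-n{4..15}; do
      pcs cluster node add $n --enable --start
      # CS_ERR_TRY_AGAIN means corosync was busy (e.g. a membership change
      # still in progress); retrying the reload a few times usually clears it
      for i in 1 2 3 4 5; do
          pcs cluster reload corosync && break
          sleep 5
      done
  done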

Afterwards, though, "pcs resource move $res $node" fails and triggers fencing until all nodes have been rebooted.

Comment 6 Tomas Jelinek 2022-05-24 08:52:48 UTC
I'm unable to reproduce this issue. The reproducer works perfectly fine for me with no hiccups.

There are a lot of "host: x has no active links", "PMTUD link change for host" and "Token has not been received in" messages in the attached logs. This may be a corosync issue or, more likely, a network issue.

Heinz:
Can you provide corosync.conf and CIB of the three-node cluster, i.e. before running the loop which adds nodes? Also, systemctl and corosync logs for "Unable to reload corosync configuration: Could not reload configuration. Error CS_ERR_TRY_AGAIN" from the node where the reload failed may be helpful.

Comment 12 Tomas Jelinek 2022-06-28 12:51:32 UTC
I created a 3-node cluster running a fence_xvm device, an IP address, and a group consisting of LVM and filesystem resources; I colocated the group with the IP and set resource-stickiness to 1. Then I repeatedly added (and removed) 9 nodes in a loop as shown in comment 0. Even after several attempts, I was not able to trigger the issue.
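
A rough sketch of such a test configuration in pcs syntax (resource names, agent parameters, and addresses are illustrative, not taken from this report):

  pcs stonith create xvm fence_xvm pcmk_host_map="rhel-9.1-n1:n1;rhel-9.1-n2:n2;rhel-9.1-n3:n3"
  pcs resource create ip IPaddr2 ip=192.168.122.200 cidr_netmask=24
  pcs resource create lvm LVM-activate vgname=testvg vg_access_mode=system_id --group grp
  pcs resource create xfs Filesystem device=/dev/testvg/lv directory=/mnt/test fstype=xfs --group grp
  pcs constraint colocation add grp with ip
  pcs resource defaults update resource-stickiness=1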

Because the errors come from corosync, I'm moving this to corosync for further investigation.

Comment 13 Jan Friesse 2022-06-29 08:22:55 UTC
@Heinz:
There are too many "Token has not been received in XXXX ms" messages, so this is probably the reason for the CS_ERR_TRY_AGAIN failure. Also, there is no "Corosync main process was not scheduled ..." message, which means the problem is probably network related.

Is there any rate limiting or packet size limiting in the network? Are you able to reproduce the bug reliably? If so, are you able to reproduce it with fewer nodes (let's say 8/12/...)? If your machine allows it, could you try to test the same scenario with 16 VMs running on one hypervisor machine?
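
As a sketch, a couple of quick checks relevant to these questions (the node name is a placeholder):

  # check for packet-size limiting / MTU problems between cluster nodes:
  # 1472 bytes of ICMP payload plus headers fills a 1500-byte frame, and
  # -M do forbids fragmentation, so failures here point at an MTU issue
  # that knet's PMTUD would also run into
  ping -M do -c 5 -s 1472 rhel-9.1-n2

  # inspect the token timeout values corosync is actually using
  corosync-cmapctl | grep -i token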

Comment 14 Jan Friesse 2022-07-29 08:47:54 UTC
After some time we've got more reports of similar behavior, and I'm pretty sure the behavior you've experienced is not caused by corosync itself but rather by knet; it is the same bug as bug 2111349 and support cases 03227186 and 03263874.

RHEL 9.1 should already include the fixed knet 1.24, so 9.1 should work now. For older RHEL releases, testing packages can be found below:

Scratch build repos:
- RHEL 8.6.0 - http://brew-task-repos.usersys.redhat.com/repos/scratch/jfriesse/kronosnet/1.22/1.el8_6.jf1/kronosnet-1.22-1.el8_6.jf1-scratch.repo
- RHEL 8.7.0 - http://brew-task-repos.usersys.redhat.com/repos/scratch/jfriesse/kronosnet/1.22/1.el8.jf1/kronosnet-1.22-1.el8.jf1-scratch.repo
- RHEL 9.0.0 - http://brew-task-repos.usersys.redhat.com/repos/scratch/jfriesse/kronosnet/1.22/3.el9_0.jf1/kronosnet-1.22-3.el9_0.jf1-scratch.repo

Alternatively (in case the scratch builds disappear for some reason), I've uploaded the packages to https://honzaf.fedorapeople.org/knet-bz2111349/ (this also contains the rhpkg git patches).
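
As a sketch, one way to consume the RHEL 8.6 scratch repo above (the binary package name is an assumption; on RHEL the knet library normally ships as libknet1, which corosync links against):

  curl -o /etc/yum.repos.d/kronosnet-scratch.repo \
      http://brew-task-repos.usersys.redhat.com/repos/scratch/jfriesse/kronosnet/1.22/1.el8_6.jf1/kronosnet-1.22-1.el8_6.jf1-scratch.repo
  dnf upgrade 'libknet*'
  # restart cluster services one node at a time so the updated knet gets loaded
  pcs cluster stop && pcs cluster start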

So I'm closing this bug as a duplicate of bug 2111349.

*** This bug has been marked as a duplicate of bug 2111349 ***

