Bug 2084327
| Summary: | pcs fails to bulk add nodes to pacemaker cluster | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Heinz Mauelshagen <heinzm> |
| Component: | corosync | Assignee: | Jan Friesse <jfriesse> |
| Status: | CLOSED DUPLICATE | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | low | Priority: | low |
| Version: | 9.1 | Target Milestone: | rc |
| Hardware: | x86_64 | OS: | Linux |
| CC: | ccaulfie, cfeist, cluster-maint, idevat, jfriesse, mlisik, mpospisi, nwahl, omular, tojeline | | |
| Last Closed: | 2022-07-29 08:47:54 UTC | Type: | Bug |
Description: Heinz Mauelshagen, 2022-05-11 21:14:59 UTC
As a result of node addition failing, and even after a fence reboot of the problematic nodes:

    # pcs resource move xfs rhel-9.1-n7
    Error: Unable to update cib
    Call cib_apply_diff failed (-62): Timer expired

Also, nodes hang on reboot. Find all corosync logs attached.

Created attachment 1878914 [details]
corosync logs
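As an aside, when a CIB update times out like this, a first sanity check is usually whether cluster membership and the CIB are healthy on the node issuing the command. A minimal sketch using standard pacemaker/corosync tooling (these commands are generic checks, not taken from the report):

    # Does pacemaker consider the cluster healthy, and which nodes are online?
    pcs status --full

    # What does corosync itself think about membership and quorum?
    corosync-quorumtool -s

    # Can the CIB be read at all right now?
    cibadmin --query > /dev/null && echo "CIB reachable"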
I assume this to be a race, as

    for n in rhel-9.1-n{4..15}; do pcs cluster node add $n --enable --start; sleep 10; done

works fine for the multiple node addition. Afterwards, though, "pcs resource move $res $node" fails, triggering fencing, until all nodes have been rebooted.

I'm unable to reproduce this issue. The reproducer works perfectly fine for me with no hiccups. There are a lot of "host: x has no active links", "PMTUD link change for host" and "Token has not been received in" messages in the attached logs. This may be a corosync issue or, more likely, a network issue. Heinz: can you provide the corosync.conf and the CIB of the three-node cluster, i.e. from before running the loop that adds nodes? Also, systemctl and corosync logs for "Unable to reload corosync configuration: Could not reload configuration. Error CS_ERR_TRY_AGAIN" from the node where the reload failed may be helpful.

I created a 3-node cluster with a fence_xvm device, an IP address, and a group consisting of LVM and filesystem resources; I colocated the group with the IP and set resource-stickiness to 1. Then I repeatedly added (and removed) 9 nodes in a loop as shown in comment 0. Even after several attempts, I was not able to trigger the issue. Because the errors come from corosync, I'm moving this to corosync for further investigation.

@Heinz: There are too many "Token has not been received in XXXX ms" messages, so this is probably the reason for the CS_ERR_TRY_AGAIN failure. Also, there is no "Corosync main process was not scheduled ...", which means the problem is probably network related. Is there any rate limiting or packet-size limiting in the network? Are you able to reproduce the bug reliably? If so, are you able to reproduce it with fewer nodes (let's say 8/12/...)? If your machine allows it, could you try to test the same scenario with 16 VMs running on one hypervisor machine?

After some time we've got more reports of similar behavior, and I'm pretty sure the behavior you've experienced is not caused by corosync itself but rather by knet; it is the same bug as bug 2111349 and support cases 03227186 and 03263874. RHEL 9.1 should already include the fixed knet 1.24, so 9.1 should work now. For older RHELs, testing packages can be found in the scratch build repos:

- RHEL 8.6.0: http://brew-task-repos.usersys.redhat.com/repos/scratch/jfriesse/kronosnet/1.22/1.el8_6.jf1/kronosnet-1.22-1.el8_6.jf1-scratch.repo
- RHEL 8.7.0: http://brew-task-repos.usersys.redhat.com/repos/scratch/jfriesse/kronosnet/1.22/1.el8.jf1/kronosnet-1.22-1.el8.jf1-scratch.repo
- RHEL 9.0.0: http://brew-task-repos.usersys.redhat.com/repos/scratch/jfriesse/kronosnet/1.22/3.el9_0.jf1/kronosnet-1.22-3.el9_0.jf1-scratch.repo

Alternatively (in case the scratch builds disappear for some reason), I've uploaded the packages to https://honzaf.fedorapeople.org/knet-bz2111349/ (which also contains the rhpkg git patches).

So I'm closing this bug as a duplicate of bug 2111349.

*** This bug has been marked as a duplicate of bug 2111349 ***
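For readers trying to reproduce this, a rough sketch of how the test cluster described above might be assembled with pcs. All resource names, the VG/device paths, and the IP address are illustrative assumptions, not values from the report:

    # Hypothetical 3-node test bed approximating the reproducer: fencing,
    # a floating IP, and a group of LVM + XFS filesystem resources.
    pcs stonith create fence-xvm fence_xvm pcmk_host_list="rhel-9.1-n1 rhel-9.1-n2 rhel-9.1-n3"
    pcs resource create ha-ip IPaddr2 ip=192.168.122.100 cidr_netmask=24
    pcs resource create ha-lvm LVM-activate vgname=vg_ha vg_access_mode=system_id --group ha-grp
    pcs resource create xfs Filesystem device=/dev/vg_ha/lv_ha directory=/mnt/ha fstype=xfs --group ha-grp

    # Keep the group on the same node as the IP, with mild stickiness.
    pcs constraint colocation add ha-grp with ha-ip
    pcs resource defaults update resource-stickiness=1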
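On the network questions above (missing links, PMTUD churn, token loss), a couple of generic checks can help narrow things down. The node name and packet size below are assumptions based on a standard 1500-byte MTU:

    # Verify that near-MTU packets pass unfragmented between nodes;
    # knet PMTUD noise often points at a path-MTU or rate-limiting problem.
    ping -M do -s 1472 -c 5 rhel-9.1-n4   # 1472 payload + 28 header = 1500 bytes

    # Check the token timeout corosync is actually running with.
    corosync-cmapctl runtime.config.totem.token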
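For anyone who wants to try the fixed knet builds on an older release, enabling one of the scratch repos above would look roughly like the following sketch (the repo URLs are Red Hat internal, and the exact binary package glob, e.g. libknet*, may vary per release):

    # Drop the scratch-build repo file in place (RHEL 8.6 variant shown),
    # then pull in the rebuilt knet packages.
    curl -o /etc/yum.repos.d/kronosnet-scratch.repo \
      http://brew-task-repos.usersys.redhat.com/repos/scratch/jfriesse/kronosnet/1.22/1.el8_6.jf1/kronosnet-1.22-1.el8_6.jf1-scratch.repo
    dnf update 'libknet*'

    # Restart the cluster stack so corosync picks up the updated library.
    pcs cluster stop --all && pcs cluster start --all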