Bug 1374857 - [RFE] Support of 32-node Pacemaker cluster
Summary: [RFE] Support of 32-node Pacemaker cluster
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: corosync
Version: 7.4
Hardware: x86_64
OS: Linux
Target Milestone: rc
Assignee: Christine Caulfield
QA Contact: cluster-qe@redhat.com
Docs Contact: Steven J. Levine
Depends On: 1298243
Blocks: 1420851 1363902 1717098 1722048
Reported: 2016-09-09 21:07 UTC by Sam Yangsao
Modified: 2020-09-21 07:39 UTC
CC List: 23 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
.Maximum size of a supported RHEL HA cluster increased from 16 to 32 nodes
With this release, Red Hat supports cluster deployments of up to 32 full cluster nodes.
Clone Of:
: 1717098
Last Closed: 2019-08-06 13:10:11 UTC
Target Upstream Version:

Attachments

System ID Priority Status Summary Last Updated
Red Hat Knowledge Base (Article) 3069031 None None None 2018-08-03 13:48:14 UTC
Red Hat Product Errata RHBA-2019:2245 None None None 2019-08-06 13:10:24 UTC

Comment 6 Jan Friesse 2016-09-13 07:13:21 UTC
Because you were testing these larger clusters, I'm reassigning this to you.

I also believe it may be a good start for QE to try running the test suite on a 32-node cluster and report the results.

Comment 16 Jan Friesse 2017-01-16 12:19:41 UTC
Quite recently there was a discussion on the upstream list: http://lists.clusterlabs.org/pipermail/users/2017-January/004764.html. It looks like corosync works just fine up to ~70 nodes; beyond that, the receive buffer overfills with join messages.

So 32 nodes should be doable without changing corosync code/defaults.
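For reference, a 32-node cluster with default corosync settings would simply be declared in the nodelist of corosync.conf. The sketch below is illustrative only: the cluster name, node names, and the udpu transport are assumptions (pcs on RHEL 7 generates a similar layout), not details taken from this bug.

```
totem {
    version: 2
    cluster_name: bigcluster    # hypothetical name
    transport: udpu             # unicast UDP; defaults otherwise unchanged
}

nodelist {
    node {
        ring0_addr: node-01     # one node {} block per member
        nodeid: 1
    }
    node {
        ring0_addr: node-02
        nodeid: 2
    }
    # ... repeat through nodeid: 32 for a 32-node cluster
}
```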

Comment 39 Jan Friesse 2019-03-22 08:02:49 UTC
As tested and confirmed by Chrissie and Chris Mackowski, corosync seems to work just fine with 32 nodes as it is, so no patch is provided and this bug is used as "test only".

Comment 40 Michal Mazourek 2019-06-12 13:10:24 UTC
A 32-node Pacemaker cluster was created without any problems.
Used the generatejob2 command to create the cluster:
# /usr/local/bin/generatejob2.sh --nodes 32 -v 7 --beaker-reserve 1 --disks 1 --ip 1 setup --submit

Snippet from the TESTOUT.log:
[2019-06-12 14:33:55.770890] [setup] corosync + pacemaker configure on virt-051, virt-052, virt-053, virt-054, virt-055, virt-056, virt-057, virt-058, virt-059, virt-060, virt-061, virt-062, virt-063, virt-064, virt-065, virt-066, virt-067, virt-074, virt-077, virt-078, virt-079, virt-082, virt-083, virt-084, virt-085, virt-086, virt-087, virt-088, virt-089, virt-090, virt-091, virt-092
[2019-06-12 14:42:11.487585] [setup]  success
[2019-06-12 14:42:11.487744] [setup] Waiting for clvm lockspace on all nodes...
[2019-06-12 14:42:17.061511] [setup] Stopping and disabling lvmetad...
[2019-06-12 14:42:19.556535] <pass name="setup" id="setup" pid="19644" time="Wed Jun 12 14:42:19 2019 +0200" type="cmd" duration="521" />
[2019-06-12 14:42:19.556664] ------------------- Summary ---------------------
[2019-06-12 14:42:19.556797] Testcase                                 Result    
[2019-06-12 14:42:19.556884] --------                                 ------    
[2019-06-12 14:42:19.556968] generic_setup                            PASS      
[2019-06-12 14:42:19.557051] setup                                    PASS      
[2019-06-12 14:42:19.557131] =================================================
[2019-06-12 14:42:19.557175] Total Tests Run: 2
[2019-06-12 14:42:19.557220] Total PASS:      2
[2019-06-12 14:42:19.557264] Total FAIL:      0
[2019-06-12 14:42:19.557408] Total TIMEOUT:   0
[2019-06-12 14:42:19.557457] Total KILLED:    0
[2019-06-12 14:42:19.557503] Total STOPPED:   0

Verified for corosync-2.4.3-6.el7

Comment 43 michal novacek 2019-07-10 17:11:03 UTC

The following have been tested to work:

 - create cluster with 32 nodes and separate fencing 

 - create fifty separate Apache resources, move all of them to different node, disable them, remove them

 - recovery: kill pacemaker on fifteen nodes and watch cluster recovery

 - recovery: halt fifteen nodes and watch pacemaker fence them, then wait for them to come back
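The resource-churn scenario above (create fifty Apache resources, move, disable, remove them) can be sketched with pcs. This is a hypothetical dry-run sketch, not the exact test code: the resource names web1..web50 are invented, and by default the script only echoes the pcs commands; set PCS=pcs on a real cluster node to execute them.

```shell
# Dry-run by default: PCS expands to "echo pcs"; set PCS=pcs to run for real.
PCS="${PCS:-echo pcs}"

cycle_apache_resources() {
    i=1
    while [ "$i" -le 50 ]; do
        $PCS resource create "web${i}" ocf:heartbeat:apache  # create
        $PCS resource move "web${i}"                         # move to another node
        $PCS resource disable "web${i}"                      # disable
        $PCS resource delete "web${i}"                       # remove
        i=$((i + 1))
    done
}

cycle_apache_resources
```

Leaving the destination off `pcs resource move` lets the cluster pick a different node, which matches the "move all of them to different node" step above.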

Comment 48 errata-xmlrpc 2019-08-06 13:10:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

