Bug 1374857

Summary: [RFE] Support of 32-node Pacemaker cluster
Product: Red Hat Enterprise Linux 7
Component: corosync
Version: 7.4
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: medium
Target Milestone: rc
Target Release: ---
Keywords: FutureFeature
Whiteboard:
Fixed In Version:
Reporter: Sam Yangsao <syangsao>
Assignee: Christine Caulfield <ccaulfie>
QA Contact: cluster-qe <cluster-qe>
Docs Contact: Steven J. Levine <slevine>
CC: aarnold, aherr, ccaulfie, cfeist, cluster-maint, cluster-qe, cmackows, djansa, dwood, fdanapfe, fdinitto, idevat, jfriesse, kgaillot, mmazoure, mnovacek, omular, royoung, sbradley, slevine, syangsao, syu, tojeline
Doc Type: Enhancement
Doc Text:
.Maximum size of a supported RHEL HA cluster increased from 16 to 32 nodes
With this release, Red Hat supports cluster deployments of up to 32 full cluster nodes.
Clones: 1717098 (view as bug list)
Last Closed: 2019-08-06 13:10:11 UTC
Type: Bug
Bug Depends On: 1298243    
Bug Blocks: 1363902, 1420851, 1717098, 1722048    

Comment 6 Jan Friesse 2016-09-13 07:13:21 UTC
@Chrissie,
Because you have been testing these larger clusters, I'm reassigning this to you.

I also believe it would be a good start for QE to try running the test suite on a 32-node cluster and report the results.

Comment 16 Jan Friesse 2017-01-16 12:19:41 UTC
There was a recent discussion on the upstream list: http://lists.clusterlabs.org/pipermail/users/2017-January/004764.html. It looks like corosync works just fine up to ~70 nodes; beyond that, the receive buffer overfills with join messages.

So 32 nodes should be doable without changing corosync code/defaults.
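
For context, the join-message traffic mentioned above is governed by the totem timers in corosync.conf. The snippet below is a sketch only, not a change made for this bug (none was needed); the parameter names come from corosync.conf(5), but the values, including the non-default send_join, are illustrative assumptions for a large ring:

totem {
    version: 2
    # token loss and consensus timeouts (illustrative values)
    token: 1000
    consensus: 1200
    # join: how long to wait for join messages when forming a membership
    join: 50
    # send_join: upper bound of a random delay before sending a join message;
    # a non-zero value spreads join traffic so receive buffers are not
    # flooded on large rings (example value, not a tested recommendation)
    send_join: 80
}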

Comment 39 Jan Friesse 2019-03-22 08:02:49 UTC
As tested and confirmed by Chrissie and Chris Mackowski, corosync works just fine with 32 nodes as it is. No patch is provided, and this bug is being used as "test only".

Comment 40 Michal Mazourek 2019-06-12 13:10:24 UTC
A 32-node Pacemaker cluster was created without any problems.
The generatejob2 command was used to create the cluster:
# /usr/local/bin/generatejob2.sh --nodes 32 -v 7 --beaker-reserve 1 --disks 1 --ip 1 setup --submit

Snippet from the TESTOUT.log:
...
[2019-06-12 14:33:55.770890] [setup] corosync + pacemaker configure on virt-051, virt-052, virt-053, virt-054, virt-055, virt-056, virt-057, virt-058, virt-059, virt-060, virt-061, virt-062, virt-063, virt-064, virt-065, virt-066, virt-067, virt-074, virt-077, virt-078, virt-079, virt-082, virt-083, virt-084, virt-085, virt-086, virt-087, virt-088, virt-089, virt-090, virt-091, virt-092
...
[2019-06-12 14:42:11.487585] [setup]  success
[2019-06-12 14:42:11.487744] [setup] Waiting for clvm lockspace on all nodes...
[2019-06-12 14:42:17.061511] [setup] Stopping and disabling lvmetad...
[2019-06-12 14:42:19.556535] <pass name="setup" id="setup" pid="19644" time="Wed Jun 12 14:42:19 2019 +0200" type="cmd" duration="521" />
[2019-06-12 14:42:19.556664] ------------------- Summary ---------------------
[2019-06-12 14:42:19.556797] Testcase                                 Result    
[2019-06-12 14:42:19.556884] --------                                 ------    
[2019-06-12 14:42:19.556968] generic_setup                            PASS      
[2019-06-12 14:42:19.557051] setup                                    PASS      
[2019-06-12 14:42:19.557131] =================================================
[2019-06-12 14:42:19.557175] Total Tests Run: 2
[2019-06-12 14:42:19.557220] Total PASS:      2
[2019-06-12 14:42:19.557264] Total FAIL:      0
[2019-06-12 14:42:19.557408] Total TIMEOUT:   0
[2019-06-12 14:42:19.557457] Total KILLED:    0
[2019-06-12 14:42:19.557503] Total STOPPED:   0

Verified for corosync-2.4.3-6.el7
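
For reference, a hand-built equivalent of this setup outside the internal generatejob2 tooling would use pcs. The commands below are a sketch assuming RHEL 7 (pcs 0.9) syntax; the node names, cluster name, and hacluster password are placeholders, not the hosts used in the run above.

On every node:

# yum install -y pcs pacemaker fence-agents-all
# passwd hacluster
# systemctl start pcsd
# systemctl enable pcsd

From a single node:

# pcs cluster auth node01 node02 node03 ... node32 -u hacluster
# pcs cluster setup --name cluster32 node01 node02 node03 ... node32
# pcs cluster start --all
# pcs cluster enable --all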

Comment 43 michal novacek 2019-07-10 17:11:03 UTC

The following have been tested to work:

 - create a cluster with 32 nodes and separate fencing

 - create fifty separate Apache resources, move all of them to a different node, disable them, and remove them

 - recovery: kill pacemaker on fifteen nodes and watch the cluster recover

 - recovery: halt fifteen nodes, watch pacemaker fence them, then wait for them to come back
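
The exact test-suite steps are not shown here; a rough manual equivalent with pcs, using hypothetical resource and node names, might look like this.

Create one of the fifty Apache resources (assumes httpd is installed and its configuration is in place):

# pcs resource create webserver01 ocf:heartbeat:apache configfile=/etc/httpd/conf/httpd.conf op monitor interval=30s

Move it to another node, then disable and remove it:

# pcs resource move webserver01 virt-052
# pcs resource disable webserver01
# pcs resource delete webserver01

Kill pacemaker on a node chosen for the recovery test, then watch the cluster react:

# killall -9 pacemakerd
# crm_mon -1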

Comment 48 errata-xmlrpc 2019-08-06 13:10:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2245