Bug 1374857

Summary: [RFE] Support of 32-node Pacemaker cluster
Product: Red Hat Enterprise Linux 7
Component: corosync
Version: 7.4
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: medium
Target Milestone: rc
Target Release: ---
Keywords: FutureFeature
Whiteboard:
Fixed In Version:
Reporter: Sam Yangsao <syangsao>
Assignee: Christine Caulfield <ccaulfie>
QA Contact: cluster-qe <cluster-qe>
Docs Contact: Steven J. Levine <slevine>
CC: aarnold, aherr, ccaulfie, cfeist, cluster-maint, cluster-qe, cmackows, djansa, dwood, fdanapfe, fdinitto, idevat, jfriesse, kgaillot, mmazoure, mnovacek, omular, royoung, sbradley, slevine, syangsao, syu, tojeline
Doc Type: Enhancement
Doc Text:
.Maximum size of a supported RHEL HA cluster increased from 16 to 32 nodes
With this release, Red Hat supports cluster deployments of up to 32 full cluster nodes.
Clones: 1717098 (view as bug list)
Last Closed: 2019-08-06 13:10:11 UTC
Type: Bug
Bug Depends On: 1298243    
Bug Blocks: 1363902, 1420851, 1717098, 1722048    

Comment 6 Jan Friesse 2016-09-13 07:13:21 UTC
@Chrissie,
Because you have been testing these larger clusters, I'm reassigning this to you.

I also believe it would be a good start for QE to try running the test suite on a 32-node cluster and report the results.

Comment 16 Jan Friesse 2017-01-16 12:19:41 UTC
There was a recent discussion on the upstream list: http://lists.clusterlabs.org/pipermail/users/2017-January/004764.html. It looks like corosync works just fine up to ~70 nodes; beyond that, the receive buffer overfills with join messages.

So 32 nodes should be doable without changing corosync code/defaults.
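
For context, the join-message traffic mentioned above is governed by the totem timers in corosync.conf. The snippet below is a sketch only, not a change made for this bug (none was needed); the parameter names come from corosync.conf(5), but the values, including the non-default send_join, are illustrative assumptions for a large ring:

totem {
    version: 2
    # token loss and consensus timeouts (illustrative values)
    token: 1000
    consensus: 1200
    # join: how long to wait for join messages when forming a membership
    join: 50
    # send_join: upper bound of a random delay before sending a join message;
    # a non-zero value spreads join traffic so receive buffers are not
    # flooded on large rings (example value, not a tested recommendation)
    send_join: 80
}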

Comment 39 Jan Friesse 2019-03-22 08:02:49 UTC
As tested and confirmed by Chrissie and Chris Mackowski, corosync works just fine with 32 nodes as it is. No patch is provided, and this bug is being used as "test only".

Comment 40 Michal Mazourek 2019-06-12 13:10:24 UTC
A 32-node Pacemaker cluster was created without any problems.
The generatejob2 command was used to create the cluster:
# /usr/local/bin/generatejob2.sh --nodes 32 -v 7 --beaker-reserve 1 --disks 1 --ip 1 setup --submit

Snippet from the TESTOUT.log:
...
[2019-06-12 14:33:55.770890] [setup] corosync + pacemaker configure on virt-051, virt-052, virt-053, virt-054, virt-055, virt-056, virt-057, virt-058, virt-059, virt-060, virt-061, virt-062, virt-063, virt-064, virt-065, virt-066, virt-067, virt-074, virt-077, virt-078, virt-079, virt-082, virt-083, virt-084, virt-085, virt-086, virt-087, virt-088, virt-089, virt-090, virt-091, virt-092
...
[2019-06-12 14:42:11.487585] [setup]  success
[2019-06-12 14:42:11.487744] [setup] Waiting for clvm lockspace on all nodes...
[2019-06-12 14:42:17.061511] [setup] Stopping and disabling lvmetad...
[2019-06-12 14:42:19.556535] <pass name="setup" id="setup" pid="19644" time="Wed Jun 12 14:42:19 2019 +0200" type="cmd" duration="521" />
[2019-06-12 14:42:19.556664] ------------------- Summary ---------------------
[2019-06-12 14:42:19.556797] Testcase                                 Result    
[2019-06-12 14:42:19.556884] --------                                 ------    
[2019-06-12 14:42:19.556968] generic_setup                            PASS      
[2019-06-12 14:42:19.557051] setup                                    PASS      
[2019-06-12 14:42:19.557131] =================================================
[2019-06-12 14:42:19.557175] Total Tests Run: 2
[2019-06-12 14:42:19.557220] Total PASS:      2
[2019-06-12 14:42:19.557264] Total FAIL:      0
[2019-06-12 14:42:19.557408] Total TIMEOUT:   0
[2019-06-12 14:42:19.557457] Total KILLED:    0
[2019-06-12 14:42:19.557503] Total STOPPED:   0

Verified for corosync-2.4.3-6.el7
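
For reference, a hand-built equivalent of this setup outside the internal generatejob2 tooling would use pcs. The commands below are a sketch assuming RHEL 7 (pcs 0.9) syntax; the node names, cluster name, and hacluster password are placeholders, not the hosts used in the run above.

On every node:

# yum install -y pcs pacemaker fence-agents-all
# passwd hacluster
# systemctl start pcsd
# systemctl enable pcsd

From a single node:

# pcs cluster auth node01 node02 node03 ... node32 -u hacluster
# pcs cluster setup --name cluster32 node01 node02 node03 ... node32
# pcs cluster start --all
# pcs cluster enable --all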

Comment 43 michal novacek 2019-07-10 17:11:03 UTC

The following have been tested to work:

 - create a cluster with 32 nodes and separate fencing

 - create fifty separate Apache resources, move all of them to a different node, disable them, and remove them

 - recovery: kill pacemaker on fifteen nodes and watch the cluster recover

 - recovery: halt fifteen nodes, watch pacemaker fence them, then wait for them to come back
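
The exact test-suite steps are not shown here; a rough manual equivalent with pcs, using hypothetical resource and node names, might look like this.

Create one of the fifty Apache resources (assumes httpd is installed and its configuration is in place):

# pcs resource create webserver01 ocf:heartbeat:apache configfile=/etc/httpd/conf/httpd.conf op monitor interval=30s

Move it to another node, then disable and remove it:

# pcs resource move webserver01 virt-052
# pcs resource disable webserver01
# pcs resource delete webserver01

Kill pacemaker on a node chosen for the recovery test, then watch the cluster react:

# killall -9 pacemakerd
# crm_mon -1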

Comment 48 errata-xmlrpc 2019-08-06 13:10:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2245