Bug 1572892

Summary:	Corosync is prone to flooding network with JOIN messages
Product:	Red Hat Enterprise Linux 7	Reporter:	Josef Zimek <pzimek>
Component:	corosync	Assignee:	Jan Friesse <jfriesse>
Status:	CLOSED DUPLICATE	QA Contact:	cluster-qe <cluster-qe>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	7.6	CC:	ccaulfie, cluster-maint, sbradley
Target Milestone:	rc
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2019-02-18 12:07:08 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Josef Zimek 2018-04-28 12:35:20 UTC

Description of problem:

Corosync is prone to flooding the network with JOIN messages, which may occasionally prevent corosync from joining the cluster. The more nodes there are in the cluster, the higher the chance of hitting this problem, especially if corosync starts at the same time on all nodes.


For informational purposes, this is how corosync fails to join the cluster in the scenario described above:

Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 66, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 67, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 68, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 69, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 70, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 71, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 72, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 73, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [MAIN  ] Denied connection, corosync is not ready
Mar 20 09:24:51 [localhost] corosync[3455]: [QB    ] Denied connection, is not ready (3455-3459-27)
Mar 20 09:24:51 [localhost] corosync[3455]: [MAIN  ] cs_ipcs_connection_destroyed()



Version-Release number of selected component (if applicable):
corosync-2.4.0-9.el7_4.2.x86_64

How reproducible:
Randomly

Steps to Reproduce:
Because the problem occurs randomly, there are no clear reproduction steps. The issue was reported when a large number of nodes attempted to start the cluster at the same time; `pcs cluster start --all` leads to this situation more often in big clusters. With the adoption of Ansible for massive deployments, this issue becomes more obvious.


Actual results:
corosync starts but fails to join the cluster if the cluster has a large number of nodes and corosync starts on all of them at the same time

Expected results:
corosync on all nodes survives the JOIN flood and all nodes join the cluster

Additional info:

Comment 3 Josef Zimek 2018-04-28 12:41:01 UTC

This problem also affects future features such as support for more than 16 nodes (the current maximum):

https://bugzilla.redhat.com/show_bug.cgi?id=1374857

Comment 4 Josef Zimek 2018-04-28 12:42:46 UTC

Upstream discussion related to this issue, tested with a higher number of nodes:
https://lists.clusterlabs.org/pipermail/users/2017-January/004764.html

Comment 6 Jan Friesse 2019-02-18 12:07:08 UTC

Closing this bug as a duplicate of bug 1618775. Adding send_join should help a lot. Corosync behaviour in RHEL 8 should be even better because the join/leave list is much smaller (IP addresses were removed from it).

*** This bug has been marked as a duplicate of bug 1618775 ***
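
To illustrate why send_join is expected to help, here is a rough toy model, not corosync code. It assumes, per my reading of corosync.conf(5), that send_join is an upper bound in milliseconds on a random wait before a node sends its join message. The function name, the 1 ms bucket size, and the "every node starts at t=0 and sends exactly one JOIN" simplification are all assumptions made for illustration only.

```python
import random
from collections import Counter


def peak_joins_per_ms(num_nodes, send_join_ms, seed=0):
    """Toy model of a simultaneous cluster start (hypothetical helper).

    Every node starts at t=0 and sends one JOIN after a random delay of
    0..send_join_ms milliseconds (inclusive). Returns the largest number
    of JOINs that fall into the same 1 ms bucket, i.e. the worst burst.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    buckets = Counter(rng.randint(0, send_join_ms) for _ in range(num_nodes))
    return max(buckets.values())


if __name__ == "__main__":
    # Compare the worst burst without and with a random send delay.
    for send_join_ms in (0, 20, 80):
        peak = peak_joins_per_ms(num_nodes=32, send_join_ms=send_join_ms)
        print(f"send_join={send_join_ms} ms: peak JOINs in one ms = {peak}")
```

With send_join at 0 every JOIN in this model lands in the same millisecond, so the burst size equals the node count; any non-zero value spreads the messages over the window. Real corosync behaviour also depends on retransmits and membership changes, which this sketch ignores.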