Bug 1572892

Summary: Corosync is prone to flooding network with JOIN messages
Product: Red Hat Enterprise Linux 7
Component: corosync
Version: 7.6
Reporter: Josef Zimek <pzimek>
Assignee: Jan Friesse <jfriesse>
QA Contact: cluster-qe <cluster-qe>
CC: ccaulfie, cluster-maint, sbradley
Status: CLOSED DUPLICATE
Severity: unspecified
Priority: unspecified
Target Milestone: rc
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2019-02-18 12:07:08 UTC

Description Josef Zimek 2018-04-28 12:35:20 UTC
Description of problem:

Corosync is prone to flooding the network with JOIN messages, which may occasionally result in corosync failing to join the cluster. The more nodes there are in the cluster, the higher the chance of hitting this problem - especially if corosync starts at the same time on all nodes.


For informational purposes, this is how corosync fails to join the cluster in the scenario described above:

Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 66, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 67, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 68, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 69, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 70, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 71, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 72, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 73, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [MAIN  ] Denied connection, corosync is not ready
Mar 20 09:24:51 [localhost] corosync[3455]: [QB    ] Denied connection, is not ready (3455-3459-27)
Mar 20 09:24:51 [localhost] corosync[3455]: [MAIN  ] cs_ipcs_connection_destroyed()



Version-Release number of selected component (if applicable):
corosync-2.4.0-9.el7_4.2.x86_64

How reproducible:
Randomly

Steps to Reproduce:
Due to the random nature of the reproducer there are no clear steps. The issue was reported when a large number of nodes attempt to start the cluster at the same time - `pcs cluster start --all` leads to this situation more often in big clusters. With the adoption of Ansible for mass deployments this issue becomes more visible; see the sketch below.
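For illustration only - a minimal sketch of the kind of simultaneous start that tends to trigger the flood. The node names and the use of pdsh are assumptions for the example, not part of the original report:

  # Start the cluster on every node at once from a single node:
  pcs cluster start --all

  # Roughly equivalent mass start that an Ansible/pdsh driven deployment performs,
  # hitting all nodes within the same second (node names are hypothetical):
  pdsh -w node[01-32] 'systemctl start corosync'

The common factor is that all corosync instances begin membership formation at nearly the same instant, so every node sends JOIN messages into an already saturated exchange.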


Actual results:
corosync starts but fails to join the cluster when there is a large number of nodes in the cluster and the corosync instances start at the same time

Expected results:
corosync on all nodes survives the JOIN flood and all of them join the cluster

Additional info:

Comment 3 Josef Zimek 2018-04-28 12:41:01 UTC
This problem also affects future features such as support for more than 16 nodes (the current maximum):

https://bugzilla.redhat.com/show_bug.cgi?id=1374857

Comment 4 Josef Zimek 2018-04-28 12:42:46 UTC
Upstream discussion related to this issue, tested with a higher number of nodes:
https://lists.clusterlabs.org/pipermail/users/2017-January/004764.html

Comment 6 Jan Friesse 2019-02-18 12:07:08 UTC
Closing this bug as a duplicate of bug 1618775. Setting send_join should help a lot. Corosync behaviour in RHEL 8 should be even better because the join/leave list is much smaller (IP addresses were removed).
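For reference, send_join is a totem option in corosync.conf: it sets an upper bound, in milliseconds, for the random delay each node waits before sending a JOIN message, which spreads the JOINs out instead of having every node send them at the same instant. The value and cluster name below are only an illustrative sketch for a large cluster, not a tested recommendation:

  totem {
      version: 2
      cluster_name: mycluster   # hypothetical name
      # Delay each JOIN by a random interval between 0 and send_join milliseconds
      # so that many nodes starting at once do not flood the network with JOINs.
      send_join: 80
  }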

*** This bug has been marked as a duplicate of bug 1618775 ***