Bug 1572892 - Corosync is prone to flooding network with JOIN messages
Summary: Corosync is prone to flooding network with JOIN messages
Keywords:
Status: CLOSED DUPLICATE of bug 1618775
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: corosync
Version: 7.6
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: rc
: ---
Assignee: Jan Friesse
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-04-28 12:35 UTC by Josef Zimek
Modified: 2019-02-18 12:07 UTC (History)
3 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-02-18 12:07:08 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1572886 0 high CLOSED add delay to `pcs cluster start --all` to avoid corosync JOIN flood 2021-02-22 00:41:40 UTC
Red Hat Knowledge Base (Solution) 3551671 0 None None None 2018-08-03 14:03:57 UTC

Internal Links: 1572886

Description Josef Zimek 2018-04-28 12:35:20 UTC
Description of problem:

Corosync is prone to flooding network with JOIN messages which may occasionally result in corosync to not join the cluster. More nodes are in cluster the chance of hitting this problem increases - especially if corosync starts at the same time on all nodes. 


For informational purposes this is how corosync fails to join the cluster on above described scenario:

Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 66, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 67, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 68, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 69, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 70, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 71, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 72, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 73, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [MAIN  ] Denied connection, corosync is not ready
Mar 20 09:24:51 [localhost] corosync[3455]: [QB    ] Denied connection, is not ready (3455-3459-27)
Mar 20 09:24:51 [localhost] corosync[3455]: [MAIN  ] cs_ipcs_connection_destroyed()



Version-Release number of selected component (if applicable):
corosync-2.4.0-9.el7_4.2.x86_64

How reproducible:
Randomly

Steps to Reproduce:
Due to random nature of reproducer there are no clear steps. The fact is that issue was reported when high amount of nodes attempts to start cluster at the same time - `pcs cluster start --all` leads to this situation more often in big clusters. With adoption of ansible massive deployments this issue becomes more obvious.


Actual results:
corosync starts but fails to join the cluster if there is high amount of nodes in cluster and corosyncs start at the same time

Expected results:
corosync on all nodes survive JOIN flood and all of them join cluster

Additional info:

Comment 3 Josef Zimek 2018-04-28 12:41:01 UTC
This problem also affects future features such as support of more than 16 nodes (current max):

https://bugzilla.redhat.com/show_bug.cgi?id=1374857

Comment 4 Josef Zimek 2018-04-28 12:42:46 UTC
Upstream discussion related to this issue tested with higher amount of nodes:
https://lists.clusterlabs.org/pipermail/users/2017-January/004764.html

Comment 6 Jan Friesse 2019-02-18 12:07:08 UTC
Closing this bug as a duplicate of bug 1618775. Adding send_join should help a lot. Corosync behaviour in RHEL 8 should be even better because join/leave list is much smaller (removed ip addresses).

*** This bug has been marked as a duplicate of bug 1618775 ***


Note You need to log in before you can comment on or make changes to this bug.