Bug 1572892 - Corosync is prone to flooding network with JOIN messages
Status: NEW
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: corosync
Version: 7.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assigned To: Jan Friesse
QA Contact: cluster-qe@redhat.com
Docs Contact:
Depends On:
Blocks:
Reported: 2018-04-28 08:35 EDT by Josef Zimek
Modified: 2018-08-03 10:03 EDT
CC List: 3 users

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


External Trackers
Tracker ID: Red Hat Knowledge Base (Solution) 3551671 | Priority: None | Status: None | Summary: None | Last Updated: 2018-08-03 10:03 EDT

Description Josef Zimek 2018-04-28 08:35:20 EDT
Description of problem:

Corosync is prone to flooding the network with JOIN messages, which may occasionally result in corosync failing to join the cluster. The more nodes there are in the cluster, the higher the chance of hitting this problem, especially if corosync starts at the same time on all nodes.
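One knob that corosync.conf(5) documents for exactly this situation is the totem send_join timeout, which spreads JOIN transmissions over a random delay instead of letting every node send at once. A minimal sketch of the relevant part of /etc/corosync/corosync.conf follows; the values are illustrative starting points, not a verified fix for this bug:

    totem {
        version: 2

        # Upper bound, in milliseconds, of a random delay applied before
        # sending a JOIN message. The default is 0 (no delay); the man page
        # recommends a non-zero value for larger rings so the NIC is not
        # overflowed with JOIN messages when a new ring forms.
        send_join: 80

        # How long, in milliseconds, to wait for JOIN messages in the
        # membership protocol (default 50); often raised together with
        # send_join on large clusters.
        join: 100
    }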


For reference, this is how corosync fails to join the cluster in the scenario described above (note that the install sequence stays stuck at 63 while the token retransmit count keeps climbing, and incoming IPC connections are denied because corosync never becomes ready):

Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 66, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 67, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 68, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 69, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 70, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 71, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 72, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 73, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [MAIN  ] Denied connection, corosync is not ready
Mar 20 09:24:51 [localhost] corosync[3455]: [QB    ] Denied connection, is not ready (3455-3459-27)
Mar 20 09:24:51 [localhost] corosync[3455]: [MAIN  ] cs_ipcs_connection_destroyed()



Version-Release number of selected component (if applicable):
corosync-2.4.0-9.el7_4.2.x86_64

How reproducible:
Randomly

Steps to Reproduce:
Due to the random nature of the reproducer there are no clear steps. The issue was reported when a large number of nodes attempt to start the cluster at the same time; `pcs cluster start --all` leads to this situation more often in big clusters. With the adoption of massive Ansible-driven deployments the issue becomes more visible. A sketch of the triggering pattern follows.
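For illustration, this is the kind of simultaneous start that tends to trigger the race; the node names are hypothetical:

    # Start the cluster stack on every configured node at once:
    pcs cluster start --all

    # Roughly equivalent manual variant, starting corosync on many
    # nodes in parallel over SSH:
    for node in node{01..16}; do
        ssh "$node" 'systemctl start corosync' &
    done
    wait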


Actual results:
corosync starts but fails to join the cluster when there is a large number of nodes in the cluster and the corosync instances start at the same time

Expected results:
corosync on all nodes survives the JOIN flood and all of them join the cluster

Additional info:
Comment 3 Josef Zimek 2018-04-28 08:41:01 EDT
This problem also affects future features such as support for more than 16 nodes (the current maximum):

https://bugzilla.redhat.com/show_bug.cgi?id=1374857
Comment 4 Josef Zimek 2018-04-28 08:42:46 EDT
Upstream discussion related to this issue, tested with a larger number of nodes:
https://lists.clusterlabs.org/pipermail/users/2017-January/004764.html
