Red Hat Bugzilla – Bug 1572892
Corosync is prone to flooding network with JOIN messages
Last modified: 2018-08-03 10:03:58 EDT
Description of problem:
Corosync is prone to flooding the network with JOIN messages, which may occasionally result in corosync failing to join the cluster. The more nodes there are in the cluster, the higher the chance of hitting this problem - especially if corosync starts at the same time on all nodes.

For informational purposes, this is how corosync fails to join the cluster in the scenario described above:

Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 66, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 67, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 68, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 69, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 70, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 71, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 72, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 73, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [MAIN ] Denied connection, corosync is not ready
Mar 20 09:24:51 [localhost] corosync[3455]: [QB ] Denied connection, is not ready (3455-3459-27)
Mar 20 09:24:51 [localhost] corosync[3455]: [MAIN ] cs_ipcs_connection_destroyed()

Version-Release number of selected component (if applicable):
corosync-2.4.0-9.el7_4.2.x86_64

How reproducible:
Randomly

Steps to Reproduce:
Due to the random nature of the reproducer, there are no clear steps. The issue was reported when a high number of nodes attempt to start the cluster at the same time - `pcs cluster start --all` leads to this situation more often in big clusters. With the adoption of Ansible for massive deployments, this issue becomes more obvious.

Actual results:
Corosync starts but fails to join the cluster when there is a high number of nodes in the cluster and the corosync instances start at the same time.

Expected results:
Corosync on all nodes survives the JOIN flood and all nodes join the cluster.

Additional info:
This problem also affects future features such as support for more than 16 nodes (the current maximum): https://bugzilla.redhat.com/show_bug.cgi?id=1374857
Upstream discussion related to this issue, tested with a higher number of nodes: https://lists.clusterlabs.org/pipermail/users/2017-January/004764.html
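A possible mitigation (a sketch, not a verified fix for this bug) is the totem send_join option documented in corosync.conf(5): each node waits a random interval between 0 and send_join milliseconds before sending a JOIN message, which spreads out the burst when many nodes start at once. The value below is the man page's suggestion for large rings, not a tuning tested against this report:

    totem {
        version: 2
        # Wait a random 0..send_join ms before sending each JOIN
        # message to avoid overflowing the NIC when a new ring forms.
        # 80 ms is the corosync.conf(5) suggestion for large rings;
        # other totem timers may need adjusting alongside it.
        send_join: 80
    }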
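As an operational workaround sketch (node names and the delay are illustrative, not a validated procedure), the simultaneous start produced by `pcs cluster start --all` can be replaced with a staggered per-node start:

    # Start cluster nodes with a short random stagger instead of
    # `pcs cluster start --all`, reducing the simultaneous JOIN burst.
    for node in node1 node2 node3; do
        pcs cluster start "$node"
        sleep $((RANDOM % 3))   # 0-2 s pause before the next node
    done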