Bug 905296 - Corosync "entering GATHER state from 11" when the cluster has more than 64 nodes
Summary: Corosync "entering GATHER state from 11" when the cluster has more than 64 nodes
Keywords:
Status: CLOSED EOL
Alias: None
Product: Fedora
Classification: Fedora
Component: corosync
Version: 19
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Jan Friesse
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-01-29 05:30 UTC by Shining
Modified: 2015-02-17 14:42 UTC
CC: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-02-17 14:42:57 UTC
Type: Bug
Embargoed:


Attachments
corosync log file (3.59 MB, application/x-gzip), attached 2013-01-30 09:57 UTC by Shining

Description Shining 2013-01-29 05:30:42 UTC
Description of problem:
I have a corosync cluster with 80 nodes, but corosync is only stable when there are fewer than 64 nodes. With more than 64 nodes, corosync falls into an infinite loop of "entering GATHER state from 11".

Version-Release number of selected component (if applicable):
corosync 1.4.5


How reproducible:
The cluster must have more than 64 nodes.


Steps to Reproduce:
1. Start corosync on all nodes: cexec "service corosync start"
2. On one node, execute "service corosync stop" and then "service corosync start"
3. All cluster nodes enter an endless loop of "entering GATHER state from 11"
  
Actual results:
"corosync-cfgtool -s" returns successfully

Expected results:
"corosync-cfgtool -s" returns:
------------------------
Printing ring status.
Local node ID 17803456
Could not get the ring status, the error is: 6
------------------------
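
For reference, corosync-cfgtool prints the raw cs_error_t value it gets back from the cfg library, and error 6 corresponds to CS_ERR_TRY_AGAIN, i.e. the daemon is still busy synchronizing. A minimal sketch of that mapping, assuming the corosync development headers (corosynclib-devel) are installed:
------------------------
/* err6.c - a small sketch only, assuming corosync-cfgtool prints the raw
 * cs_error_t value; in corosync/corotypes.h CS_ERR_TRY_AGAIN is 6, which
 * means the daemon is still (here: permanently) synchronizing.
 * Build: gcc err6.c -o err6
 */
#include <stdio.h>
#include <corosync/corotypes.h>

int main(void)
{
        printf("CS_ERR_TRY_AGAIN = %d\n", (int)CS_ERR_TRY_AGAIN); /* prints 6 */
        return 0;
}
------------------------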


Additional info:

corosync.conf
------------------------
totem {
        version: 2
        secauth: off
        threads: 0

        interface {
                ringnumber: 0
                bindnetaddr: 192.168.15.1
                mcastaddr: 226.94.3.33
                mcastport: 5333
                ttl: 1
        }
}

logging {
        fileline: off
        to_stderr: no
        to_logfile: yes
        to_syslog: no
        logfile: /var/log/corosync.log
        debug: on
        timestamp: on
        logger_subsys {
                subsys: AMF
                debug: off
        }
}

amf {
        mode: disabled
}
------------------------

Comment 1 Shining 2013-01-29 05:40:58 UTC
Besides "entering GATHER state from 11", some other nodes show the following logs:

--------- node73---------
Jan 29 12:59:42 corosync [SYNC  ] Barrier completion status for nodeid 1192208576 = 1. 
Jan 29 12:59:42 corosync [SYNC  ] Barrier completion status for nodeid 1208985792 = 1. 
Jan 29 12:59:42 corosync [SYNC  ] Barrier completion status for nodeid 1225763008 = 1. 
Jan 29 12:59:42 corosync [SYNC  ] Barrier completion status for nodeid 1242540224 = 1. 
Jan 29 12:59:42 corosync [SYNC  ] Barrier completion status for nodeid 1259317440 = 1. 
Jan 29 12:59:42 corosync [SYNC  ] Barrier completion status for nodeid 1276094656 = 1. 
Jan 29 12:59:42 corosync [SYNC  ] Synchronization barrier completed
Jan 29 12:59:42 corosync [SYNC  ] Committing synchronization for (corosync cluster closed process group service v1.01)
Jan 29 12:59:42 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Jan 29 12:59:42 corosync [TOTEM ] waiting_trans_ack changed to 0
--------- node74---------
Jan 29 12:58:27 corosync [SYNC  ] Barrier completion status for nodeid 1192208576 = 1. 
Jan 29 12:58:27 corosync [SYNC  ] Barrier completion status for nodeid 1208985792 = 1. 
Jan 29 12:58:27 corosync [SYNC  ] Barrier completion status for nodeid 1225763008 = 1. 
Jan 29 12:58:27 corosync [SYNC  ] Barrier completion status for nodeid 1242540224 = 1. 
Jan 29 12:58:27 corosync [SYNC  ] Barrier completion status for nodeid 1259317440 = 1. 
Jan 29 12:58:27 corosync [SYNC  ] Barrier completion status for nodeid 1276094656 = 1. 
Jan 29 12:58:27 corosync [SYNC  ] Synchronization barrier completed
Jan 29 12:58:27 corosync [SYNC  ] Committing synchronization for (corosync cluster closed process group service v1.01)
Jan 29 12:58:27 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Jan 29 12:58:27 corosync [TOTEM ] waiting_trans_ack changed to 0
--------- node75---------
Jan 29 12:58:53 corosync [SYNC  ] Barrier completion status for nodeid 1192208576 = 1. 
Jan 29 12:58:53 corosync [SYNC  ] Barrier completion status for nodeid 1208985792 = 1. 
Jan 29 12:58:53 corosync [SYNC  ] Barrier completion status for nodeid 1225763008 = 1. 
Jan 29 12:58:53 corosync [SYNC  ] Barrier completion status for nodeid 1242540224 = 1. 
Jan 29 12:58:53 corosync [SYNC  ] Barrier completion status for nodeid 1259317440 = 1. 
Jan 29 12:58:53 corosync [SYNC  ] Barrier completion status for nodeid 1276094656 = 1. 
Jan 29 12:58:53 corosync [SYNC  ] Synchronization barrier completed
Jan 29 12:58:53 corosync [SYNC  ] Committing synchronization for (corosync cluster closed process group service v1.01)
Jan 29 12:58:53 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Jan 29 12:58:53 corosync [TOTEM ] waiting_trans_ack changed to 0
--------- node76---------
Jan 29 12:58:07 corosync [SYNC  ] Barrier completion status for nodeid 1192208576 = 1. 
Jan 29 12:58:07 corosync [SYNC  ] Barrier completion status for nodeid 1208985792 = 1. 
Jan 29 12:58:07 corosync [SYNC  ] Barrier completion status for nodeid 1225763008 = 1. 
Jan 29 12:58:07 corosync [SYNC  ] Barrier completion status for nodeid 1242540224 = 1. 
Jan 29 12:58:07 corosync [SYNC  ] Barrier completion status for nodeid 1259317440 = 1. 
Jan 29 12:58:07 corosync [SYNC  ] Barrier completion status for nodeid 1276094656 = 1. 
Jan 29 12:58:07 corosync [SYNC  ] Synchronization barrier completed
Jan 29 12:58:07 corosync [SYNC  ] Committing synchronization for (corosync cluster closed process group service v1.01)
Jan 29 12:58:07 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Jan 29 12:58:07 corosync [TOTEM ] waiting_trans_ack changed to 0
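
The nodeids in these SYNC lines become easier to follow when decoded back into addresses: with no explicit nodeid configured, corosync derives the nodeid from the ring-0 IPv4 address, and on a little-endian x86_64 host the decimal in the log is just the four address bytes reinterpreted as an integer (so 1192208576 is 192.168.15.71, and the "Local node ID 17803456" above is 192.168.15.1). A small decoder sketch under those assumptions:
------------------------
/* nodeid2ip.c - sketch that maps the nodeids in the log back to IPs.
 * Assumptions: default nodeid derivation (ring-0 IPv4 address) and a
 * little-endian host, matching the x86_64 nodes in this report.
 * Example: ./nodeid2ip 1192208576 17803456
 *          1192208576 -> 192.168.15.71
 *          17803456 -> 192.168.15.1
 */
#include <stdio.h>
#include <stdlib.h>
#include <arpa/inet.h>

int main(int argc, char **argv)
{
        int i;

        for (i = 1; i < argc; i++) {
                struct in_addr addr;

                /* The log prints the raw 4-byte address as a decimal;
                 * storing it back into s_addr recreates those bytes. */
                addr.s_addr = (in_addr_t)strtoul(argv[i], NULL, 10);
                printf("%s -> %s\n", argv[i], inet_ntoa(addr));
        }
        return 0;
}
------------------------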

Comment 3 Shining 2013-01-30 09:57:39 UTC
Created attachment 690259
corosync log file

Some of the options set in corosync.conf:
-----------------------------------------------
        token: 5000
        token_retransmits_before_loss_const: 20
        join: 1000
        send_join: 200
        consensus: 6000
        vsftype: none
        max_messages: 20
-----------------------------------------------
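
One observation on these timers: corosync.conf(5) says consensus must be at least 1.2 * token, and the values above (token=5000, consensus=6000) sit exactly at that floor. A trivial sketch of the arithmetic, with both values copied from the reporter's configuration:
------------------------
/* timers.c - sketch checking the totem timer relationship from
 * corosync.conf(5): consensus must be at least 1.2 * token.
 * Here 1.2 * 5000 = 6000, so the configured consensus is exactly
 * at the minimum allowed value.
 */
#include <stdio.h>

int main(void)
{
        const unsigned int token = 5000;     /* ms, from the config */
        const unsigned int consensus = 6000; /* ms, from the config */
        const double floor_ms = 1.2 * token;

        printf("minimum consensus: %.0f ms, configured: %u ms (%s)\n",
               floor_ms, consensus,
               consensus >= floor_ms ? "ok" : "too low");
        return 0;
}
------------------------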

Comment 4 Shining 2013-02-01 04:42:46 UTC
All nodes show:
--------- node77---------
Feb 01 12:25:07 corosync [TOTEM ] token retrans flag is 1 my set retrans flag0 retrans queue empty 0 count 0, aru 3fce
Feb 01 12:25:07 corosync [TOTEM ] install seq 0 aru 3fce high seq received 3fce
Feb 01 12:25:08 corosync [TOTEM ] token retrans flag is 1 my set retrans flag0 retrans queue empty 0 count 0, aru 3fce
Feb 01 12:25:08 corosync [TOTEM ] install seq 0 aru 3fce high seq received 3fce
Feb 01 12:25:08 corosync [TOTEM ] token retrans flag is 1 my set retrans flag0 retrans queue empty 0 count 0, aru 3fce
Feb 01 12:25:08 corosync [TOTEM ] install seq 0 aru 3fce high seq received 3fce
Feb 01 12:25:09 corosync [TOTEM ] token retrans flag is 1 my set retrans flag0 retrans queue empty 0 count 0, aru 3fce
Feb 01 12:25:09 corosync [TOTEM ] install seq 0 aru 3fce high seq received 3fce
Feb 01 12:25:10 corosync [TOTEM ] token retrans flag is 1 my set retrans flag0 retrans queue empty 0 count 0, aru 3fce
Feb 01 12:25:10 corosync [TOTEM ] install seq 0 aru 3fce high seq received 3fce
--------- node78---------
Feb 01 12:23:48 corosync [TOTEM ] token retrans flag is 1 my set retrans flag0 retrans queue empty 0 count 0, aru 3fce
Feb 01 12:23:48 corosync [TOTEM ] install seq 0 aru 3fce high seq received 3fce
Feb 01 12:23:49 corosync [TOTEM ] token retrans flag is 1 my set retrans flag0 retrans queue empty 0 count 0, aru 3fce
Feb 01 12:23:49 corosync [TOTEM ] install seq 0 aru 3fce high seq received 3fce
Feb 01 12:23:49 corosync [TOTEM ] token retrans flag is 1 my set retrans flag0 retrans queue empty 0 count 0, aru 3fce
Feb 01 12:23:49 corosync [TOTEM ] install seq 0 aru 3fce high seq received 3fce
Feb 01 12:23:50 corosync [TOTEM ] token retrans flag is 1 my set retrans flag0 retrans queue empty 0 count 0, aru 3fce
Feb 01 12:23:50 corosync [TOTEM ] install seq 0 aru 3fce high seq received 3fce
Feb 01 12:23:51 corosync [TOTEM ] token retrans flag is 1 my set retrans flag0 retrans queue empty 0 count 0, aru 3fce
Feb 01 12:23:51 corosync [TOTEM ] install seq 0 aru 3fce high seq received 3fce
--------- node79---------
Feb 01 12:23:29 corosync [TOTEM ] token retrans flag is 1 my set retrans flag0 retrans queue empty 0 count 0, aru 3fce
Feb 01 12:23:29 corosync [TOTEM ] install seq 0 aru 3fce high seq received 3fce
Feb 01 12:23:29 corosync [TOTEM ] token retrans flag is 1 my set retrans flag0 retrans queue empty 0 count 0, aru 3fce
Feb 01 12:23:29 corosync [TOTEM ] install seq 0 aru 3fce high seq received 3fce
Feb 01 12:23:30 corosync [TOTEM ] token retrans flag is 1 my set retrans flag0 retrans queue empty 0 count 0, aru 3fce
Feb 01 12:23:30 corosync [TOTEM ] install seq 0 aru 3fce high seq received 3fce
Feb 01 12:23:31 corosync [TOTEM ] token retrans flag is 1 my set retrans flag0 retrans queue empty 0 count 0, aru 3fce
Feb 01 12:23:31 corosync [TOTEM ] install seq 0 aru 3fce high seq received 3fce
Feb 01 12:23:32 corosync [TOTEM ] token retrans flag is 1 my set retrans flag0 retrans queue empty 0 count 0, aru 3fce
Feb 01 12:23:32 corosync [TOTEM ] install seq 0 aru 3fce high seq received 3fce
--------- node80---------
Feb 01 12:22:23 corosync [TOTEM ] token retrans flag is 1 my set retrans flag0 retrans queue empty 0 count 0, aru 3fce
Feb 01 12:22:23 corosync [TOTEM ] install seq 0 aru 3fce high seq received 3fce
Feb 01 12:22:24 corosync [TOTEM ] token retrans flag is 1 my set retrans flag0 retrans queue empty 0 count 0, aru 3fce
Feb 01 12:22:24 corosync [TOTEM ] install seq 0 aru 3fce high seq received 3fce
Feb 01 12:22:25 corosync [TOTEM ] token retrans flag is 1 my set retrans flag0 retrans queue empty 0 count 0, aru 3fce
Feb 01 12:22:25 corosync [TOTEM ] install seq 0 aru 3fce high seq received 3fce
Feb 01 12:22:26 corosync [TOTEM ] token retrans flag is 1 my set retrans flag0 retrans queue empty 0 count 0, aru 3fce
Feb 01 12:22:26 corosync [TOTEM ] install seq 0 aru 3fce high seq received 3fce
Feb 01 12:22:27 corosync [TOTEM ] token retrans flag is 1 my set retrans flag0 retrans queue empty 0 count 0, aru 3fce
Feb 01 12:22:27 corosync [TOTEM ] install seq 0 aru 3fce high seq received 3fce

Comment 5 Jan Friesse 2013-02-01 11:24:45 UTC
Corosync has a hardcoded maximum of 64 nodes. It is also worth noting that only about 16 nodes are officially supported, so it is no surprise that 80 does not work. I believe this BZ is not really relevant for RHEL 6, but it may be interesting for upstream, so I am moving it to Fedora.

Comment 6 Shining 2013-02-04 01:46:58 UTC
(In reply to comment #5)
> Corosync has a hardcoded maximum of 64 nodes. It is also worth noting
> that only about 16 nodes are officially supported, so it is no surprise
> that 80 does not work. I believe this BZ is not really relevant for
> RHEL 6, but it may be interesting for upstream, so I am moving it to Fedora.

Could you give me a hint on how to change the hardcoded maximum number of nodes?

Comment 7 Jan Friesse 2013-02-04 09:03:40 UTC
(In reply to comment #6)
> (In reply to comment #5)
> > Corosync has a hardcoded maximum of 64 nodes. It is also worth noting
> > that only about 16 nodes are officially supported, so it is no surprise
> > that 80 does not work. I believe this BZ is not really relevant for
> > RHEL 6, but it may be interesting for upstream, so I am moving it to Fedora.
> 
> Could you give me a hint on how to change the hardcoded maximum number of nodes?

Sadly, I cannot. The one constant that really is hardcoded, PROCESSOR_COUNT_MAX, is 384, so changing it will not help. The other limits are spread around the code. I will be more than happy to accept patches (if you find out the real reason).
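
For readers digging into the source: the fragment below is an illustrative sketch only, not corosync code, of the general pattern such limits take, i.e. membership state kept in arrays sized by a compile-time constant. PROCESSOR_COUNT_MAX (384) is the real constant quoted above; the struct and its fields are invented for this example.
------------------------
/* limit_sketch.c - NOT corosync source; a hypothetical illustration of
 * the kind of compile-time sizing described in comment 7.  The define
 * reuses the value quoted there; struct memb_sketch is made up.
 */
#include <stdint.h>
#include <stdio.h>

#define PROCESSOR_COUNT_MAX 384  /* value quoted in comment 7 */

struct memb_sketch {
        uint32_t nodeid[PROCESSOR_COUNT_MAX];  /* fixed-size member list */
        unsigned int count;
};

int main(void)
{
        struct memb_sketch m = { .count = 0 };

        printf("sketch membership capacity: %d entries, %zu bytes\n",
               PROCESSOR_COUNT_MAX, sizeof(m));
        return 0;
}
------------------------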

I have one (unrelated) question: why do you need 80 fully synced nodes?

Comment 8 Fedora End Of Life 2013-04-03 15:13:52 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 19 development cycle.
Changing version to '19'.

(As we did not run this process for some time, it may also affect pre-Fedora 19 development
cycle bugs. We are very sorry. It will help us with cleanup during the Fedora 19 End Of Life. Thank you.)

More information and reason for this action is here:
https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora19

Comment 9 Fedora End Of Life 2015-01-09 17:37:24 UTC
This message is a notice that Fedora 19 is now at end of life. Fedora 
has stopped maintaining and issuing updates for Fedora 19. It is 
Fedora's policy to close all bug reports from releases that are no 
longer maintained. Approximately 4 (four) weeks from now this bug will
be closed as EOL if it remains open with a Fedora 'version' of '19'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 19 reached end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version before this bug is closed, as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 10 Fedora End Of Life 2015-02-17 14:42:57 UTC
Fedora 19 changed to end-of-life (EOL) status on 2015-01-06. Fedora 19 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

