Bug 611676 - cascading failures result in failure of sync to abort in-process synchronization operation
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: corosync
Version: 6.0
Hardware: All
OS: Linux
Priority: low
Severity: high
Target Milestone: rc
Assigned To: Angus Salkeld
QA Contact: Cluster QE
Blocks: 610815
Reported: 2010-07-06 02:18 EDT by Andrew Beekhof
Modified: 2010-11-10 17:07 EST
CC: 5 users

Fixed In Version: corosync-1.2.3-15.el6
Doc Type: Bug Fix
Last Closed: 2010-11-10 17:07:21 EST

Attachments
- output of corosync-blackbox (1.56 MB, text/plain), 2010-07-13 11:05 EDT, Angus Salkeld
- fix to make sync abort properly (1.05 KB, patch), 2010-07-15 23:04 EDT, Angus Salkeld
- patch: reset ring id in the sync abort (790 bytes, patch), 2010-07-18 18:03 EDT, Angus Salkeld

Description Andrew Beekhof 2010-07-06 02:18:49 EDT
Description of problem:

In a four-node cluster, with three nodes shutting down in quick succession, corosync takes over 15 minutes to compute a new membership containing only the final node.


Jul  5 20:49:09 pcmk-4 corosync[23346]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jul  5 20:49:09 pcmk-4 corosync[23346]:   [QUORUM] Members[2]: 101 104
Jul  5 20:49:09 pcmk-4 corosync[23346]:   [QUORUM] Members[2]: 101 104
Jul  5 20:49:10 pcmk-4 corosync[23346]:   [QUORUM] Members[1]: 104
Jul  5 20:49:10 pcmk-4 corosync[23346]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jul  5 21:06:59 pcmk-4 corosync[23346]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.


Version-Release number of selected component (if applicable):

Jul  5 20:42:52 pcmk-4 corosync[23346]:   [MAIN  ] Corosync Cluster Engine ('1.2.5'): started and ready to provide service.

How reproducible:

Using the pacemaker test suite, at least twice per hour.

Steps to Reproduce:
1. Four node cluster, all nodes active
2. Tell all four to shut down at the same time
  
Actual results:

Corosync takes more than 15 minutes to realize it is alone.

Expected results:

Corosync takes a few seconds to realize it is alone.

Additional info:

Totem config
        # How long before declaring a token lost (ms)
        token:          5000

        # How many token retransmits before forming a new configuration
        token_retransmits_before_loss_const: 10

        # How long to wait for join messages in the membership protocol (ms)
        join:           1000

        # How long to wait for consensus to be achieved before starting a new round of membership configuration (ms)
        consensus:      7500

Full logs:


Jul  5 20:42:52 pcmk-4 corosync[23346]:   [MAIN  ] Corosync Cluster Engine ('1.2.5'): started and ready to provide service.
Jul  5 20:42:52 pcmk-4 corosync[23346]:   [MAIN  ] Corosync built-in features: nss rdma
Jul  5 20:42:52 pcmk-4 corosync[23346]:   [MAIN  ] Successfully read main configuration file '/etc/corosync/corosync.conf'.
Jul  5 20:42:52 pcmk-4 corosync[23346]:   [TOTEM ] Initializing transport (UDP/IP).
Jul  5 20:42:52 pcmk-4 corosync[23346]:   [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Jul  5 20:42:52 pcmk-4 corosync[23346]:   [TOTEM ] The network interface [192.168.122.104] is now up.
Jul  5 20:42:52 pcmk-4 corosync[23346]:   [CMAN  ] CMAN 3.0.12 (built May 26 2010 04:49:43) started
Jul  5 20:42:52 pcmk-4 corosync[23346]:   [SERV  ] Service engine loaded: corosync CMAN membership service 2.90
Jul  5 20:42:52 pcmk-4 corosync[23346]:   [SERV  ] Service engine loaded: corosync extended virtual synchrony service
Jul  5 20:42:52 pcmk-4 corosync[23346]:   [SERV  ] Service engine loaded: corosync configuration service
Jul  5 20:42:52 pcmk-4 corosync[23346]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01
Jul  5 20:42:52 pcmk-4 corosync[23346]:   [SERV  ] Service engine loaded: corosync cluster config database access v1.01
Jul  5 20:42:52 pcmk-4 corosync[23346]:   [SERV  ] Service engine loaded: corosync profile loading service
Jul  5 20:42:52 pcmk-4 corosync[23346]:   [QUORUM] Using quorum provider quorum_cman
Jul  5 20:42:52 pcmk-4 corosync[23346]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1
Jul  5 20:42:52 pcmk-4 corosync[23346]:   [MAIN  ] Compatibility mode set to whitetank.  Using V1 and V2 of the synchronization engine.
Jul  5 20:42:52 pcmk-4 corosync[23346]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jul  5 20:42:52 pcmk-4 corosync[23346]:   [CMAN  ] quorum regained, resuming activity
Jul  5 20:42:52 pcmk-4 corosync[23346]:   [QUORUM] This node is within the primary component and will provide service.
Jul  5 20:42:52 pcmk-4 corosync[23346]:   [QUORUM] Members[1]: 104
Jul  5 20:42:52 pcmk-4 corosync[23346]:   [QUORUM] Members[1]: 104
Jul  5 20:42:52 pcmk-4 corosync[23346]:   [QUORUM] Members[2]: 101 104
Jul  5 20:42:52 pcmk-4 corosync[23346]:   [QUORUM] Members[2]: 101 104
Jul  5 20:42:52 pcmk-4 corosync[23346]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jul  5 20:42:52 pcmk-4 corosync[23346]:   [QUORUM] Members[3]: 101 103 104
Jul  5 20:42:52 pcmk-4 corosync[23346]:   [QUORUM] Members[3]: 101 103 104
Jul  5 20:42:52 pcmk-4 corosync[23346]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jul  5 20:42:52 pcmk-4 corosync[23346]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jul  5 20:42:52 pcmk-4 corosync[23346]:   [QUORUM] Members[4]: 101 102 103 104
Jul  5 20:42:52 pcmk-4 corosync[23346]:   [QUORUM] Members[4]: 101 102 103 104
Jul  5 20:42:52 pcmk-4 corosync[23346]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jul  5 20:49:09 pcmk-4 corosync[23346]:   [QUORUM] Members[3]: 101 102 104
Jul  5 20:49:09 pcmk-4 corosync[23346]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jul  5 20:49:09 pcmk-4 corosync[23346]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jul  5 20:49:09 pcmk-4 corosync[23346]:   [CMAN  ] quorum lost, blocking activity
Jul  5 20:49:09 pcmk-4 corosync[23346]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Jul  5 20:49:09 pcmk-4 corosync[23346]:   [QUORUM] Members[2]: 101 104
Jul  5 20:49:09 pcmk-4 corosync[23346]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jul  5 20:49:09 pcmk-4 corosync[23346]:   [QUORUM] Members[2]: 101 104
Jul  5 20:49:09 pcmk-4 corosync[23346]:   [QUORUM] Members[2]: 101 104
Jul  5 20:49:10 pcmk-4 corosync[23346]:   [QUORUM] Members[1]: 104
Jul  5 20:49:10 pcmk-4 corosync[23346]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jul  5 21:06:59 pcmk-4 corosync[23346]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jul  5 21:06:59 pcmk-4 corosync[23346]:   [QUORUM] Members[2]: 101 104
Jul  5 21:06:59 pcmk-4 corosync[23346]:   [QUORUM] Members[2]: 101 104
Jul  5 21:06:59 pcmk-4 corosync[23346]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jul  5 21:07:05 pcmk-4 corosync[23346]:   [SERV  ] Unloading all Corosync service engines.
Jul  5 21:07:05 pcmk-4 corosync[23346]:   [SERV  ] Service engine unloaded: corosync extended virtual synchrony service
Jul  5 21:07:05 pcmk-4 corosync[23346]:   [SERV  ] Service engine unloaded: corosync configuration service
Jul  5 21:07:05 pcmk-4 corosync[23346]:   [SERV  ] Service engine unloaded: corosync cluster closed process group service v1.01
Jul  5 21:07:05 pcmk-4 corosync[23346]:   [SERV  ] Service engine unloaded: corosync cluster config database access v1.01
Jul  5 21:07:05 pcmk-4 corosync[23346]:   [SERV  ] Service engine unloaded: corosync profile loading service
Jul  5 21:07:05 pcmk-4 corosync[23346]:   [SERV  ] Service engine unloaded: corosync CMAN membership service 2.90
Jul  5 21:07:05 pcmk-4 corosync[23346]:   [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
Jul  5 21:07:05 pcmk-4 corosync[23346]:   [MAIN  ] Corosync Cluster Engine exiting with status 0 at main.c:170.
Comment 2 Andrew Beekhof 2010-07-06 09:13:16 EDT
To trigger this manually, you'd probably bring a 4-node cman cluster up and then run "cman_tool leave" on three of them at essentially the same time.

That's a pretty good reflection of what happened here.
Comment 3 Angus Salkeld 2010-07-10 00:35:50 EDT
Andrew, I have tried quite a bit to reproduce this, with no luck.

I have tried:
1) using cssh (Cluster SSH) to repeatedly run cman_tool (join/leave).
   I ran this 50 times.

2) using CTS to write a test that sends 10000 cpg messages and stops the corosync
   processes on the other nodes while that node transmits.
   I ran this overnight, ~300 times, with a maximum shutdown time of 10 sec.

3) using cpgbench on one node while using cssh to shut down the other nodes (at the same time).

If you can reproduce easily, can you "killall -SIGSEGV corosync" on the node
that is taking forever to shut down?

I will continue trying to reproduce.

-Angus
Comment 4 David Teigland 2010-07-12 13:08:10 EDT
I have no idea if this is related, but it's reminiscent of this problem, which wasn't as extreme:
https://lists.linux-foundation.org/pipermail/openais/2009-April/011566.html

I'm not sure how that was addressed.  Here's a cman commit from about that time that may be related (commit message is unfortunately not very helpful): 
http://git.fedorahosted.org/git/?p=cluster.git;a=commitdiff;h=38810c9725b5295c2d97ae5034c421b65a5d9863

There may be an openais change related to this, but I can't easily browse those commits with svn :-(
Comment 5 Angus Salkeld 2010-07-12 23:05:01 EDT
Thanks David, it helps to have some context.

Here is something that looks fishy (just a theory):

The lockup/timeout is:
Jul  5 20:49:10 
Jul  5 21:06:59
(60*60*21 + 60*6 + 59) - (60*60*20 + 60*49 + 10) == 1069 seconds

from pacemaker/mcp/corosync.c

#define cs_repeat(counter, max, code) do {		\
	code;						\
	if(rc == CS_ERR_TRY_AGAIN) {			\
	    counter++;					\
	    sleep(counter);				\
	}						\
    } while(rc == CS_ERR_TRY_AGAIN && counter < max)

So [manual] backtrace

pcmk_cpg_membership
> update_process_peers();
>> send_cpg_message()
>>> cs_repeat(retries, 30, rc = cpg_mcast_joined(cpg_handle, CPG_TYPE_AGREED, iov, 1));

So say we have 4 nodes and lose 2 (quorum is now lost): we get the config change.
We try to send something, and as we don't have quorum we get TRY_AGAIN.
The loop only gives up after sleeping 1 + 2 + 3 + ... + 30 == 465 secs.
We then lose another node (ourselves) and wait another 465 secs.

465 + 465 == 930, which is the same order of magnitude as the 1069 above.

This blocking in send_cpg_message() might be preventing a cfg_shutdown callback.

I am not a fan of cs_repeat(); blocking for such a long time is bound to cause
problems. Can you schedule a re-send using a timer so that the poll loop can call your dispatch routine?

-Angus
Comment 6 David Teigland 2010-07-13 09:43:01 EDT
Maybe I'm not following what you're saying, but the loss of quorum should not prevent sending messages (TRY_AGAIN).  We fixed that a very long time ago.  It appears that the fix may have been CS_LIB_ALLOW_INQUORATE, which implies that sometimes (or for some services), things may not fully operate when quorum is lost.  That would be bad, at least for the apps that I maintain (and I suspect for Andrew's as well).
Comment 7 David Teigland 2010-07-13 10:22:37 EDT
Now that I've looked at this more closely, I'm fairly sure I've seen the same thing a few times.  I believe I showed the /var/log/messages to Steve once... I'll see if I can dig them out from somewhere.

I don't think it was related to quorum; corosync just seemed to take forever to complete the transition.  During a "transition", send does return TRY_AGAIN, which is, of course, why we all have to put loops around cpg_mcast_joined.  For better or worse (in this case probably better), I don't put any limit on most of my send loops (below).  I regularly see log entries
"cpg_mcast_joined retry x y" and on occasion there are lots of them.

static int _send_message(cpg_handle_t h, void *buf, int len, int type)
{
        struct iovec iov;
        cpg_error_t error;
        int retries = 0;

        iov.iov_base = buf;
        iov.iov_len = len;

 retry:
        error = cpg_mcast_joined(h, CPG_TYPE_AGREED, &iov, 1);
        if (error == CPG_ERR_TRY_AGAIN) {
                retries++;
                usleep(1000);
                if (!(retries % 100))
                        log_error("cpg_mcast_joined retry %d %s",
                                   retries, msg_name(type));
                goto retry;
        }
        if (error != CPG_OK) {
                log_error("cpg_mcast_joined error %d handle %llx %s",
                          error, (unsigned long long)h, msg_name(type));
                return -1;
        }

        if (retries)
                log_debug("cpg_mcast_joined retried %d %s",
                          retries, msg_name(type));

        return 0;
}
Comment 8 Angus Salkeld 2010-07-13 11:01:17 EDT
Hi Dave, you're right - nothing to do with quorum.

The TRY_AGAIN is happening because we are not getting a sync_completed:

sending_allowed = /* bla bla */ (sync_in_process == 0);

sync_in_process is still "1" when we get into this situation.

Here is the last bit of my log:

Jul 13 23:53:22 n1 corosync[24930]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jul 13 23:53:41 n1 cpg_test_agent[24992]: called cpg_join()!
Jul 13 23:53:41 n1 cpg_test_agent[24992]: config_change_callback(0) 0:GenStopAllButOne_whitetank_5min joined
Jul 13 23:53:41 n1 cpg_test_agent[24992]: do_command cfg_initialize() called!
Jul 13 23:53:41 n1 cpg_test_agent[24992]: config_change_callback(0) 0:GenStopAllButOne_whitetank_5min joined
Jul 13 23:53:41 n1 cpg_test_agent[24992]: config_change_callback(0) 0:GenStopAllButOne_whitetank_5min joined
Jul 13 23:53:41 n1 cpg_test_agent[24992]: config_change_callback(0) -65536:GenStopAllButOne_whitetank_5min joined
Jul 13 23:53:42 n1 votequorum_test_agent[25118]: VQ notification quorate: 1
Jul 13 23:53:42 n1 votequorum_test_agent[25118]: NQ notification quorate: 1
Jul 13 23:53:42 n1 votequorum_test_agent[25118]: VQ notification quorate: 1
Jul 13 23:53:42 n1 votequorum_test_agent[25118]: NQ notification quorate: 1
Jul 13 23:53:42 n1 corosync[24930]:   [MAIN  ] !sending_allowed: Q:1 AQ:1 FC:1 RS:0 SY:0
Jul 13 23:53:42 n1 votequorum_test_agent[25118]: VQ notification quorate: 1
Jul 13 23:53:42 n1 votequorum_test_agent[25118]: NQ notification quorate: 1
Jul 13 23:53:42 n1 votequorum_test_agent[25118]: VQ notification quorate: 0
Jul 13 23:53:42 n1 votequorum_test_agent[25118]: NQ notification quorate: 0
Jul 13 23:53:43 n1 corosync[24930]:   [CPG   ] downlist received left_list: 1
Jul 13 23:53:43 n1 corosync[24930]:   [CPG   ] downlist received left_list: 1
Jul 13 23:53:43 n1 corosync[24930]:   [CPG   ] chosen downlist from node r(0) ip(192.168.100.11) 
Jul 13 23:53:43 n1 corosync[24930]:   [TST2  ] tst_sv2_sync_init_v2 r(0) ip(192.168.100.11) 
Jul 13 23:53:43 n1 corosync[24930]:   [TST2  ] sync: node left r(0) ip(192.168.100.13) 
Jul 13 23:53:43 n1 corosync[24930]:   [VOTEQ ] quorum lost, blocking activity
Jul 13 23:53:43 n1 corosync[24930]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Jul 13 23:53:43 n1 corosync[24930]:   [QUORUM] Members[2]: 191146176 207923392
Jul 13 23:53:43 n1 corosync[24930]:   [TST2  ] Member left: r(0) ip(192.168.100.14) 
Jul 13 23:53:43 n1 corosync[24930]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jul 13 23:53:43 n1 corosync[24930]:   [CPG   ] downlist received left_list: 1
Jul 13 23:53:43 n1 corosync[24930]:   [CPG   ] downlist received left_list: 1
Jul 13 23:53:43 n1 corosync[24930]:   [CPG   ] chosen downlist from node r(0) ip(192.168.100.11) 
Jul 13 23:53:43 n1 corosync[24930]:   [MAIN  ] !sending_allowed: Q:0 AQ:1 FC:1 RS:1 SY:1

Then tonnes of "!sending_allowed" messages. Note that SY:1 means sync_in_process == 1.
I'll attach the output from corosync-blackbox.
Comment 9 Angus Salkeld 2010-07-13 11:05:45 EDT
Created attachment 431490 [details]
output of corosync-blackbox
Comment 11 Angus Salkeld 2010-07-15 23:04:25 EDT
Created attachment 432274 [details]
fix to make sync abort properly

1) sync_callbacks.sync_abort can be NULL.
2) sync_processing is set to 0 after syncv1 is done, and then syncv2
   processing begins. If we get a config change after syncv1 is done,
   but before syncv2 is done, then the sync won't get aborted.
Comment 12 Angus Salkeld 2010-07-15 23:09:31 EDT
I have committed a CTS patch upstream that reproduces the error.

During these test runs I have needed some IPC fixes as well; Steve
says he will submit some IPC fixes soon.

The attached patch is pending approval on the mailing list.

-Angus
Comment 13 Angus Salkeld 2010-07-18 18:03:27 EDT
Created attachment 432738 [details]
patch: reset ring id in the sync abort

patch: reset ring id in the sync abort
Comment 14 Andrew Beekhof 2010-07-24 11:15:38 EDT
Just tried with -15 and the problem is infinitely worse.
Pacemaker can't even start up anymore.

[root@pcmk-1 ~]# rpm -qa | grep corosync
corosync-1.2.3-15.el6.x86_64
corosynclib-1.2.3-15.el6.x86_64

Here are the logs.

Jul 24 11:33:37 pcmk-1 corosync[8963]:   [MAIN  ] Corosync Cluster Engine ('1.2.3'): started and ready to provide service.
Jul 24 11:33:37 pcmk-1 corosync[8963]:   [MAIN  ] Corosync built-in features: nss rdma
Jul 24 11:33:37 pcmk-1 corosync[8963]:   [MAIN  ] Successfully read main configuration file '/etc/corosync/corosync.conf'.
Jul 24 11:33:37 pcmk-1 corosync[8963]:   [TOTEM ] Token Timeout (5000 ms) retransmit timeout (490 ms)
Jul 24 11:33:37 pcmk-1 corosync[8963]:   [TOTEM ] token hold (382 ms) retransmits before loss (10 retrans)
Jul 24 11:33:37 pcmk-1 corosync[8963]:   [TOTEM ] join (1000 ms) send_join (45 ms) consensus (7500 ms) merge (200 ms)
Jul 24 11:33:37 pcmk-1 corosync[8963]:   [TOTEM ] downcheck (1000 ms) fail to recv const (50 msgs)
Jul 24 11:33:37 pcmk-1 corosync[8963]:   [TOTEM ] seqno unchanged const (30 rotations) Maximum network MTU 1402
Jul 24 11:33:37 pcmk-1 corosync[8963]:   [TOTEM ] window size per rotation (50 messages) maximum messages per rotation (20 messages)
Jul 24 11:33:37 pcmk-1 corosync[8963]:   [TOTEM ] send threads (0 threads)
Jul 24 11:33:37 pcmk-1 corosync[8963]:   [TOTEM ] RRP token expired timeout (490 ms)
Jul 24 11:33:37 pcmk-1 corosync[8963]:   [TOTEM ] RRP token problem counter (2000 ms)
Jul 24 11:33:37 pcmk-1 corosync[8963]:   [TOTEM ] RRP threshold (10 problem count)
Jul 24 11:33:37 pcmk-1 corosync[8963]:   [TOTEM ] RRP mode set to none.
Jul 24 11:33:37 pcmk-1 corosync[8963]:   [TOTEM ] heartbeat_failures_allowed (0)
Jul 24 11:33:37 pcmk-1 corosync[8963]:   [TOTEM ] max_network_delay (50 ms)
Jul 24 11:33:37 pcmk-1 corosync[8963]:   [TOTEM ] HeartBeat is Disabled. To enable set heartbeat_failures_allowed > 0
Jul 24 11:33:37 pcmk-1 corosync[8963]:   [TOTEM ] Initializing transport (UDP/IP).
Jul 24 11:33:37 pcmk-1 corosync[8963]:   [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Jul 24 11:33:37 pcmk-1 corosync[8963]:   [IPC   ] you are using ipc api v2
Jul 24 11:33:37 pcmk-1 corosync[8963]:   [TOTEM ] Receive multicast socket recv buffer size (249856 bytes).
Jul 24 11:33:37 pcmk-1 corosync[8963]:   [TOTEM ] Transmit multicast socket send buffer size (249856 bytes).
Jul 24 11:33:37 pcmk-1 corosync[8963]:   [TOTEM ] The network interface [192.168.122.101] is now up.
Jul 24 11:33:37 pcmk-1 corosync[8963]:   [TOTEM ] Created or loaded sequence id 0.192.168.122.101 for this ring.
Jul 24 11:33:37 pcmk-1 corosync[8963]:   [CMAN  ] CMAN starting
Jul 24 11:33:37 pcmk-1 corosync[8963]:   [CMAN  ] memb: Got node pcmk-1 from ccs (id=101, votes=1)
Jul 24 11:33:37 pcmk-1 corosync[8963]:   [CMAN  ] memb: add_new_node: pcmk-1, (id=101, votes=1) newalloc=1
Jul 24 11:33:37 pcmk-1 corosync[8963]:   [CMAN  ] memb: Got node pcmk-2 from ccs (id=102, votes=1)
Jul 24 11:33:37 pcmk-1 corosync[8963]:   [CMAN  ] memb: add_new_node: pcmk-2, (id=102, votes=1) newalloc=1
Jul 24 11:33:37 pcmk-1 corosync[8963]:   [CMAN  ] memb: Got node pcmk-3 from ccs (id=103, votes=1)
Jul 24 11:33:37 pcmk-1 corosync[8963]:   [CMAN  ] memb: add_new_node: pcmk-3, (id=103, votes=1) newalloc=1
Jul 24 11:33:37 pcmk-1 corosync[8963]:   [CMAN  ] memb: Got node pcmk-4 from ccs (id=104, votes=1)
Jul 24 11:33:37 pcmk-1 corosync[8963]:   [CMAN  ] memb: add_new_node: pcmk-4, (id=104, votes=1) newalloc=1
Jul 24 11:33:37 pcmk-1 corosync[8963]:   [CMAN  ] memb: add_new_node: pcmk-1, (id=101, votes=1) newalloc=0
Jul 24 11:33:37 pcmk-1 corosync[8963]:   [CMAN  ] CMAN 3.0.12 (built Jul 23 2010 03:52:52) started
Jul 24 11:33:37 pcmk-1 corosync[8963]:   [SERV  ] Service engine loaded: corosync CMAN membership service 2.90
Jul 24 11:33:37 pcmk-1 corosync[8963]:   [SERV  ] Service engine loaded: corosync extended virtual synchrony service
Jul 24 11:33:37 pcmk-1 corosync[8963]:   [SERV  ] Service engine loaded: corosync configuration service
Jul 24 11:33:37 pcmk-1 corosync[8963]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01
Jul 24 11:33:37 pcmk-1 corosync[8963]:   [SERV  ] Service engine loaded: corosync cluster config database access v1.01
Jul 24 11:33:37 pcmk-1 corosync[8963]:   [SERV  ] Service engine loaded: corosync profile loading service
Jul 24 11:33:37 pcmk-1 corosync[8963]:   [QUORUM] Using quorum provider quorum_cman
Jul 24 11:33:37 pcmk-1 corosync[8963]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1
Jul 24 11:33:37 pcmk-1 corosync[8963]:   [MAIN  ] Compatibility mode set to whitetank.  Using V1 and V2 of the synchronization engine.
Jul 24 11:33:37 pcmk-1 corosync[8963]:   [TOTEM ] entering GATHER state from 15.
Jul 24 11:33:39 pcmk-1 pacemakerd: [8984]: info: Invoked: pacemakerd 
Jul 24 11:33:39 pcmk-1 pacemakerd: [8984]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/root
Jul 24 11:33:39 pcmk-1 corosync[8963]:   [CONFDB] lib_init_fn: conn=0x1578f90
Jul 24 11:33:39 pcmk-1 pacemakerd: [8984]: info: read_config: Reading configure
Jul 24 11:33:39 pcmk-1 pacemakerd: [8984]: info: config_find_next: Processing additional service options...
Jul 24 11:33:39 pcmk-1 pacemakerd: [8984]: info: get_config_opt: Found 'corosync_cman' for option: name
Jul 24 11:33:39 pcmk-1 pacemakerd: [8984]: info: config_find_next: No additional configuration supplied for: service
Jul 24 11:33:39 pcmk-1 pacemakerd: [8984]: info: config_find_next: Processing additional logging options...
Jul 24 11:33:39 pcmk-1 pacemakerd: [8984]: info: get_config_opt: Found 'on' for option: debug
Jul 24 11:33:39 pcmk-1 pacemakerd: [8984]: info: get_config_opt: Found 'yes' for option: to_logfile
Jul 24 11:33:39 pcmk-1 pacemakerd: [8984]: info: get_config_opt: Found '/var/log/corosync.log' for option: logfile
Jul 24 11:33:39 pcmk-1 pacemakerd: [8984]: info: get_config_opt: Found 'yes' for option: to_syslog
Jul 24 11:33:39 pcmk-1 pacemakerd: [8984]: info: get_config_opt: Found 'daemon' for option: syslog_facility
Jul 24 11:33:39 pcmk-1 pacemakerd: [8984]: info: config_find_next: Processing additional quorum options...
Jul 24 11:33:39 pcmk-1 pacemakerd: [8984]: info: get_config_opt: Found 'quorum_cman' for option: provider
Jul 24 11:33:39 pcmk-1 corosync[8963]:   [CONFDB] exit_fn for conn=0x1578f90
Jul 24 11:33:39 pcmk-1 pacemakerd: [8986]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/root
Jul 24 11:33:39 pcmk-1 pacemakerd: [8986]: info: main: Starting Pacemaker 1.1.2 (Build: f059ec7ced7a86f18e5490b67ebf4a0b963bccfe):  manpages docbook-manpages publican ncurses cman cs-quorum corosync snmp libesmtp
Jul 24 11:33:39 pcmk-1 pacemakerd: [8986]: info: main: Maximum core file size is: 18446744073709551615
Jul 24 11:33:39 pcmk-1 pacemakerd: [8986]: debug: cluster_connect_cfg: Our nodeid: 101
Jul 24 11:33:39 pcmk-1 pacemakerd: [8986]: debug: cluster_connect_cfg: Retrying operation after 1s
Jul 24 11:33:40 pcmk-1 pacemakerd: [8986]: debug: cluster_connect_cfg: Retrying operation after 2s
Jul 24 11:33:42 pcmk-1 pacemakerd: [8986]: debug: cluster_connect_cfg: Retrying operation after 3s
Jul 24 11:33:45 pcmk-1 pacemakerd: [8986]: debug: cluster_connect_cfg: Retrying operation after 4s
Jul 24 11:33:49 pcmk-1 pacemakerd: [8986]: debug: cluster_connect_cfg: Retrying operation after 5s
Jul 24 11:33:50 pcmk-1 corosync[8963]:   [TOTEM ] The consensus timeout expired.
Jul 24 11:33:50 pcmk-1 corosync[8963]:   [TOTEM ] entering GATHER state from 3.
Jul 24 11:33:54 pcmk-1 pacemakerd: [8986]: debug: cluster_connect_cfg: Retrying operation after 6s
Jul 24 11:34:00 pcmk-1 pacemakerd: [8986]: debug: cluster_connect_cfg: Retrying operation after 7s
Jul 24 11:34:02 pcmk-1 corosync[8963]:   [TOTEM ] The consensus timeout expired.
Jul 24 11:34:02 pcmk-1 corosync[8963]:   [TOTEM ] entering GATHER state from 3.
Jul 24 11:34:07 pcmk-1 pacemakerd: [8986]: debug: cluster_connect_cfg: Retrying operation after 8s
Jul 24 11:34:15 pcmk-1 corosync[8963]:   [TOTEM ] The consensus timeout expired.
Jul 24 11:34:15 pcmk-1 corosync[8963]:   [TOTEM ] entering GATHER state from 3.
Jul 24 11:34:15 pcmk-1 pacemakerd: [8986]: debug: cluster_connect_cfg: Retrying operation after 9s
Jul 24 11:34:24 pcmk-1 pacemakerd: [8986]: debug: cluster_connect_cfg: Retrying operation after 10s
Jul 24 11:34:27 pcmk-1 corosync[8963]:   [TOTEM ] The consensus timeout expired.
Jul 24 11:34:27 pcmk-1 corosync[8963]:   [TOTEM ] entering GATHER state from 3.
Jul 24 11:34:34 pcmk-1 pacemakerd: [8986]: debug: cluster_connect_cfg: Retrying operation after 11s
Jul 24 11:34:40 pcmk-1 corosync[8963]:   [TOTEM ] The consensus timeout expired.
Jul 24 11:34:40 pcmk-1 corosync[8963]:   [TOTEM ] entering GATHER state from 3.
Jul 24 11:34:45 pcmk-1 pacemakerd: [8986]: debug: cluster_connect_cfg: Retrying operation after 12s
Jul 24 11:34:52 pcmk-1 corosync[8963]:   [TOTEM ] The consensus timeout expired.
Jul 24 11:34:52 pcmk-1 corosync[8963]:   [TOTEM ] entering GATHER state from 3.
Jul 24 11:34:57 pcmk-1 pacemakerd: [8986]: debug: cluster_connect_cfg: Retrying operation after 13s
Jul 24 11:35:05 pcmk-1 corosync[8963]:   [TOTEM ] The consensus timeout expired.
Jul 24 11:35:05 pcmk-1 corosync[8963]:   [TOTEM ] entering GATHER state from 3.
Jul 24 11:35:10 pcmk-1 pacemakerd: [8986]: debug: cluster_connect_cfg: Retrying operation after 14s
Jul 24 11:35:17 pcmk-1 corosync[8963]:   [TOTEM ] The consensus timeout expired.
Jul 24 11:35:17 pcmk-1 corosync[8963]:   [TOTEM ] entering GATHER state from 3.
Jul 24 11:35:24 pcmk-1 pacemakerd: [8986]: debug: cluster_connect_cfg: Retrying operation after 15s
Jul 24 11:35:30 pcmk-1 corosync[8963]:   [TOTEM ] The consensus timeout expired.
Jul 24 11:35:30 pcmk-1 corosync[8963]:   [TOTEM ] entering GATHER state from 3.
Jul 24 11:35:39 pcmk-1 pacemakerd: [8986]: debug: cluster_connect_cfg: Retrying operation after 16s
Jul 24 11:35:42 pcmk-1 corosync[8963]:   [TOTEM ] The consensus timeout expired.
Jul 24 11:35:42 pcmk-1 corosync[8963]:   [TOTEM ] entering GATHER state from 3.
Jul 24 11:35:55 pcmk-1 corosync[8963]:   [TOTEM ] The consensus timeout expired.
Jul 24 11:35:55 pcmk-1 corosync[8963]:   [TOTEM ] entering GATHER state from 3.
Jul 24 11:35:55 pcmk-1 pacemakerd: [8986]: debug: cluster_connect_cfg: Retrying operation after 17s
Jul 24 11:36:08 pcmk-1 corosync[8963]:   [TOTEM ] The consensus timeout expired.
Jul 24 11:36:08 pcmk-1 corosync[8963]:   [TOTEM ] entering GATHER state from 3.
Jul 24 11:36:12 pcmk-1 pacemakerd: [8986]: debug: cluster_connect_cfg: Retrying operation after 18s
Jul 24 11:36:20 pcmk-1 corosync[8963]:   [TOTEM ] The consensus timeout expired.
Jul 24 11:36:20 pcmk-1 corosync[8963]:   [TOTEM ] entering GATHER state from 3.
Jul 24 11:36:30 pcmk-1 pacemakerd: [8986]: debug: cluster_connect_cfg: Retrying operation after 19s
Jul 24 11:36:33 pcmk-1 corosync[8963]:   [TOTEM ] The consensus timeout expired.
Jul 24 11:36:33 pcmk-1 corosync[8963]:   [TOTEM ] entering GATHER state from 3.
Jul 24 11:36:45 pcmk-1 corosync[8963]:   [TOTEM ] The consensus timeout expired.
Jul 24 11:36:45 pcmk-1 corosync[8963]:   [TOTEM ] entering GATHER state from 3.
Jul 24 11:36:49 pcmk-1 pacemakerd: [8986]: debug: cluster_connect_cfg: Retrying operation after 20s
Jul 24 11:36:58 pcmk-1 corosync[8963]:   [TOTEM ] The consensus timeout expired.
Jul 24 11:36:58 pcmk-1 corosync[8963]:   [TOTEM ] entering GATHER state from 3.
Jul 24 11:37:09 pcmk-1 pacemakerd: [8986]: debug: cluster_connect_cfg: Retrying operation after 21s
Jul 24 11:37:10 pcmk-1 corosync[8963]:   [TOTEM ] The consensus timeout expired.
Jul 24 11:37:10 pcmk-1 corosync[8963]:   [TOTEM ] entering GATHER state from 3.
Jul 24 11:37:23 pcmk-1 corosync[8963]:   [TOTEM ] The consensus timeout expired.
Jul 24 11:37:23 pcmk-1 corosync[8963]:   [TOTEM ] entering GATHER state from 3.
Jul 24 11:37:30 pcmk-1 pacemakerd: [8986]: debug: cluster_connect_cfg: Retrying operation after 22s
Jul 24 11:37:35 pcmk-1 corosync[8963]:   [TOTEM ] The consensus timeout expired.
Jul 24 11:37:35 pcmk-1 corosync[8963]:   [TOTEM ] entering GATHER state from 3.
Jul 24 11:37:48 pcmk-1 corosync[8963]:   [TOTEM ] The consensus timeout expired.
Jul 24 11:37:48 pcmk-1 corosync[8963]:   [TOTEM ] entering GATHER state from 3.
Jul 24 11:37:52 pcmk-1 pacemakerd: [8986]: debug: cluster_connect_cfg: Retrying operation after 23s
Jul 24 11:38:00 pcmk-1 corosync[8963]:   [TOTEM ] The consensus timeout expired.
Jul 24 11:38:01 pcmk-1 corosync[8963]:   [TOTEM ] entering GATHER state from 3.
Jul 24 11:38:13 pcmk-1 corosync[8963]:   [TOTEM ] The consensus timeout expired.
Jul 24 11:38:13 pcmk-1 corosync[8963]:   [TOTEM ] entering GATHER state from 3.
Jul 24 11:38:15 pcmk-1 pacemakerd: [8986]: debug: cluster_connect_cfg: Retrying operation after 24s
Jul 24 11:38:26 pcmk-1 corosync[8963]:   [TOTEM ] The consensus timeout expired.
Jul 24 11:38:26 pcmk-1 corosync[8963]:   [TOTEM ] entering GATHER state from 3.
Jul 24 11:38:38 pcmk-1 corosync[8963]:   [TOTEM ] The consensus timeout expired.
Jul 24 11:38:38 pcmk-1 corosync[8963]:   [TOTEM ] entering GATHER state from 3.
Jul 24 11:38:39 pcmk-1 pacemakerd: [8986]: debug: cluster_connect_cfg: Retrying operation after 25s
Jul 24 11:38:51 pcmk-1 corosync[8963]:   [TOTEM ] The consensus timeout expired.
Jul 24 11:38:51 pcmk-1 corosync[8963]:   [TOTEM ] entering GATHER state from 3.
Jul 24 11:39:03 pcmk-1 corosync[8963]:   [TOTEM ] The consensus timeout expired.
Jul 24 11:39:03 pcmk-1 corosync[8963]:   [TOTEM ] entering GATHER state from 3.
Jul 24 11:39:04 pcmk-1 pacemakerd: [8986]: debug: cluster_connect_cfg: Retrying operation after 26s
Jul 24 11:39:16 pcmk-1 corosync[8963]:   [TOTEM ] The consensus timeout expired.
Jul 24 11:39:16 pcmk-1 corosync[8963]:   [TOTEM ] entering GATHER state from 3.
Jul 24 11:39:28 pcmk-1 corosync[8963]:   [TOTEM ] The consensus timeout expired.
Jul 24 11:39:28 pcmk-1 corosync[8963]:   [TOTEM ] entering GATHER state from 3.
Jul 24 11:39:30 pcmk-1 pacemakerd: [8986]: debug: cluster_connect_cfg: Retrying operation after 27s
Jul 24 11:39:41 pcmk-1 corosync[8963]:   [TOTEM ] The consensus timeout expired.
Jul 24 11:39:41 pcmk-1 corosync[8963]:   [TOTEM ] entering GATHER state from 3.
Jul 24 11:39:53 pcmk-1 corosync[8963]:   [TOTEM ] The consensus timeout expired.
Jul 24 11:39:53 pcmk-1 corosync[8963]:   [TOTEM ] entering GATHER state from 3.
Jul 24 11:39:57 pcmk-1 pacemakerd: [8986]: debug: cluster_connect_cfg: Retrying operation after 28s
Jul 24 11:40:06 pcmk-1 corosync[8963]:   [TOTEM ] The consensus timeout expired.
Jul 24 11:40:06 pcmk-1 corosync[8963]:   [TOTEM ] entering GATHER state from 3.
Jul 24 11:40:19 pcmk-1 corosync[8963]:   [TOTEM ] The consensus timeout expired.
Jul 24 11:40:19 pcmk-1 corosync[8963]:   [TOTEM ] entering GATHER state from 3.
Jul 24 11:40:25 pcmk-1 pacemakerd: [8986]: debug: cluster_connect_cfg: Retrying operation after 29s
Jul 24 11:40:31 pcmk-1 corosync[8963]:   [TOTEM ] The consensus timeout expired.
Jul 24 11:40:31 pcmk-1 corosync[8963]:   [TOTEM ] entering GATHER state from 3.
Jul 24 11:40:44 pcmk-1 corosync[8963]:   [TOTEM ] The consensus timeout expired.
Jul 24 11:40:44 pcmk-1 corosync[8963]:   [TOTEM ] entering GATHER state from 3.
Jul 24 11:40:54 pcmk-1 pacemakerd: [8986]: debug: cluster_connect_cfg: Retrying operation after 30s
Jul 24 11:40:56 pcmk-1 corosync[8963]:   [TOTEM ] The consensus timeout expired.
Jul 24 11:40:56 pcmk-1 corosync[8963]:   [TOTEM ] entering GATHER state from 3.
Jul 24 11:41:09 pcmk-1 corosync[8963]:   [TOTEM ] The consensus timeout expired.
Jul 24 11:41:09 pcmk-1 corosync[8963]:   [TOTEM ] entering GATHER state from 3.
Jul 24 11:41:21 pcmk-1 corosync[8963]:   [TOTEM ] The consensus timeout expired.
Jul 24 11:41:21 pcmk-1 corosync[8963]:   [TOTEM ] entering GATHER state from 3.
Jul 24 11:41:24 pcmk-1 pacemakerd: [8986]: ERROR: cluster_connect_cfg: corosync cfg stack_track error 6
Comment 15 Andrew Beekhof 2010-07-24 11:21:02 EDT
Doh!
Seems iptables re-enabled itself again after the last update.
Sorry for the noise :-(
Comment 16 Andrew Beekhof 2010-07-26 09:39:55 EDT
I can confirm the update resolves the issue for Pacemaker (which was a very reliable reproducer).
I ran tests continuously over the weekend and didn't hit it once.
Comment 18 Dean Jansa 2010-08-30 15:06:38 EDT
Tested with the suggestion in comment 2. I used a 5-node cluster and had 4 nodes leave at the same time.
.
.
Aug 30 14:04:58 marathon-01 kernel: dlm: closing connection to node 2
Aug 30 14:04:58 marathon-01 kernel: dlm: closing connection to node 3
Aug 30 14:04:58 marathon-01 kernel: dlm: closing connection to node 5
Aug 30 14:04:58 marathon-01 corosync[19026]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug 30 14:04:58 marathon-01 corosync[19026]:   [QUORUM] Members[1]: 1
Aug 30 14:04:58 marathon-01 corosync[19026]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug 30 14:04:58 marathon-01 corosync[19026]:   [CPG   ] downlist received left_list: 4
Aug 30 14:04:58 marathon-01 corosync[19026]:   [CPG   ] chosen downlist from node r(0) ip(10.15.89.71) 
Aug 30 14:04:58 marathon-01 corosync[19026]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug 30 14:04:58 marathon-01 kernel: dlm: closing connection to node 4


Cluster has one remaining node, which sees the "right" state.

In addition, dev testing in comment 16 verifies the fix with the pacemaker test
suite (a very reliable reproducer).
Comment 19 releng-rhel@redhat.com 2010-11-10 17:07:21 EST
Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.
