Bug 788742

Summary: cpg_join() fails with rc = 23 (queue full) at node startup
Product: Fedora
Version: rawhide
Component: corosync
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: unspecified
Reporter: Andrew Beekhof <abeekhof>
Assignee: Angus Salkeld <asalkeld>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: agk, asalkeld, fdinitto, jfriesse, sdake
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Last Closed: 2012-06-12 23:56:19 UTC

Description Andrew Beekhof 2012-02-08 22:38:34 UTC
Description of problem:

$subject
Given the timing (node startup), the queue full error seems unlikely to be legitimate. 

Version-Release number of selected component (if applicable):

Source RPM  : corosync-1.99.2-1.fc17.src.rpm
Source RPM  : libqb-0.10.0-1.fc17.src.rpm

How reproducible:

Frequent and fatal

Steps to Reproduce:
1. Start corosync, followed by Pacemaker.
2. Be unlucky, or have another node join the cluster at the same time.
Actual results:

attrd starts trying to connect at 05:08:39 and retries until 05:16:26 before giving up.

Expected results:

Connection eventually succeeds.

Additional info:

The code from init_cpg_connection():
    retries = 0;
    cs_repeat(retries, 30, rc = cpg_join(pcmk_cpg_handle, &pcmk_cpg_group));
    if (rc != CS_OK) {
        crm_err("Could not join the CPG group '%s': %d", crm_system_name, rc);
        goto bail;
    }

The cs_repeat macro:

#define cs_repeat(counter, max, code) do {		\
	code;						\
	if(rc == CS_ERR_TRY_AGAIN || rc == CS_ERR_QUEUE_FULL) {  \
	    counter++;					\
	    crm_debug("Retrying operation after %ds", counter);	\
	    sleep(counter);				\
	} else {                                        \
            break;                                      \
        }                                               \
    } while(counter < max)
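
Note that with max = 30 and sleep(counter), a failing call backs off for
1 + 2 + ... + 30 = 465 seconds in total before giving up, which matches
the roughly 7m47s window between attrd's first attempt at 05:08:39 and
its final error at 05:16:26 in the log below.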

Note that both of the following calls have succeeded before cpg_join() is called.
    cs_repeat(retries, 30, rc = cpg_initialize(&pcmk_cpg_handle, &cpg_callbacks));
    cs_repeat(retries, 30, rc = cpg_local_get(pcmk_cpg_handle, (unsigned int *)nodeid));
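
For reference, a minimal standalone sketch of the failing call sequence,
using only the public CPG API from corosync/cpg.h (the group name and the
error handling are illustrative, not Pacemaker's actual code):

#include <stdio.h>
#include <string.h>
#include <corosync/cpg.h>

int main(void)
{
    cpg_handle_t handle;
    cpg_callbacks_t callbacks = { 0 };  /* no delivery callbacks needed here */
    unsigned int nodeid;
    struct cpg_name group;
    cs_error_t rc;

    rc = cpg_initialize(&handle, &callbacks);  /* succeeds in this report */
    if (rc != CS_OK)
        return 1;

    rc = cpg_local_get(handle, &nodeid);       /* succeeds, but its reply can
                                                  be left unread (comment 2) */
    if (rc != CS_OK)
        return 1;

    strcpy(group.value, "attrd");
    group.length = strlen("attrd");
    rc = cpg_join(handle, &group);             /* fails here with rc = 23,
                                                  i.e. CS_ERR_QUEUE_FULL */
    printf("cpg_join: %d\n", rc);

    cpg_finalize(handle);
    return 0;
}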

Feb 08 05:08:28 notice  [SERV  ] Unloading all Corosync service engines.
Feb 08 05:08:28 info    [QB    ] withdrawing server sockets
Feb 08 05:08:28 notice  [SERV  ] Service engine unloaded: corosync vote quorum service v1.0
Feb 08 05:08:28 info    [QB    ] withdrawing server sockets
Feb 08 05:08:28 notice  [SERV  ] Service engine unloaded: corosync extended virtual synchrony service
Feb 08 05:08:28 info    [QB    ] withdrawing server sockets
Feb 08 05:08:28 notice  [SERV  ] Service engine unloaded: corosync configuration service
Feb 08 05:08:28 info    [QB    ] withdrawing server sockets
Feb 08 05:08:28 notice  [SERV  ] Service engine unloaded: corosync cluster closed process group service v1.01
Feb 08 05:08:28 info    [QB    ] withdrawing server sockets
Feb 08 05:08:28 notice  [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
Feb 08 05:08:28 info    [QB    ] withdrawing server sockets
Feb 08 05:08:28 notice  [SERV  ] Service engine unloaded: corosync profile loading service
Feb 08 05:08:29 info    [QB    ] withdrawing server sockets
Feb 08 05:08:29 notice  [SERV  ] Service engine unloaded: corosync configuration map access
Feb 08 05:08:29 warning [QB    ] epoll_ctl(del): Bad file descriptor (9)
Feb 08 05:08:29 warning [QB    ] epoll_ctl(del): Bad file descriptor (9)
Feb 08 05:08:29 notice  [MAIN  ] Corosync Cluster Engine exiting normally
Feb 08 05:08:37 warning [MAIN  ] nodeid in totem section is deprecated and ignored.
Feb 08 05:08:37 warning [MAIN  ] Please migrate config file to nodelist.
Feb 08 05:08:37 notice  [TOTEM ] Initializing transport (UDP/IP Multicast).
Feb 08 05:08:37 notice  [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Feb 08 05:08:37 notice  [TOTEM ] The network interface [192.168.122.104] is now up.
Feb 08 05:08:37 notice  [SERV  ] Service engine loaded: corosync extended virtual synchrony service [0]
Feb 08 05:08:37 info    [QB    ] server name: evs
Feb 08 05:08:37 notice  [SERV  ] Service engine loaded: corosync configuration service [7]
Feb 08 05:08:37 info    [QB    ] server name: cfg
Feb 08 05:08:37 notice  [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [8]
Feb 08 05:08:37 info    [QB    ] server name: cpg
Feb 08 05:08:37 notice  [SERV  ] Service engine loaded: corosync profile loading service [13]
Feb 08 05:08:37 info    [QB    ] server name: pload
Feb 08 05:08:37 notice  [QUORUM] Using quorum provider corosync_votequorum
Feb 08 05:08:37 notice  [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [15]
Feb 08 05:08:37 info    [QB    ] server name: votequorum
Feb 08 05:08:37 notice  [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [12]
Feb 08 05:08:37 info    [QB    ] server name: quorum
Feb 08 05:08:37 notice  [SERV  ] Service engine loaded: corosync configuration map access [22]
Feb 08 05:08:37 info    [QB    ] server name: cmap
Feb 08 05:08:37 notice  [MAIN  ] Compatibility mode set to whitetank.  Using V1 and V2 of the synchronization engine.
Feb 08 05:08:37 notice  [QUORUM] Members[1]: 1752869056
Feb 08 05:08:37 notice  [TOTEM ] A processor joined or left the membership and a new membership was formed.
Feb 08 05:08:37 notice  [MAIN  ] Completed service synchronization, ready to provide service.
Feb 08 05:08:37 info    [QB    ] IPC credentials authenticated
Feb 08 05:08:38 info    [QB    ] IPC credentials authenticated
Set r/w permissions for uid=169, gid=0 on /var/log/corosync.log
Feb 08 05:08:38 info    [QB    ] IPC credentials authenticated
Feb 08 05:08:38 info    [QB    ] IPC credentials authenticated
Feb 08 05:08:39 info    [QB    ] IPC credentials authenticated
Feb 08 05:08:39 [7088] pcmk-4      attrd: (   cluster.c:186   )  notice: crm_cluster_connect: 	Connecting to cluster infrastructure: corosync
Feb 08 05:08:39 info    [QB    ] IPC credentials authenticated
Feb 08 05:08:39 notice  [QUORUM] Members[4]: 1702537408 1719314624 1736091840 1752869056
Feb 08 05:08:39 notice  [TOTEM ] A processor joined or left the membership and a new membership was formed.
Feb 08 05:08:39 notice  [QUORUM] This node is within the primary component and will provide service.
Feb 08 05:08:39 notice  [QUORUM] Members[4]: 1702537408 1719314624 1736091840 1752869056
Feb 08 05:08:40 notice  [MAIN  ] Completed service synchronization, ready to provide service.
Feb 08 05:08:41 info    [QB    ] IPC credentials authenticated
Feb 08 05:08:42 info    [QB    ] IPC credentials authenticated
Feb 08 05:08:42 info    [QB    ] IPC credentials authenticated
Feb 08 05:14:00 info    [QB    ] IPC credentials authenticated
Feb 08 05:16:26 [7088] pcmk-4      attrd: (  corosync.c:1013  )   error: init_cpg_connection: 	Could not join the CPG group 'attrd': 23
Feb 08 05:16:26 [7088] pcmk-4      attrd: (     attrd.c:581   )   error: main: 	HA Signon failed
Feb 08 05:16:26 [7088] pcmk-4      attrd: (     attrd.c:608   )   error: main: 	Aborting startup

Comment 1 Andrew Beekhof 2012-02-09 00:24:46 UTC
4-node cluster.

Config follows:

totem {

        version: 2

        # How long before declaring a token lost (ms)
        token:          5000

        # How many token retransmits before forming a new configuration
        token_retransmits_before_loss_const: 10

        # How long to wait for join messages in the membership protocol (ms)
        join:           1000

        # How long to wait for consensus to be achieved before starting a new round of membership configuration (ms)
        consensus:      7500

        # Turn off the virtual synchrony filter
        vsftype:        none

        # Number of messages that may be sent by one processor on receipt of the token
        max_messages:   20

        # Stagger sending the node join messages by 1..send_join ms
        send_join:      45

        # Disable encryption
        secauth:	off

        # How many threads to use for encryption/decryption
        threads:   	0

        # Assign a fixed node id
        nodeid:         101

        interface {
                ringnumber: 0
                bindnetaddr: 192.168.122.101
                mcastaddr: 226.94.1.1
                mcastport: 6661
        }
}

logging {
        debug: off
        fileline: off
        to_syslog: yes
        to_stderr: no
        syslog_facility: daemon
        timestamp: on
        to_logfile: yes
        logfile: /var/log/corosync.log
}

amf {
        mode: disabled
}

quorum {
        provider: corosync_votequorum
        expected_votes: 4
        votes: 1
        two_node: 0
        wait_for_all: 0
        last_man_standing: 0
        auto_tie_breaker: 0
}

Comment 2 Angus Salkeld 2012-02-13 10:27:01 UTC
Analysis of what is happening:

cpg_init()
cpg_nodeid_get()
  ^
1] In the first call the send works, but the recv gets EAGAIN.
2] The response message gets written to the response ring buffer anyway.
3] The client retries and sends a new message.
4] Everything appears to work (but only one of the two responses has
   been read).

cpg_join()
5] Does a send and a recv.
6] The recv fails with -ENOBUF, as the caller did not provide enough
   space to read the old cpg_nodeid_get() response.
7] We are now stuck.
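
As a rough illustration of the above, a toy model (a plain queue standing
in for libqb's response ring buffer; all names and sizes here are made up,
this is not libqb code):

#include <stdio.h>

int main(void)
{
    int ring[8];               /* sizes of replies queued for the client */
    int head = 0, tail = 0;

    /* 1]-3] cpg_nodeid_get(): the send works but the first recv gets
       EAGAIN, so the client resends; the server answers both sends and
       two replies end up queued for one logical request. */
    ring[tail++] = 24;         /* reply to the original send */
    ring[tail++] = 24;         /* reply to the retried send */

    /* 4] the retried recv consumes ONE reply and returns OK. */
    head++;

    /* 5]-6] cpg_join(): the next recv dequeues the stale nodeid reply
       instead of the join reply; the caller's buffer is too small for
       it, so the recv fails with -ENOBUF, and it fails the same way on
       every retry because the stale reply is never consumed. */
    int caller_buf = 16;       /* hypothetical, smaller than stale reply */
    if (ring[head] > caller_buf)
        printf("recv: stale %d-byte reply > %d-byte buffer -> ENOBUF\n",
               ring[head], caller_buf);
    return 0;
}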


Solutions:
1] block on the recv (keep retrying if we get EAGAIN)
   this will probably solve most of the problems, but what about
   ETIMEDOUT? the API allows timeouts.

2] send a sequence number and drop any old (defunct) messages
   with the incorrect sequence number.

I prefer "2" but it will mean adding a sequence integer to the
IPC headers (will need a rebuild of corosync)

struct qb_ipc_request_header {
	int32_t id __attribute__ ((aligned(8)));
	int32_t size __attribute__ ((aligned(8)));
	uint32_t sequence __attribute__ ((aligned(8))); // <---- here
} __attribute__ ((aligned(8)));

struct qb_ipc_response_header {
	int32_t id __attribute__ ((aligned(8)));
	int32_t size __attribute__ ((aligned(8)));
	int32_t error __attribute__ ((aligned(8)));
	uint32_t sequence __attribute__ ((aligned(8))); // <---- here
} __attribute__ ((aligned(8)));

Comment 3 Andrew Beekhof 2012-02-13 23:07:26 UTC
Personally I think you want both to avoid further overloading the server.

I.e., retry until $timeout and then discard.

Otherwise, since the upper level is just going to retry anyway, we're just sending duplicate requests to the server.  Better to wait for the original reply if possible.

Comment 4 Angus Salkeld 2012-02-14 06:00:15 UTC
So libqb was missing a bit of logic that coroipc had.

Once a send has succeeded, we really want to wait for the response.
The problem with sem_wait is that it can cause the client to lock up
if corosync exits or dies, so we now wait on the semaphore with a
2-second timeout and use poll to check the liveness of the server.
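
A minimal sketch of that logic, assuming POSIX semaphores and poll (an
illustration of the approach, not the actual code in the commit below):

#include <errno.h>
#include <poll.h>
#include <semaphore.h>
#include <time.h>

/* Wait for the server's response semaphore in 2-second slices; on each
   timeout, poll the IPC socket so a dead server is detected instead of
   blocking the client forever. */
static int wait_for_response(sem_t *response_sem, int server_fd)
{
    for (;;) {
        struct timespec deadline;
        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec += 2;

        if (sem_timedwait(response_sem, &deadline) == 0)
            return 0;                   /* response is ready */
        if (errno != ETIMEDOUT)
            return -errno;              /* EINTR handling elided */

        /* Timed out: is the server still alive? */
        struct pollfd pfd = { .fd = server_fd, .events = POLLIN };
        if (poll(&pfd, 1, 0) > 0 && (pfd.revents & (POLLHUP | POLLERR)))
            return -ENOTCONN;           /* server exited or died */
    }
}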

This has now been committed upstream:
https://github.com/asalkeld/libqb/commit/d633b4e2fa8d9cf07e9df93c1a37aa795aa10c8b

It also requires a change to corosync (but only when used with the latest libqb):

diff --git a/include/corosync/corotypes.h b/include/corosync/corotypes.h
index 87229ad..d5a7d01 100644
--- a/include/corosync/corotypes.h
+++ b/include/corosync/corotypes.h
@@ -108,7 +108,7 @@ typedef enum {
    CS_ERR_SECURITY = 100
 } cs_error_t;
 
-#define CS_IPC_TIMEOUT_MS 1000
+#define CS_IPC_TIMEOUT_MS -1
 
 #define CS_TIME_MS_IN_SEC   1000ULL
 #define CS_TIME_US_IN_SEC   1000000ULL
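
With the timeout set to -1, the corosync library calls now wait
indefinitely (poll-style "no timeout" semantics) for their responses,
which is safe once libqb itself performs the periodic liveness check
described above.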

Comment 5 Fabio Massimo Di Nitto 2012-02-14 07:30:25 UTC
If it fixes the problem for Andrew, corosync change is ACK.

Let's get this into the beta refresh so Andrew's testing can be unblocked.