| Summary: | cpg_join() fails with rc = 23 (queue full) at node startup | | |
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Andrew Beekhof <abeekhof> |
| Component: | corosync | Assignee: | Angus Salkeld <asalkeld> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | rawhide | CC: | agk, asalkeld, fdinitto, jfriesse, sdake |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2012-06-12 23:56:19 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
4 node cluster.
Config follows:
totem {
version: 2
# How long before declaring a token lost (ms)
token: 5000
# How many token retransmits before forming a new configuration
token_retransmits_before_loss_const: 10
# How long to wait for join messages in the membership protocol (ms)
join: 1000
# How long to wait for consensus to be achieved before starting a new round of membership configuration (ms)
consensus: 7500
# Turn off the virtual synchrony filter
vsftype: none
# Number of messages that may be sent by one processor on receipt of the token
max_messages: 20
# Stagger sending the node join messages by 1..send_join ms
send_join: 45
# Disable encryption
secauth: off
# How many threads to use for encryption/decryption
threads: 0
# Assign a fixed node id
nodeid: 101
interface {
ringnumber: 0
bindnetaddr: 192.168.122.101
mcastaddr: 226.94.1.1
mcastport: 6661
}
}
logging {
debug: off
fileline: off
to_syslog: yes
to_stderr: no
syslog_facility: daemon
timestamp: on
to_logfile: yes
logfile: /var/log/corosync.log
}
amf {
mode: disabled
}
quorum {
provider: corosync_votequorum
expected_votes: 4
votes: 1
two_node: 0
wait_for_all: 0
last_man_standing: 0
auto_tie_breaker: 0
}
Analysis of what is happening:
cpg_init()
cpg_nodeid_get()   <-- steps 1-4 below happen during this call
1] in the first call the send works but the recv gets an EAGAIN
2] the response message gets written to the response ringbuffer
3] it retries and sends a new message
4] all works fine (one out of 2 responses read)
cpg_join()
5] does a send and a recv
6] the recv fails with -ENOBUFS, as there is not enough space
   provided by the caller to read the old cpg_nodeid_get()
   response.
7] we are now stuck (a toy model of this sequence follows).
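To make the failure mode concrete, here is a toy model of steps 1-6. This is illustrative C only, not corosync or libqb code; the message names and sizes are invented. The point is that a receive cannot consume a queued response that is larger than the buffer the caller supplied, so the leftover cpg_nodeid_get() reply blocks everything behind it.

/*
 * Toy model of the failure above -- illustrative only, not corosync/libqb
 * code.  The "response ring buffer" is a FIFO of (name, size) messages and
 * client_recv() refuses to consume a head message that does not fit.
 */
#include <errno.h>
#include <stdio.h>

struct msg { const char *name; size_t size; };

static struct msg queue[8];          /* stands in for the response ring buffer */
static int q_head, q_tail;

static void server_reply(const char *name, size_t size)
{
    queue[q_tail++] = (struct msg){ name, size };
}

static int client_recv(size_t buflen)
{
    if (q_head == q_tail)
        return -EAGAIN;              /* nothing to read yet */
    if (queue[q_head].size > buflen)
        return -ENOBUFS;             /* head message does not fit: not consumed */
    printf("read '%s'\n", queue[q_head].name);
    q_head++;
    return 0;
}

int main(void)
{
    int rc;

    /* 1] cpg_nodeid_get(): the send works, but the reply is not there yet */
    rc = client_recv(64);                        /* -EAGAIN */
    /* 2] + 3] the reply lands in the ring buffer, but the client has
     *         already retried, so a second reply is produced as well     */
    server_reply("nodeid_get reply #1", 64);
    server_reply("nodeid_get reply #2", 64);
    /* 4] the retry reads one of the two replies and returns success      */
    rc = client_recv(64);
    /* 5] + 6] cpg_join() supplies a smaller buffer, hits the leftover
     *         nodeid_get reply and fails with -ENOBUFS on every attempt  */
    server_reply("join reply", 32);
    rc = client_recv(32);
    printf("cpg_join recv rc = %d%s\n", rc, rc == -ENOBUFS ? " (stuck)" : "");
    return 0;
}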
Solutions:
1] block on recv (keep retrying if we get EAGAIN)
   this will probably solve most of the problems, but what about
   ETIMEDOUT? The API allows timeouts.
2] send a sequence number and drop any old (defunct) messages
with the incorrect sequence number.
I prefer "2" but it will mean adding a sequence integer to the
IPC headers (will need a rebuild of corosync)
struct qb_ipc_request_header {
int32_t id __attribute__ ((aligned(8)));
int32_t size __attribute__ ((aligned(8)));
uint32_t sequence __attribute__ ((aligned(8))); // <---- here
} __attribute__ ((aligned(8)));
struct qb_ipc_response_header {
int32_t id __attribute__ ((aligned(8)));
int32_t size __attribute__ ((aligned(8)));
int32_t error __attribute__ ((aligned(8)));
uint32_t sequence __attribute__ ((aligned(8))); // <---- here
} __attribute__ ((aligned(8)));
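A rough sketch of how the client side could use such a sequence field. This is illustrative only, not the actual libqb implementation: ipc_recv_one() is a hypothetical stand-in for the raw ring-buffer read, and the header simply mirrors the proposed struct above.

#include <errno.h>
#include <stdint.h>
#include <sys/types.h>

/* Mirrors the proposed qb_ipc_response_header above, with the extra field. */
struct qb_ipc_response_header {
    int32_t id __attribute__ ((aligned(8)));
    int32_t size __attribute__ ((aligned(8)));
    int32_t error __attribute__ ((aligned(8)));
    uint32_t sequence __attribute__ ((aligned(8)));
} __attribute__ ((aligned(8)));

/* Hypothetical raw read from the response ring buffer: returns the number
 * of bytes read, or a negative errno (-EAGAIN, -ENOBUFS, ...). */
extern ssize_t ipc_recv_one(void *buf, size_t buflen);

/* Keep reading until the response whose sequence number matches the request
 * we just sent arrives; replies to older, abandoned requests are dropped
 * instead of blocking the connection. */
static ssize_t ipc_recv_matching(uint32_t want_seq, void *buf, size_t buflen)
{
    struct qb_ipc_response_header *hdr = buf;
    ssize_t rc;

    for (;;) {
        rc = ipc_recv_one(buf, buflen);
        if (rc < 0) {
            return rc;                /* error handling stays as before */
        }
        if (hdr->sequence == want_seq) {
            return rc;                /* this is the reply we asked for */
        }
        /* stale (defunct) response: discard it and keep reading */
    }
}

Note that dropping by sequence number only helps if the stale reply can actually be read; step 6 above shows it may not even fit the caller's buffer, which is part of why the discussion below argues for also waiting properly for the original reply.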
Personally I think you want both, to avoid further overloading the server. Ie. retry until $timeout and then discard. Otherwise, since the upper level is just going to retry anyway, we're just sending duplicate requests to the server. Better to wait for the original reply if possible.

So libqb was missing a bit of logic that coroipc had. Once a send has succeeded we really want to wait for the response. The problem with sem_wait is that it can cause the client to lock up if corosync exits/dies. So we call sem_wait with a timeout of 2 seconds and use poll to check the liveness of the server (sketched below). This has now been committed upstream:

https://github.com/asalkeld/libqb/commit/d633b4e2fa8d9cf07e9df93c1a37aa795aa10c8b

It also requires a change to corosync (but only if using with the latest libqb):

diff --git a/include/corosync/corotypes.h b/include/corosync/corotypes.h
index 87229ad..d5a7d01 100644
--- a/include/corosync/corotypes.h
+++ b/include/corosync/corotypes.h
@@ -108,7 +108,7 @@ typedef enum {
 	CS_ERR_SECURITY = 100
 } cs_error_t;
 
-#define CS_IPC_TIMEOUT_MS 1000
+#define CS_IPC_TIMEOUT_MS -1
 
 #define CS_TIME_MS_IN_SEC 1000ULL
 #define CS_TIME_US_IN_SEC 1000000ULL

If it fixes the problem for Andrew, the corosync change is ACK. Let's get this into the beta refresh so Andrew's testing can be unblocked.
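A minimal sketch of the wait logic described above, assuming a POSIX semaphore signalled by the server and a connection socket the client can poll. This is illustrative only, not the actual libqb commit; wait_for_response(), resp_ready and server_fd are invented names.

#include <errno.h>
#include <poll.h>
#include <semaphore.h>
#include <time.h>

/* Wait for the server's "response ready" semaphore in 2-second slices so a
 * dead corosync cannot leave the client blocked in sem_wait() forever.
 * timeout_ms mirrors CS_IPC_TIMEOUT_MS: -1 means keep waiting indefinitely
 * (as long as the server is still alive). */
static int wait_for_response(sem_t *resp_ready, int server_fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = server_fd, .events = POLLIN };
    struct timespec ts;
    int waited_ms = 0;

    for (;;) {
        clock_gettime(CLOCK_REALTIME, &ts);
        ts.tv_sec += 2;                              /* 2-second slice */

        if (sem_timedwait(resp_ready, &ts) == 0) {
            return 0;                                /* response is available */
        }
        if (errno != ETIMEDOUT) {
            return -errno;                           /* real error, give up */
        }
        /* Slice expired: check the server is still there before waiting
         * again, so a crashed corosync does not hang the client. */
        if (poll(&pfd, 1, 0) > 0 && (pfd.revents & (POLLHUP | POLLERR))) {
            return -ENOTCONN;
        }
        waited_ms += 2000;
        if (timeout_ms >= 0 && waited_ms >= timeout_ms) {
            return -ETIMEDOUT;                       /* caller's overall timeout */
        }
    }
}

With CS_IPC_TIMEOUT_MS changed to -1 in the corosync diff above, the client keeps waiting for the original reply (while still noticing a dead server) rather than resending and piling duplicate requests into the queue.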
Description of problem:
$subject

Given the timing (node startup), the queue full error seems unlikely to be legitimate.

Version-Release number of selected component (if applicable):
Source RPM : corosync-1.99.2-1.fc17.src.rpm
Source RPM : libqb-0.10.0-1.fc17.src.rpm

How reproducible:
Frequent and fatal

Steps to Reproduce:
1. Corosync is started followed by Pacemaker.
2. Be unlucky, or possibly have another node join the cluster at the same time
3.

Actual results:
attrd starts trying to connect at 05:08:39 and retries until 05:16:26 before giving up.

Expected results:
Connection eventually succeeds.

Additional info:
The code from init_cpg_connection():

    retries = 0;
    cs_repeat(retries, 30, rc = cpg_join(pcmk_cpg_handle, &pcmk_cpg_group));
    if (rc != CS_OK) {
        crm_err("Could not join the CPG group '%s': %d", crm_system_name, rc);
        goto bail;
    }

The cs_repeat macro:

#define cs_repeat(counter, max, code) do {                              \
        code;                                                           \
        if(rc == CS_ERR_TRY_AGAIN || rc == CS_ERR_QUEUE_FULL) {         \
            counter++;                                                  \
            crm_debug("Retrying operation after %ds", counter);         \
            sleep(counter);                                             \
        } else {                                                        \
            break;                                                      \
        }                                                               \
    } while(counter < max)

Note that both of the following calls have succeeded before cpg_join() is called.

    cs_repeat(retries, 30, rc = cpg_initialize(&pcmk_cpg_handle, &cpg_callbacks));
    cs_repeat(retries, 30, rc = cpg_local_get(pcmk_cpg_handle, (unsigned int *)nodeid));

Feb 08 05:08:28 notice [SERV ] Unloading all Corosync service engines.
Feb 08 05:08:28 info [QB ] withdrawing server sockets
Feb 08 05:08:28 notice [SERV ] Service engine unloaded: corosync vote quorum service v1.0
Feb 08 05:08:28 info [QB ] withdrawing server sockets
Feb 08 05:08:28 notice [SERV ] Service engine unloaded: corosync extended virtual synchrony service
Feb 08 05:08:28 info [QB ] withdrawing server sockets
Feb 08 05:08:28 notice [SERV ] Service engine unloaded: corosync configuration service
Feb 08 05:08:28 info [QB ] withdrawing server sockets
Feb 08 05:08:28 notice [SERV ] Service engine unloaded: corosync cluster closed process group service v1.01
Feb 08 05:08:28 info [QB ] withdrawing server sockets
Feb 08 05:08:28 notice [SERV ] Service engine unloaded: corosync cluster quorum service v0.1
Feb 08 05:08:28 info [QB ] withdrawing server sockets
Feb 08 05:08:28 notice [SERV ] Service engine unloaded: corosync profile loading service
Feb 08 05:08:29 info [QB ] withdrawing server sockets
Feb 08 05:08:29 notice [SERV ] Service engine unloaded: corosync configuration map access
Feb 08 05:08:29 warning [QB ] epoll_ctl(del): Bad file descriptor (9)
Feb 08 05:08:29 warning [QB ] epoll_ctl(del): Bad file descriptor (9)
Feb 08 05:08:29 notice [MAIN ] Corosync Cluster Engine exiting normally
Feb 08 05:08:37 warning [MAIN ] nodeid in totem section is deprecated and ignored.
Feb 08 05:08:37 warning [MAIN ] Please migrate config file to nodelist.
Feb 08 05:08:37 notice [TOTEM ] Initializing transport (UDP/IP Multicast).
Feb 08 05:08:37 notice [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Feb 08 05:08:37 notice [TOTEM ] The network interface [192.168.122.104] is now up.
Feb 08 05:08:37 notice [SERV ] Service engine loaded: corosync extended virtual synchrony service [0]
Feb 08 05:08:37 info [QB ] server name: evs
Feb 08 05:08:37 notice [SERV ] Service engine loaded: corosync configuration service [7]
Feb 08 05:08:37 info [QB ] server name: cfg
Feb 08 05:08:37 notice [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [8]
Feb 08 05:08:37 info [QB ] server name: cpg
Feb 08 05:08:37 notice [SERV ] Service engine loaded: corosync profile loading service [13]
Feb 08 05:08:37 info [QB ] server name: pload
Feb 08 05:08:37 notice [QUORUM] Using quorum provider corosync_votequorum
Feb 08 05:08:37 notice [SERV ] Service engine loaded: corosync vote quorum service v1.0 [15]
Feb 08 05:08:37 info [QB ] server name: votequorum
Feb 08 05:08:37 notice [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [12]
Feb 08 05:08:37 info [QB ] server name: quorum
Feb 08 05:08:37 notice [SERV ] Service engine loaded: corosync configuration map access [22]
Feb 08 05:08:37 info [QB ] server name: cmap
Feb 08 05:08:37 notice [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
Feb 08 05:08:37 notice [QUORUM] Members[1]: 1752869056
Feb 08 05:08:37 notice [TOTEM ] A processor joined or left the membership and a new membership was formed.
Feb 08 05:08:37 notice [MAIN ] Completed service synchronization, ready to provide service.
Feb 08 05:08:37 info [QB ] IPC credentials authenticated
Feb 08 05:08:38 info [QB ] IPC credentials authenticated
Set r/w permissions for uid=169, gid=0 on /var/log/corosync.log
Feb 08 05:08:38 info [QB ] IPC credentials authenticated
Feb 08 05:08:38 info [QB ] IPC credentials authenticated
Feb 08 05:08:39 info [QB ] IPC credentials authenticated
Feb 08 05:08:39 [7088] pcmk-4 attrd: ( cluster.c:186 ) notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Feb 08 05:08:39 info [QB ] IPC credentials authenticated
Feb 08 05:08:39 notice [QUORUM] Members[4]: 1702537408 1719314624 1736091840 1752869056
Feb 08 05:08:39 notice [TOTEM ] A processor joined or left the membership and a new membership was formed.
Feb 08 05:08:39 notice [QUORUM] This node is within the primary component and will provide service.
Feb 08 05:08:39 notice [QUORUM] Members[4]: 1702537408 1719314624 1736091840 1752869056
Feb 08 05:08:40 notice [MAIN ] Completed service synchronization, ready to provide service.
Feb 08 05:08:41 info [QB ] IPC credentials authenticated
Feb 08 05:08:42 info [QB ] IPC credentials authenticated
Feb 08 05:08:42 info [QB ] IPC credentials authenticated
Feb 08 05:14:00 info [QB ] IPC credentials authenticated
Feb 08 05:16:26 [7088] pcmk-4 attrd: ( corosync.c:1013 ) error: init_cpg_connection: Could not join the CPG group 'attrd': 23
Feb 08 05:16:26 [7088] pcmk-4 attrd: ( attrd.c:581 ) error: main: HA Signon failed
Feb 08 05:16:26 [7088] pcmk-4 attrd: ( attrd.c:608 ) error: main: Aborting startup