Created attachment 562090
Tarball including logs, corosync-objctl -a, and blackbox data

Description of problem:
Calls to coroipcc_msg_send_reply_receive() return CS_ERR_TRY_AGAIN over a period of minutes and never succeed.

Version-Release number of selected component (if applicable):

Works:
Name        : corosync                 Relocations: (not relocatable)
Version     : 1.4.1                    Vendor: Red Hat, Inc.
Release     : 5.el6                    Build Date: Tue 31 Jan 2012 10:22:05 CET

Does not work:
Name        : corosync
Version     : 1.4.2
Release     : 1.fc16

How reproducible:
Every time

Steps to Reproduce:
1. install pacemaker from http://www.clusterlabs.org/rpm-test-next (mock built for f-16)
2. start corosync
3. start pacemaker
4. grep for ERROR: and "Peer overloaded: Re-sending message"

Actual results:
Logs of the form:
Feb 14 23:10:26 pcmk-1 cib[14855]: info: get_ais_nodeid: Peer overloaded: Re-sending message (Attempt 8 of 20)

Expected results:
Sending eventually succeeds

Additional info:
Breakpoint 1, corosync_sending_allowed (service=9, id=598, msg=0x7f1dece7f000, sending_allowed_private_data=0xbd66c8) at main.c:1006
1006 {
(gdb) where
#0 corosync_sending_allowed (service=9, id=598, msg=0x7f1dece7f000, sending_allowed_private_data=0xbd66c8) at main.c:1006
#1 0x00007f1df58e7d5d in pthread_ipc_consumer (conn=0xbd2580) at coroipcs.c:698
#2 0x000000392b007d90 in start_thread (arg=0x7f1decb7e440) at pthread_create.c:309
#3 0x000000392a8ef48d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
(gdb) up
#1 0x00007f1df58e7d5d in pthread_ipc_consumer (conn=0xbd2580) at coroipcs.c:698
698 send_ok = api->sending_allowed (conn_info->service,
(gdb) print *header
$1 = {size = 0, id = 598}

size and id are swapped.

In an attempt to work with 1.4 and 2.0 I had lines 31-44 of:
https://github.com/ClusterLabs/pacemaker/blob/eaf865ce03c44529e204147f8eb28714d3142a6b/include/crm/ais.h#L31

But since:

/usr/include/qb/qbipc_common.h:34:struct qb_ipc_request_header {
/usr/include/qb/qbipc_common.h-35- int32_t id __attribute__ ((aligned(8)));
/usr/include/qb/qbipc_common.h-36- int32_t size __attribute__ ((aligned(8)));
/usr/include/qb/qbipc_common.h-37-} __attribute__ ((aligned(8)));

/usr/include/corosync/coroipc_types.h-38-typedef struct {
/usr/include/corosync/coroipc_types.h-39- int size __attribute__((aligned(8)));
/usr/include/corosync/coroipc_types.h-40- int id __attribute__((aligned(8)));
/usr/include/corosync/coroipc_types.h:41:} coroipc_request_header_t __attribute__((aligned(8)));

That won't work. Why do you guys hate me?
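
For anyone following along, here is a minimal sketch of why the two layouts above cannot be used interchangeably. The struct names (qb_style_header, legacy_style_header) and the helper read_as_legacy() are hypothetical stand-ins that only mirror the field order quoted from the two headers; they are not part of corosync, libqb, or pacemaker. The values are chosen to match the gdb output, and which side filled in which field is an assumption.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical mirror of the libqb field order quoted above: id, then size. */
struct qb_style_header {
    int32_t id   __attribute__((aligned(8)));
    int32_t size __attribute__((aligned(8)));
} __attribute__((aligned(8)));

/* Hypothetical mirror of the corosync 1.4 field order: size, then id. */
struct legacy_style_header {
    int size __attribute__((aligned(8)));
    int id   __attribute__((aligned(8)));
} __attribute__((aligned(8)));

/* Both layouts occupy the same number of bytes, so a mismatch is not
 * caught by any size check; only the meaning of each slot differs. */
_Static_assert(sizeof(struct qb_style_header) ==
               sizeof(struct legacy_style_header),
               "headers are expected to have identical size");

/* Reinterpret the raw bytes of a header written in one field order
 * as a header in the other order, as a receiver built against the
 * other definition effectively does. */
static struct legacy_style_header
read_as_legacy(const struct qb_style_header *sent)
{
    struct legacy_style_header seen;

    memcpy(&seen, sent, sizeof(seen));
    return seen;
}
```

A sender that sets id = 0 and size = 598 in the first layout is read back through read_as_legacy() as size = 0, id = 598, which is the same shape as the header printed in the gdb session.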
A related patch has been committed upstream: https://github.com/beekhof/pacemaker/commit/465232a
A related patch has been committed upstream: https://github.com/beekhof/pacemaker/commit/7fe02af
I can confirm these patches fix the issue for me.
Changing component and closing