Bug 790627 - Cannot send cluster messages
Summary: Cannot send cluster messages
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Fedora
Classification: Fedora
Component: pacemaker
Version: 16
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Andrew Beekhof
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2012-02-14 23:34 UTC by Andrew Beekhof
Modified: 2012-02-16 00:13 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-02-16 00:13:58 UTC
Type: ---


Attachments
Tarball including logs, corosync-objctl -a, and blackbox data (623.98 KB, application/x-bzip2)
2012-02-14 23:34 UTC, Andrew Beekhof

Description Andrew Beekhof 2012-02-14 23:34:54 UTC
Created attachment 562090
Tarball including logs, corosync-objctl -a, and blackbox data

Description of problem:

Calls to coroipcc_msg_send_reply_receive() return CS_ERR_TRY_AGAIN over a period of minutes and never succeed.

Version-Release number of selected component (if applicable):

Works:
Name        : corosync                     Relocations: (not relocatable)
Version     : 1.4.1                             Vendor: Red Hat, Inc.
Release     : 5.el6                         Build Date: Tue 31 Jan 2012 10:22:05 CET

Does not work:
Name        : corosync
Version     : 1.4.2
Release     : 1.fc16

How reproducible:

Every time

Steps to Reproduce:
1. install pacemaker from http://www.clusterlabs.org/rpm-test-next (mock built for f-16)
2. start corosync
3. start pacemaker
4. grep the logs for "ERROR:" and "Peer overloaded: Re-sending message"
  
Actual results:

Logs of the form:

Feb 14 23:10:26 pcmk-1 cib[14855]:     info: get_ais_nodeid: Peer overloaded: Re-sending message (Attempt 8 of 20)

Expected results:

Sending eventually succeeds
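
For illustration only, the "Attempt 8 of 20" pattern in the log above can be sketched as a bounded retry loop. This is not pacemaker's actual code: the status enum, send_with_retry(), and the fake_peer stub are hypothetical names, and a real caller would presumably pause between attempts.

```c
#include <stdio.h>

/* Hypothetical status codes; the real ones (e.g. CS_ERR_TRY_AGAIN) come from corosync's headers. */
enum send_status { SEND_OK, SEND_TRY_AGAIN, SEND_FAILED };

/* Hypothetical send callback so the loop can be exercised without a cluster. */
typedef enum send_status (*send_fn)(void *ctx);

/* Retry while the peer reports "try again", up to max_attempts.
 * Returns the attempt number that succeeded, or -1 on failure. */
static int send_with_retry(send_fn send, void *ctx, int max_attempts)
{
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        enum send_status rc = send(ctx);
        if (rc == SEND_OK) {
            return attempt;
        }
        if (rc != SEND_TRY_AGAIN) {
            return -1; /* hard error: retrying will not help */
        }
        fprintf(stderr, "Peer overloaded: Re-sending message (Attempt %d of %d)\n",
                attempt, max_attempts);
    }
    return -1; /* the peer never accepted the message */
}

/* Stub peer that reports "try again" busy_left times, then accepts. */
struct fake_peer { int busy_left; int calls; };

static enum send_status fake_send(void *ctx)
{
    struct fake_peer *peer = ctx;
    peer->calls++;
    if (peer->busy_left > 0) {
        peer->busy_left--;
        return SEND_TRY_AGAIN;
    }
    return SEND_OK;
}
```

A peer that is only briefly busy lets the loop succeed after a few attempts. The behaviour reported here corresponds to a peer that answers "try again" on every attempt, so the loop exhausts its budget.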

Additional info:

Comment 1 Andrew Beekhof 2012-02-15 09:19:58 UTC
Breakpoint 1, corosync_sending_allowed (service=9, id=598, msg=0x7f1dece7f000, sending_allowed_private_data=0xbd66c8) at main.c:1006
1006	{
(gdb) where
#0  corosync_sending_allowed (service=9, id=598, msg=0x7f1dece7f000, sending_allowed_private_data=0xbd66c8) at main.c:1006
#1  0x00007f1df58e7d5d in pthread_ipc_consumer (conn=0xbd2580) at coroipcs.c:698
#2  0x000000392b007d90 in start_thread (arg=0x7f1decb7e440) at pthread_create.c:309
#3  0x000000392a8ef48d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
(gdb) up
#1  0x00007f1df58e7d5d in pthread_ipc_consumer (conn=0xbd2580) at coroipcs.c:698
698			send_ok = api->sending_allowed (conn_info->service,
(gdb) print *header
$1 = {size = 0, id = 598}

size and id are swapped.

In an attempt to work with both 1.4 and 2.0, I had lines 31-44 of:
   https://github.com/ClusterLabs/pacemaker/blob/eaf865ce03c44529e204147f8eb28714d3142a6b/include/crm/ais.h#L31

But since:

/usr/include/qb/qbipc_common.h:34:struct qb_ipc_request_header {
/usr/include/qb/qbipc_common.h-35-	int32_t id __attribute__ ((aligned(8)));
/usr/include/qb/qbipc_common.h-36-	int32_t size __attribute__ ((aligned(8)));
/usr/include/qb/qbipc_common.h-37-} __attribute__ ((aligned(8)));

/usr/include/corosync/coroipc_types.h-38-typedef struct {
/usr/include/corosync/coroipc_types.h-39-	int size __attribute__((aligned(8)));
/usr/include/corosync/coroipc_types.h-40-	int id __attribute__((aligned(8)));
/usr/include/corosync/coroipc_types.h:41:} coroipc_request_header_t __attribute__((aligned(8)));

That won't work.
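
A minimal sketch of why the two layouts are incompatible, assuming GCC or Clang. The struct names and view_as_coroipc() are hypothetical stand-ins that copy only the field order and alignment from the two headers quoted above. The sample values (id 0, size 598) are inferred from the gdb output, not taken from the actual message.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the libqb layout quoted above: id first, then size. */
struct qb_style_header {
    int32_t id __attribute__((aligned(8)));
    int32_t size __attribute__((aligned(8)));
} __attribute__((aligned(8)));

/* Stand-in for the corosync 1.4 layout quoted above: size first, then id. */
struct coroipc_style_header {
    int size __attribute__((aligned(8)));
    int id __attribute__((aligned(8)));
} __attribute__((aligned(8)));

/* Both layouts occupy the same number of bytes, so nothing fails loudly. */
_Static_assert(sizeof(struct qb_style_header) == sizeof(struct coroipc_style_header),
               "headers are expected to have the same size");

/* View the bytes a sender wrote through the receiver's layout. */
static struct coroipc_style_header
view_as_coroipc(const struct qb_style_header *sent)
{
    struct coroipc_style_header seen;

    assert(sent != NULL);
    memcpy(&seen, sent, sizeof(seen));
    return seen;
}
```

A sender that fills in id = 0 and size = 598 using the first layout is read by the second as size = 0 and id = 598, which matches what gdb printed for *header.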

Why do you guys hate me?

Comment 2 Andrew Beekhof 2012-02-15 10:06:42 UTC
A related patch has been committed upstream: https://github.com/beekhof/pacemaker/commit/465232a

Comment 3 Andrew Beekhof 2012-02-15 10:06:58 UTC
A related patch has been committed upstream: https://github.com/beekhof/pacemaker/commit/7fe02af

Comment 4 David Vossel 2012-02-15 15:29:25 UTC
I can confirm these patches fix the issue for me.

Comment 5 Andrew Beekhof 2012-02-16 00:13:58 UTC
Changing component and closing

