Bug 373671 - cman throwing away messages
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: cman-kernel
Hardware: All, OS: Linux
Priority: urgent, Severity: urgent
Assigned To: Christine Caulfield
QA Contact: Cluster QE
Duplicates: 373711
Depends On:
Blocks: 430283 430284
Reported: 2007-11-09 15:22 EST by David Teigland
Modified: 2009-04-16 15:46 EDT (History)
CC List: 6 users

See Also:
Fixed In Version: RHBA-2008-0800
Doc Type: Bug Fix
Last Closed: 2008-07-25 15:09:52 EDT

Attachments: None
Description David Teigland 2007-11-09 15:22:44 EST
Description of problem:

bug 299061 comment #39

Comment 4 Christine Caulfield 2007-11-14 09:18:06 EST
*** Bug 373711 has been marked as a duplicate of this bug. ***
Comment 5 Christine Caulfield 2007-11-14 10:46:39 EST
For the record I'll explain what happens when this bug is hit. It might help
identify other instances.

- cman sends out a message that needs an ACK (e.g. a membership, barrier or SM
message).
- That message gets delayed in the network stack for some reason (e.g. a line
error or a busy network).
- A HELLO message then gets sent out. This will have a sequence number higher
than the message which is still delayed.
- The remote node sees the HELLO message and makes a note of its sequence number.
- The original message finally gets sent out.
- The remote node sees that message, but discards it because its sequence number
is lower than that of the HELLO message it just saw.

The fix is to check the last ACKable message sequence number seen rather than
just the last sequence number seen.
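
As an illustration only, the failure and the fix described above can be modelled
in a few lines of Python. This is a toy sketch, not the cnxman.c code: the class,
field and function names are hypothetical, sequence-number wraparound is ignored,
and HELLO is assumed to be the only message type that needs no ACK.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    seq: int          # sender-assigned sequence number
    needs_ack: bool   # True for membership/barrier/SM, False for HELLO
    kind: str


@dataclass
class BuggyReceiver:
    """Remembers the last sequence number seen on ANY message (assumed model)."""
    last_seq: int = 0
    delivered: list = field(default_factory=list)

    def receive(self, msg: Message) -> bool:
        if msg.needs_ack and msg.seq <= self.last_seq:
            return False  # looks like an old duplicate, so it is discarded
        self.last_seq = max(self.last_seq, msg.seq)
        self.delivered.append(msg.kind)
        return True


@dataclass
class FixedReceiver:
    """Remembers only the last sequence number seen on an ACKable message."""
    last_ackable_seq: int = 0
    delivered: list = field(default_factory=list)

    def receive(self, msg: Message) -> bool:
        if msg.needs_ack:
            if msg.seq <= self.last_ackable_seq:
                return False  # genuine duplicate of an ACKable message
            self.last_ackable_seq = msg.seq
        self.delivered.append(msg.kind)
        return True


def replay(receiver, messages):
    """Feed messages to a receiver in arrival order; return accept/discard flags."""
    return [receiver.receive(m) for m in messages]


# The membership message (seq 1) is delayed and arrives after HELLO (seq 2).
arrival_order = [
    Message(seq=2, needs_ack=False, kind="HELLO"),
    Message(seq=1, needs_ack=True, kind="MEMBERSHIP"),
]

print(replay(BuggyReceiver(), arrival_order))  # [True, False]
print(replay(FixedReceiver(), arrival_order))  # [True, True]
```

In this model the buggy receiver drops the delayed membership message, while the
fixed receiver accepts it and still rejects a later retransmission with the same
sequence number.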

Because of the way transition works it's unlikely to be seen in normal node
up/down use but if a lot of messages are sent (eg during mount/umount) there is
always the chance that one could cross with a HELLO message.

For some reason we are seeing more of these hangs in house. I wonder if this is
increased network load or some dodgy cards starting to fail ;-)

This bug has been in the code since RHEL4U3.
Comment 6 Christine Caulfield 2007-11-14 11:10:45 EST
Checked in for RHEL46

Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/cnxman.c,v  <--  cnxman.c
new revision:; previous revision:
Comment 11 Chris Feist 2007-11-16 15:08:55 EST
Re-opening, is this really a bug?  If it isn't, then we need to pull the fix out
of the RHEL46 branch.
Comment 12 Christine Caulfield 2007-11-19 04:04:19 EST
It is a real bug, yes. And one I can reproduce (with some help!).

It's pretty hard to hit, though, and given the other problems the fix seems to
be causing (though I can't think why), feel free to take it out.
Comment 13 Christine Caulfield 2007-11-19 10:57:51 EST
OK, I've identified a serious problem with this patch. You'd think after last
time I would have learned my lesson really!

I'm on the way to producing a working one, but I'll make a note that /any/
fiddling with the cman protocol needs a LOT of testing before we even think of
putting it into a release.
Comment 14 Christine Caulfield 2007-11-21 04:20:13 EST
A proper fix (I hope) on the RHEL4 branch:

Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/cnxman.c,v  <--  cnxman.c
new revision:; previous revision:
Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/membership.c,v  <--  membership.c
new revision:; previous revision:
Comment 23 Christine Caulfield 2008-01-29 05:19:53 EST
As the fix is already in the RHEL4 branch it should get picked up for 4.7, so
I'll set this to MODIFIED.

I assume the Z-stream for 4.6 will then need checking into the RHEL46 branch?
Comment 26 errata-xmlrpc 2008-07-25 15:09:52 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

