Bug 373671 - cman throwing away messages
Summary: cman throwing away messages
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: cman-kernel
Version: 4
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Assignee: Christine Caulfield
QA Contact: Cluster QE
URL:
Whiteboard:
Duplicates: 373711
Depends On:
Blocks: 430283 430284
 
Reported: 2007-11-09 20:22 UTC by David Teigland
Modified: 2009-04-16 19:46 UTC (History)

Fixed In Version: RHBA-2008-0800
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-07-25 19:09:52 UTC
Embargoed:



Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2008:0800 0 normal SHIPPED_LIVE cman-kernel bug fix update 2008-07-25 19:09:49 UTC

Description David Teigland 2007-11-09 20:22:44 UTC
Description of problem:

bug 299061 comment #39


Comment 4 Christine Caulfield 2007-11-14 14:18:06 UTC
*** Bug 373711 has been marked as a duplicate of this bug. ***

Comment 5 Christine Caulfield 2007-11-14 15:46:39 UTC
For the record I'll explain what happens when this bug is hit. It might help
identify other instances.

- cman sends out a message that needs an ACK (e.g. a membership, barrier or SM
message)
- That message gets delayed in the network stack for some reason (e.g. line error
or busy network)
- A HELLO message then gets sent out. This will have a sequence number higher
than the message which is still delayed
- The remote node sees the HELLO message and makes a note of the sequence number
- The original message finally gets sent out
- The remote node sees that message, but discards it because its sequence number is
lower than that of the HELLO message it just saw.

The fix is to check the last ACKable message sequence number seen rather than
just the last sequence number seen.
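
As a rough illustration of the sequence above (a toy model, not the actual
cnxman.c code or the patch that was checked in), the two receiver policies can
be sketched in Python. All names here (Message, BuggyReceiver, FixedReceiver,
replay, last_seq_seen, last_ackable_seq_seen) are hypothetical and chosen for
this example only; sequence-number wraparound is deliberately ignored.

```python
from dataclasses import dataclass


@dataclass
class Message:
    seq: int         # sender-assigned sequence number
    needs_ack: bool  # True for membership/barrier/SM, False for HELLO
    kind: str


class BuggyReceiver:
    """Remembers the last sequence number seen on ANY message."""

    def __init__(self):
        self.last_seq_seen = 0  # hypothetical field name

    def receive(self, msg):
        # An ACKable message at or below the last seen number is
        # treated as a stale duplicate and thrown away.
        if msg.needs_ack and msg.seq <= self.last_seq_seen:
            return False
        self.last_seq_seen = max(self.last_seq_seen, msg.seq)
        return True


class FixedReceiver:
    """Remembers only the last ACKable sequence number seen."""

    def __init__(self):
        self.last_ackable_seq_seen = 0  # hypothetical field name

    def receive(self, msg):
        if msg.needs_ack:
            if msg.seq <= self.last_ackable_seq_seen:
                return False  # genuine duplicate of an ACKable message
            self.last_ackable_seq_seen = msg.seq
        # HELLO messages never move the duplicate-detection mark.
        return True


def replay(receiver, messages):
    """Feed messages in arrival order; return accept/discard per message."""
    return [receiver.receive(m) for m in messages]


# Arrival order at the remote node: the HELLO overtakes the delayed message.
hello = Message(seq=2, needs_ack=False, kind="HELLO")
delayed = Message(seq=1, needs_ack=True, kind="MEMBERSHIP")

print(replay(BuggyReceiver(), [hello, delayed]))  # delayed message discarded
print(replay(FixedReceiver(), [hello, delayed]))  # delayed message accepted
```

In this model the fixed receiver still rejects a true retransmission of the
same ACKable message, because the duplicate check is kept and only the
bookkeeping that feeds it changes.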

Because of the way transition works, it's unlikely to be seen in normal node
up/down use, but if a lot of messages are sent (e.g. during mount/umount) there is
always the chance that one could cross with a HELLO message.

For some reason we are seeing more of these hangs in house. I wonder if this is
increased network load or some dodgy cards starting to fail ;-)

This bug has been in the code since RHEL4U3.

Comment 6 Christine Caulfield 2007-11-14 16:10:45 UTC
Checked in for RHEL46

Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/cnxman.c,v  <--  cnxman.c
new revision: 1.42.2.29.2.1; previous revision: 1.42.2.29
done


Comment 11 Chris Feist 2007-11-16 20:08:55 UTC
Re-opening, is this really a bug?  If it isn't, then we need to pull the fix out
of the RHEL46 branch.

Comment 12 Christine Caulfield 2007-11-19 09:04:19 UTC
It is a real bug, yes. And one I can reproduce (with some help!).

It's pretty hard to hit though, and given the other problems it seems to be
causing (though I can't think why), feel free to take it out.

Comment 13 Christine Caulfield 2007-11-19 15:57:51 UTC
OK, I've identified a serious problem with this patch. You'd think after last
time I would have learned my lesson really!

I'm on the way to producing a working one, but I'll make a note that /any/
fiddling with the cman protocol needs a LOT of testing before we even think of
putting it into a release.


Comment 14 Christine Caulfield 2007-11-21 09:20:13 UTC
A proper fix (I hope) on the RHEL4 branch:

Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/cnxman.c,v  <--  cnxman.c
new revision: 1.42.2.31; previous revision: 1.42.2.30
done
Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/membership.c,v  <--  membership.c
new revision: 1.44.2.29; previous revision: 1.44.2.28
done

Comment 23 Christine Caulfield 2008-01-29 10:19:53 UTC
As the fix is already in the RHEL4 branch it should get picked up for 4.7, so
I'll set this to MODIFIED.

I assume the Z-stream for 4.6 will then need checking into the RHEL46 branch?

Comment 26 errata-xmlrpc 2008-07-25 19:09:52 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0800.html


