Bug 373671

Summary: cman throwing away messages
Product: [Retired] Red Hat Cluster Suite
Reporter: David Teigland <teigland>
Component: cman-kernel
Assignee: Christine Caulfield <ccaulfie>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: urgent
Priority: urgent
Version: 4
CC: akarlsso, bkahn, cluster-maint, djansa, djuran, teigland
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Fixed In Version: RHBA-2008-0800
Doc Type: Bug Fix
Last Closed: 2008-07-25 19:09:52 UTC
Bug Blocks: 430283, 430284

Description David Teigland 2007-11-09 20:22:44 UTC
Description of problem:

bug 299061 comment #39


Comment 4 Christine Caulfield 2007-11-14 14:18:06 UTC
*** Bug 373711 has been marked as a duplicate of this bug. ***

Comment 5 Christine Caulfield 2007-11-14 15:46:39 UTC
For the record I'll explain what happens when this bug is hit. It might help
identify other instances.

- cman sends out a message that needs an ACK (e.g. a membership, barrier or SM
message)
- That message gets delayed in the network stack for some reason (e.g. a line
error or a busy network)
- A HELLO message then gets sent out. This will have a sequence number higher
than that of the message which is still delayed
- The remote node sees the HELLO message and makes a note of the sequence number
- The original message finally gets sent out
- The remote node sees that message, but discards it because its sequence number
is lower than that of the HELLO message it just saw.

The fix is to check the last ACKable message sequence number seen rather than
just the last sequence number seen.
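
A minimal sketch of the failure and the fix described above, for illustration only. It is not the code from cnxman.c: the types, field names and function names are all hypothetical, and sequence-number wraparound is ignored.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical message and per-peer state; these are NOT the real cman types. */
struct msg {
    unsigned int seq;   /* sender's sequence number */
    bool needs_ack;     /* true for membership/barrier/SM, false for HELLO */
};

struct peer {
    unsigned int last_seq;          /* highest seq seen on any message */
    unsigned int last_ackable_seq;  /* highest seq seen on an ACKable message */
};

/* Old behaviour: an ACKable message is judged against the last seq of ANY kind. */
static bool accept_old(struct peer *p, const struct msg *m)
{
    if (m->needs_ack && m->seq <= p->last_seq)
        return false;               /* wrongly treated as stale */
    if (m->seq > p->last_seq)
        p->last_seq = m->seq;
    return true;
}

/* Fixed behaviour: an ACKable message is judged against the last ACKable seq only. */
static bool accept_fixed(struct peer *p, const struct msg *m)
{
    if (m->seq > p->last_seq)
        p->last_seq = m->seq;       /* overall seq is still tracked */
    if (!m->needs_ack)
        return true;                /* HELLO: nothing more to check */
    if (m->seq <= p->last_ackable_seq)
        return false;               /* genuinely old or repeated message */
    p->last_ackable_seq = m->seq;
    return true;
}

/* Replays the crossing: HELLO (seq 6) arrives before the delayed ACKable
 * message (seq 5). Returns whether the delayed message was accepted. */
bool delayed_msg_accepted(bool use_fix)
{
    struct peer p = { .last_seq = 4, .last_ackable_seq = 4 };
    struct msg delayed = { .seq = 5, .needs_ack = true };
    struct msg hello = { .seq = 6, .needs_ack = false };

    assert(hello.seq > delayed.seq);    /* precondition of the scenario */
    if (use_fix) {
        accept_fixed(&p, &hello);
        return accept_fixed(&p, &delayed);
    }
    accept_old(&p, &hello);
    return accept_old(&p, &delayed);
}
```

In this sketch, delayed_msg_accepted(false) reports the delayed message as discarded, while delayed_msg_accepted(true) reports it as accepted.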

Because of the way transition works, it's unlikely to be seen in normal node
up/down use, but if a lot of messages are sent (e.g. during mount/umount) there
is always the chance that one could cross with a HELLO message.

For some reason we are seeing more of these hangs in house. I wonder if this is
increased network load or some dodgy cards starting to fail ;-)

This bug has been in the code since RHEL4U3.

Comment 6 Christine Caulfield 2007-11-14 16:10:45 UTC
Checked in for RHEL46

Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/cnxman.c,v  <--  cnxman.c
new revision: 1.42.2.29.2.1; previous revision: 1.42.2.29
done


Comment 11 Chris Feist 2007-11-16 20:08:55 UTC
Re-opening. Is this really a bug? If it isn't, then we need to pull the fix out
of the RHEL46 branch.

Comment 12 Christine Caulfield 2007-11-19 09:04:19 UTC
It is a real bug, yes. And one I can reproduce (with some help!).

It's pretty hard to hit, though, and given the other problems the fix seems to
be causing (though I can't think why), feel free to take it out.

Comment 13 Christine Caulfield 2007-11-19 15:57:51 UTC
OK, I've identified a serious problem with this patch. You'd think after last
time I would have learned my lesson really!

I'm on the way to producing a working one, but I'll make a note that /any/
fiddling with the cman protocol needs a LOT of testing before we even think of
putting it into a release.


Comment 14 Christine Caulfield 2007-11-21 09:20:13 UTC
A proper fix (I hope) on the RHEL4 branch:

Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/cnxman.c,v  <--  cnxman.c
new revision: 1.42.2.31; previous revision: 1.42.2.30
done
Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/membership.c,v  <--  membership.c
new revision: 1.44.2.29; previous revision: 1.44.2.28
done

Comment 23 Christine Caulfield 2008-01-29 10:19:53 UTC
As the fix is already in the RHEL4 branch it should get picked up for 4.7, so
I'll set this to MODIFIED.

I assume the Z-stream for 4.6 will need checking into the RHEL46 branch, then?

Comment 26 errata-xmlrpc 2008-07-25 19:09:52 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0800.html