Bug 373671
Summary: | cman throwing away messages | ||
---|---|---|---|
Product: | [Retired] Red Hat Cluster Suite | Reporter: | David Teigland <teigland> |
Component: | cman-kernel | Assignee: | Christine Caulfield <ccaulfie> |
Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | ||
Version: | 4 | CC: | akarlsso, bkahn, cluster-maint, djansa, djuran, teigland |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | RHBA-2008-0800 | Doc Type: | Bug Fix |
Last Closed: | 2008-07-25 19:09:52 UTC | Type: | --- |
Bug Blocks: | 430283, 430284 |
Description
David Teigland
2007-11-09 20:22:44 UTC
*** Bug 373711 has been marked as a duplicate of this bug. ***

For the record I'll explain what happens when this bug is hit. It might help identify other instances.

- cman sends out a message that needs an ACK (e.g. a membership, barrier or SM message).
- That message gets delayed in the network stack for some reason (e.g. a line error or a busy network).
- A HELLO message then gets sent out. This will have a sequence number higher than that of the message which is still delayed.
- The remote node sees the HELLO message and makes a note of the sequence number.
- The original message gets sent out.
- The remote node sees that message, but discards it because its sequence number is lower than that of the HELLO message it just saw.

The fix is to check the last ACKable message sequence number seen rather than just the last sequence number seen.

Because of the way transition works, this is unlikely to be seen in normal node up/down use, but if a lot of messages are sent (e.g. during mount/umount) there is always the chance that one could cross with a HELLO message.

For some reason we are seeing more of these hangs in house. I wonder if this is increased network load or some dodgy cards starting to fail ;-)

This bug has been in the code since RHEL4U3.

Checked in for RHEL46

Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/cnxman.c,v <-- cnxman.c
new revision: 1.42.2.29.2.1; previous revision: 1.42.2.29
done

Re-opening, is this really a bug? If it isn't, then we need to pull the fix out of the RHEL46 branch.

It is a real bug, yes. And one I can reproduce (with some help!). It's pretty hard to hit though, and given the other problems it seems to be causing (though I can't think why), feel free to take it out.

OK, I've identified a serious problem with this patch. You'd think after last time I would have learned my lesson really! I'm on the way to producing a working one, but I'll make a note that /any/ fiddling with the cman protocol needs a LOT of testing before we even think of putting it into a release.

A proper (I hope) fix on the RHEL4 branch:

Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/cnxman.c,v <-- cnxman.c
new revision: 1.42.2.31; previous revision: 1.42.2.30
done
Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/membership.c,v <-- membership.c
new revision: 1.44.2.29; previous revision: 1.44.2.28
done

As the fix is already in the RHEL4 branch it should get picked up for 4.7, so I'll set this to MODIFIED.

The Zstream for 4.6 will need checking into the RHEL46 branch, I assume?

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0800.html
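
For anyone trying to spot other instances, the race described in the comments above can be reproduced in a small model. The sketch below is only an illustration in Python, not the code from cnxman.c: the `Message` and `Receiver` classes, the `track_ackable_only` flag and the `run` helper are all hypothetical names, and the duplicate check is assumed to be a simple "sequence number not higher than the last one recorded" comparison.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Message:
    seq: int         # sequence number assigned by the sender
    kind: str        # e.g. "HELLO" or "MEMBERSHIP" (labels are illustrative)
    needs_ack: bool  # True for ACKable messages, False for HELLO


@dataclass
class Receiver:
    # Hypothetical switch: False models the old behaviour (record the
    # sequence number of every message seen), True models the fix
    # (record only the sequence numbers of ACKable messages).
    track_ackable_only: bool
    last_seq: int = 0
    delivered: List[str] = field(default_factory=list)

    def receive(self, msg: Message) -> bool:
        # An ACKable message at or below the recorded sequence number is
        # assumed to be an old duplicate and is discarded.
        if msg.needs_ack and msg.seq <= self.last_seq:
            return False
        if msg.needs_ack or not self.track_ackable_only:
            self.last_seq = max(self.last_seq, msg.seq)
        self.delivered.append(msg.kind)
        return True


def run(track_ackable_only: bool) -> List[str]:
    # The ACKable message is numbered first but is delayed in the network
    # stack, so the HELLO (numbered later) reaches the remote node first.
    delayed = Message(seq=1, kind="MEMBERSHIP", needs_ack=True)
    hello = Message(seq=2, kind="HELLO", needs_ack=False)
    rx = Receiver(track_ackable_only)
    for msg in (hello, delayed):
        rx.receive(msg)
    return rx.delivered


if __name__ == "__main__":
    print("old check:", run(track_ackable_only=False))
    print("new check:", run(track_ackable_only=True))
```

In this model the old check delivers only the HELLO and silently drops the delayed message, while the new check delivers both and still rejects a genuine retransmission of an ACKable message it has already accepted.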