Bug 854216
Summary: [TOTEM] FAILED TO RECEIVE + corosync crash
Product: Red Hat Enterprise Linux 6
Reporter: Jan Pokorný [poki] <jpokorny>
Component: corosync
Assignee: Jan Friesse <jfriesse>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: high
Priority: high
Version: 6.3
CC: abeekhof, cphillip, dubrsl, fyjm2010, jfriesse, jkortus, jruemker, sbradley, sdake, slevine, sradvan
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Fixed In Version: corosync-1.4.1-16.el6
Doc Type: Bug Fix
Doc Text:
Cause: Corosync is running on a faulty (or rate-limited) network.
Consequence: Message packets can be lost. By itself this is not a problem, because corosync is designed to handle lost and duplicated packets. However, if a node fails to receive a message failed_to_recv times (configurable in the configuration file), corosync logs "FAILED TO RECEIVE" and begins a new round of the membership algorithm. Very often (roughly a 75% chance) corosync then segfaults.
Fix: There is an assert inside the function that compares memberships. Because packets may be lost at this point, we can end up in a situation where the failed list is identical to the membership list. Normally this cannot happen and the behavior is unspecified, hence the assert. In the FAILED TO RECEIVE case, however, the assert can be ignored, because right after this function returns, a single-node membership is created anyway.
Result: No segfaults.
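The "failed_to_recv" threshold described above corresponds to the `fail_recv_const` option in the `totem` section of corosync.conf. A minimal sketch of how it might be tuned (the value shown is illustrative; consult corosync.conf(5) for the actual default and semantics):

```
totem {
    version: 2
    # How many times a message may go unreceived before the node
    # declares FAILED TO RECEIVE and starts a new membership round.
    fail_recv_const: 2500
}
```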
Story Points: ---
Clones: 875922 (view as bug list)
Last Closed: 2013-11-21 04:31:26 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: ---
Cloudforms Team: ---
Bug Blocks: 835616, 883504, 960054
Description
Jan Pokorný [poki]
2012-09-04 12:05:55 UTC
Honza, please try the patch in Bug #636583 for Jan.

Created attachment 638650 [details]
Proposed patch
If failed_to_recv is set, consensus can be empty

If failed_to_recv is set (a node detects that it is unable to receive
messages), we can end up hitting the assert, because my_failed_list and
my_member_list are the same list. This happens because we do not follow
the specification and allow a node to mark itself as failed.
Because a single-node membership is created when failed_to_recv is set
and consensus is reached across nodes (ignoring both the fail list and
the member list), we can skip the assert.
Signed-off-by: Jan Friesse <jfriesse>
*** Bug 917914 has been marked as a duplicate of this bug. ***

*** Bug 981111 has been marked as a duplicate of this bug. ***

Verified using failed-to-receive-crash.sh. On corosync-1.4.1-15.el6.x86_64 (RHEL6.4), FAILED TO RECEIVE messages are logged and (for me) in roughly 30% of the cases corosync asserts and dumps a core. On corosync-1.4.1-17.el6.x86_64 (RHEL6.5), the messages are just logged, with no core during the test period:

Sep 11 15:52:01 virt-014 corosync[22115]: [TOTEM ] FAILED TO RECEIVE
Sep 11 15:52:02 virt-014 corosync[22115]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 11 15:52:02 virt-014 corosync[22115]: [CPG ] chosen downlist: sender r(0) ip(10.34.71.14) ; members(old:2 left:1)
Sep 11 15:52:02 virt-014 corosync[22115]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 11 15:52:03 virt-014 corosync[22115]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 11 15:52:11 virt-014 corosync[22115]: [CPG ] chosen downlist: sender r(0) ip(10.34.71.14) ; members(old:1 left:0)
Sep 11 15:52:12 virt-014 corosync[22115]: [MAIN ] Completed service synchronization, ready to provide service.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1531.html