Bug 594480
| Field | Value |
|---|---|
| Summary | All nodes fail during recovery with stack protector sigabrt |
| Product | Red Hat Enterprise Linux 5 |
| Component | openais |
| Version | 5.5 |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| Reporter | Shane Bradley <sbradley> |
| Assignee | Steven Dake <sdake> |
| QA Contact | Cluster QE <mspqa-list> |
| CC | cluster-maint, dejohnso, edamato, jkortus, jwest, tao |
| Target Milestone | rc |
| Keywords | ZStream |
| Hardware | All |
| OS | Linux |
| Doc Type | Bug Fix |
| Doc Text | In high loss networks, all nodes in a cluster experienced a buffer overflow and aborted when a threshold of unprocessed/not transmitted packets was reached. With this update, even when a significant number of packets is unprocessed/not transmitted, all nodes in a cluster work as expected and do not abort. |
| Clone Of | |
| | 600043 |
| Last Closed | 2011-01-13 23:56:46 UTC |
| Bug Blocks | 600043, 601085, 601086 |

Description
Shane Bradley
2010-05-20 19:33:28 UTC
Based on analysis of the log files in the attached bugzilla, node 2 records a FAILED TO RECEIVE. This indicates node 1 was unable to receive multicast traffic for some period of time (about 33 seconds; speculation: the network driver failed on the platform). After 33 seconds (the token timeout), the network driver is reopened and everything works fine.

The logic of totem in the FAILED TO RECV condition is incorrectly implemented. In this condition, node 1 should have recorded the FAILED TO RECV condition, added itself to the failed list, and formed a singleton ring, while the remaining nodes formed a new configuration. Node 1 should then be fenced. I am uncertain how this patch will affect RHCS in this condition, but the current behaviour at the totem level doesn't match the protocol specification (which leads to a segfault). I could only replicate it by hard-coding a failure to receive into openais.

Created attachment 421038
revision 2145 from whitetank branch
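
The intended FAILED TO RECV handling described in the report can be pictured with a small model. This is only a sketch of the membership split, not openais or totem code: the function name, the dictionary keys, and the idea of returning "fence candidates" are all hypothetical and chosen for illustration.

```python
# Illustrative model only; this is not openais/totem source code.
# The function name and dictionary keys are hypothetical.

def handle_failed_to_recv(members, failing_node):
    """Return the membership split described in the report.

    The node that failed to receive lists itself as failed and forms a
    singleton ring; the remaining nodes form a new configuration without it.
    """
    if failing_node not in members:
        raise ValueError("failing node must be a current member")

    return {
        "failed_list": {failing_node},
        "singleton_ring": {failing_node},
        "new_configuration": set(members) - {failing_node},
        # Assumption: fencing is carried out by the cluster layer above
        # totem, so this model only names the node to be fenced.
        "fence_candidates": {failing_node},
    }


result = handle_failed_to_recv({1, 2, 3}, failing_node=1)
print(result["singleton_ring"], result["new_configuration"])
```

In this model, node 1 ends up alone in its own ring while nodes 2 and 3 continue in a new configuration, which is the outcome the report says the protocol requires.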
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents: In high loss networks, all nodes in a cluster experienced a buffer overflow and aborted when a threshold of unprocessed/not transmitted packets was reached. With this update, even when a significant number of packets is unprocessed/not transmitted, all nodes in a cluster work as expected and do not abort.

An advisory has been issued which should help with the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0100.html