Bug 791235
Summary: | A rare condition can lead to fail to recv | ||||||
---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | RHEL Program Management <pm-rhel> | ||||
Component: | corosync | Assignee: | Jan Friesse <jfriesse> | ||||
Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> | ||||
Severity: | urgent | Docs Contact: | |||||
Priority: | urgent | ||||||
Version: | 6.2 | CC: | cluster-maint, jfriesse, jkortus, jwest, mjuricek, pm-eus, sdake | ||||
Target Milestone: | rc | Keywords: | ZStream | ||||
Target Release: | --- | ||||||
Hardware: | All | ||||||
OS: | Linux | ||||||
Whiteboard: | |||||||
Fixed In Version: | corosync-1.2.3-36.el6_1.4 | Doc Type: | Bug Fix | ||||
Doc Text: |
Previously, the range condition for the update_aru() function could cause incorrect check of message IDs. Due to this, in rare cases, the corosync utility entered the "FAILED TO RECEIVE" state, and so failed to receive multicast packets. With this update, the range value in the update_aru() function is no longer checked for; the fail_to_recv_const constant performs such checks. Now, corosync does not fail to receive packets.
|
Story Points: | --- | ||||
Clone Of: | Environment: | ||||||
Last Closed: | 2012-03-08 14:20:24 UTC | Type: | --- | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Bug Depends On: | 758209, 806901 | ||||||
Bug Blocks: | |||||||
Attachments: |
|
Description
RHEL Program Management
2012-02-16 15:10:42 UTC
Created attachment 567008 [details] 6.1.z-bz791235-1-From-Yunkai-Zhang From: Yunkai Zhang: Today, I have observed one of the reason that corosync running into FAILED TO RECEIVE state. There was five nodes(A,B,C,D,E) in my testing, and I limited the UDP transmission rate of C nodes by iptables command: iptables -A INPUT -i eth0 -p udp -m limit --limit 10000/s --limit-burst 1 -j ACCEPT iptables -A INPUT -i eth0 -p udp -j DROP After one hour later, C node had been missing some MCAST messages, it's state described as following: ==state of C node== my_aru:0x805 my_high_seq_received:0xC2C my_aru_count:7 =>receved MCAST message with seq:806 from B nodes =>enter *message_handler_mcast* =>add this message to regular_sort_queue ... =>enter *update_aru* function => range = (my_high_seq_received - my_aru) = (0xC2C - 0x805) = 1063 => if range>1024, do nothing and and return directly. ==END== According this logic, after (my_high_req_received-my_aru)>1024, my_aru will not be updated though corosync can receive MCAST messages retransmitted by other nodes. But at that timte, my_aru_count was only 7. So the corosync at C node would keep in this status until my_aru_count increased to fail_to_recv_const(the default value is 2500). This was a long time for corosync, but we wasted it. To solve this issue, maybe we can enlarge the range condition in update_aru function? Or we just ingnore the checking of range value, it seems no harmfull, because we have been using fail_to_recv_const to control the things. Signed-off-by: Steven Dake <sdake> Reviewed-by: Jan Friesse <jfriesse> (cherry picked from commit e48ddf99a67754dea056a54f404f3638cf829b9c) Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: Cause Corosync contains improper check of messages ids Consequence Corosync can fail with message "FAILED TO RECEIVE" in some rare scenarios Fix Remove improper check Result Corosync no longer fails in specified scenarion with "FAILED TO RECEIVE" Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. Diffed Contents: @@ -1,11 +1 @@ -Cause +Previously, the range condition for the update_aru() function could cause incorrect check of message IDs. Due to this, in rare cases, the corosync utility entered the "FAILED TO RECEIVE" state, and so failed to receive multicast packets. With this update, the range value in the update_aru() function is no longer checked for; the fail_to_recv_const constant performs such checks. Now, corosync does not fail to receive packets.-Corosync contains improper check of messages ids - -Consequence -Corosync can fail with message "FAILED TO RECEIVE" in some rare scenarios - -Fix -Remove improper check - -Result -Corosync no longer fails in specified scenarion with "FAILED TO RECEIVE" Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2012-0375.html |