Bug 791235 - A rare condition can lead to fail to recv
Summary: A rare condition can lead to fail to recv
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: corosync
Version: 6.2
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Jan Friesse
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On: 758209 806901
Blocks:
 
Reported: 2012-02-16 15:10 UTC by RHEL Program Management
Modified: 2012-03-26 13:11 UTC (History)
7 users (show)

Fixed In Version: corosync-1.2.3-36.el6_1.4
Doc Type: Bug Fix
Doc Text:
Previously, the range condition in the update_aru() function could cause message IDs to be checked incorrectly. As a result, in rare cases, corosync entered the "FAILED TO RECEIVE" state and failed to receive multicast packets. With this update, update_aru() no longer checks the range value; the fail_to_recv_const constant governs this check instead. Corosync no longer fails to receive packets in this scenario.
Clone Of:
Environment:
Last Closed: 2012-03-08 14:20:24 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
6.1.z-bz791235-1-From-Yunkai-Zhang (2.20 KB, patch)
2012-03-02 08:48 UTC, Jan Friesse


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHBA-2012:0375 | 0 | normal | SHIPPED_LIVE | corosync bug fix update | 2012-03-08 19:18:55 UTC

Description RHEL Program Management 2012-02-16 15:10:42 UTC
This bug has been copied from bug #758209 and has been proposed
to be backported to 6.1 z-stream (EUS).

Comment 4 Jan Friesse 2012-03-02 08:48:29 UTC
Created attachment 567008 [details]
6.1.z-bz791235-1-From-Yunkai-Zhang


From: Yunkai Zhang:
Today I observed one of the reasons that corosync runs into the
FAILED TO RECEIVE state.

There were five nodes (A, B, C, D, E) in my test, and I limited the rate
of incoming UDP traffic on node C with the following iptables commands:
iptables -A INPUT -i eth0 -p udp -m limit --limit 10000/s --limit-burst 1 -j ACCEPT
iptables -A INPUT -i eth0 -p udp -j DROP

After about one hour, node C had missed some MCAST messages, and
its state was as follows:
==state of C node==
my_aru:0x805
my_high_seq_received:0xC2C
my_aru_count:7

=>received MCAST message with seq:806 from node B
=>enter *message_handler_mcast*
  =>add this message to regular_sort_queue
  ...
  =>enter *update_aru* function
    => range = (my_high_seq_received - my_aru)
             = (0xC2C - 0x805)
             = 1063
    => if range > 1024, do nothing and return directly.
==END==
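
The arithmetic in the trace above can be checked with a small sketch. This is an illustrative Python model, not corosync source code: the function name range_check_blocks_update and the constant name RANGE_LIMIT are hypothetical, and only the values 0x805, 0xC2C and the 1024 threshold come from the trace.

```python
# Illustrative model of the range check described in the trace.
# RANGE_LIMIT and range_check_blocks_update are hypothetical names;
# the threshold 1024 and the state values are taken from the trace.
RANGE_LIMIT = 1024


def range_check_blocks_update(my_aru: int, my_high_seq_received: int) -> bool:
    """Return True when the modelled check makes update_aru return early."""
    return (my_high_seq_received - my_aru) > RANGE_LIMIT


gap = 0xC2C - 0x805
print(f"range = {gap}")
print(f"update skipped: {range_check_blocks_update(0x805, 0xC2C)}")
```

With the state of node C, the gap exceeds the threshold, so the modelled check reports that the update is skipped.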

According to this logic, once (my_high_seq_received - my_aru) > 1024, my_aru
is no longer updated, even though corosync can still receive MCAST messages
retransmitted by other nodes.

But at that time, my_aru_count was only 7. So corosync on node C
would stay in this state until my_aru_count increased to
fail_to_recv_const (the default value is 2500). That is a long time
for corosync, and it was wasted.
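
As a rough worked example of how far node C was from the limit, the following Python sketch computes the remaining increments. It assumes only the two values quoted above (my_aru_count of 7 and the default fail_to_recv_const of 2500); how and when my_aru_count is incremented is not described here, so the sketch does not attempt to convert the result into wall-clock time.

```python
# Values quoted in the comment; variable names are chosen for illustration.
FAIL_TO_RECV_CONST = 2500  # default value stated above
my_aru_count = 7           # observed on node C

# Number of further increments before the FAILED TO RECEIVE limit is hit.
remaining = FAIL_TO_RECV_CONST - my_aru_count
print(remaining)
```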

To solve this issue, maybe we can enlarge the range condition in the
update_aru function? Or we could just ignore the check of the range value;
that seems harmless, because we already use fail_to_recv_const to
control this.
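
The effect of dropping the range check can be sketched with a simplified model. This is not the corosync implementation: the Python function below, its parameters, and the set of received sequence numbers are assumptions made for illustration. It only models the idea that my_aru advances over contiguous received sequence numbers and stops at the first hole.

```python
from typing import Optional, Set


def update_aru(my_aru: int, my_high_seq_received: int, received: Set[int],
               range_limit: Optional[int] = 1024) -> int:
    """Simplified, hypothetical model of update_aru.

    With range_limit set, the function returns early when the gap is too
    large (the old behaviour described in this bug). With range_limit=None
    the check is skipped (the proposed behaviour).
    """
    gap = my_high_seq_received - my_aru
    if range_limit is not None and gap > range_limit:
        return my_aru  # old behaviour: do nothing
    # Advance over contiguous sequence numbers; stop at the first hole.
    for seq in range(my_aru + 1, my_high_seq_received + 1):
        if seq not in received:
            break
        my_aru = seq
    return my_aru


# Hypothetical situation: two retransmitted messages have just arrived.
received = {0x806, 0x807}
with_check = update_aru(0x805, 0xC2C, received)
without_check = update_aru(0x805, 0xC2C, received, range_limit=None)
print(hex(with_check), hex(without_check))
```

In this model my_aru stays stuck when the range check is active, and advances past the retransmitted messages when the check is removed.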

Signed-off-by: Steven Dake <sdake>
Reviewed-by: Jan Friesse <jfriesse>
(cherry picked from commit e48ddf99a67754dea056a54f404f3638cf829b9c)

Comment 6 Jan Friesse 2012-03-07 08:53:21 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause
Corosync contains an improper check of message IDs.

Consequence
Corosync can fail with the message "FAILED TO RECEIVE" in some rare scenarios.

Fix
Remove the improper check.

Result
Corosync no longer fails with "FAILED TO RECEIVE" in the specified scenario.

Comment 8 Eliska Slobodova 2012-03-08 13:12:26 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1,11 +1 @@
-Cause
+Previously, the range condition in the update_aru() function could cause message IDs to be checked incorrectly. As a result, in rare cases, corosync entered the "FAILED TO RECEIVE" state and failed to receive multicast packets. With this update, update_aru() no longer checks the range value; the fail_to_recv_const constant governs this check instead. Corosync no longer fails to receive packets in this scenario.
-Corosync contains improper check of messages ids
-
-Consequence
-Corosync can fail with message "FAILED TO RECEIVE" in some rare scenarios
-
-Fix
-Remove improper check
-
-Result
-Corosync no longer fails in specified scenarion with "FAILED TO RECEIVE"

Comment 9 errata-xmlrpc 2012-03-08 14:20:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0375.html

