Bug 671575 - [TOTEM] FAILED TO RECEIVE + aisexec crash
Summary: [TOTEM] FAILED TO RECEIVE + aisexec crash
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: openais
Version: 5.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Assignee: Jan Friesse
QA Contact: Cluster QE
URL:
Whiteboard:
Duplicates: 631890 706050 735287
Depends On:
Blocks: 807971
 
Reported: 2011-01-21 21:50 UTC by Jaroslav Kortus
Modified: 2013-01-08 07:03 UTC (History)

Fixed In Version: openais-0.80.6-37.el5
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-01-08 07:03:18 UTC
Target Upstream Version:
Embargoed:


Attachments
aisexec core file (633.13 KB, application/x-gzip)
2011-01-21 21:51 UTC, Jaroslav Kortus
patch which may fix the problem (1.05 KB, patch)
2011-02-07 20:17 UTC, Steven Dake
Backported flatiron patch to increase failed to recv constant (1.85 KB, patch)
2012-04-05 12:24 UTC, Jan Friesse


Links
System: Red Hat Product Errata
ID: RHBA-2013:0013
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: openais bug fix update
Last Updated: 2013-01-07 15:30:06 UTC

Description Jaroslav Kortus 2011-01-21 21:50:09 UTC
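Description of problem:
aisexec aborts (signal 6) after logging "[TOTEM] FAILED TO RECEIVE" while the QUICKHIT (distributed I/O) test is running in a cluster.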

Version-Release number of selected component (if applicable):
openais-0.80.6-28.el5 (RHEL5.6)

How reproducible:
Intermittent; it can occur while running the QUICKHIT test (distributed I/O) in a cluster.

Steps to Reproduce:
1. Run the QUICKHIT test (d_io).
  
Actual results:
aisexec may crash (abort with signal 6).

Expected results:
aisexec does not crash.

Additional info:
Jan 21 16:25:41 hp-z200-06 openais[16359]: [TOTEM] Retransmit List: 12fc 12fd 12fe 12ff 1300 1301
Jan 21 16:25:41 hp-z200-06 openais[16359]: [TOTEM] Retransmit List: 1304 1305 1306
Jan 21 16:25:41 hp-z200-06 last message repeated 2 times
Jan 21 16:25:41 hp-z200-06 openais[16359]: [TOTEM] FAILED TO RECEIVE

Core was generated by `aisexec'.
Program terminated with signal 6, Aborted.

This bug is similar to the already reported bug 636583 (Fedora).

Comment 1 Jaroslav Kortus 2011-01-21 21:51:32 UTC
Created attachment 474700 [details]
aisexec core file

Comment 4 Steven Dake 2011-02-07 19:24:13 UTC
#0  0x00002b6f1814b265 in raise () from /lib64/libc.so.6
(gdb) where
#0  0x00002b6f1814b265 in raise () from /lib64/libc.so.6
#1  0x00002b6f1814cd10 in abort () from /lib64/libc.so.6
#2  0x00002b6f181446e6 in __assert_fail () from /lib64/libc.so.6
#3  0x000000000040db18 in memb_consensus_agreed (instance=0x2aaaaaaae010)
    at totemsrp.c:1110
#4  0x000000000040e0b5 in memb_join_process (instance=0x2aaaaaaae010, 
    memb_join=0x160d5998) at totemsrp.c:3767
#5  0x000000000040e35e in message_handler_memb_join (instance=0x2aaaaaaae010, 
    msg=0x160d5998, msg_len=<value optimized out>, endian_conversion_needed=-1)
    at totemsrp.c:4004
#6  0x0000000000409f5e in rrp_deliver_fn (context=0x1608a010, msg=0x160d5998, 
    msg_len=332) at totemrrp.c:1319
#7  0x00000000004084eb in net_deliver_fn (handle=<value optimized out>, 
    fd=<value optimized out>, revents=<value optimized out>, data=0x160d52f0)
    at totemnet.c:695
#8  0x0000000000405d00 in poll_run (handle=0) at aispoll.c:402
#9  0x00000000004188be in main (argc=<value optimized out>, 
    argv=<value optimized out>) at main.c:628
(gdb) up
#1  0x00002b6f1814cd10 in abort () from /lib64/libc.so.6
(gdb) up
#2  0x00002b6f181446e6 in __assert_fail () from /lib64/libc.so.6
(gdb) up
#3  0x000000000040db18 in memb_consensus_agreed (instance=0x2aaaaaaae010)
    at totemsrp.c:1110
1110		assert (token_memb_entries >= 1);
(gdb) print instance->my_proc_list
$1 = {{addr = {{nodeid = 2, family = 2, 
        addr = "\n\020@N\b\000\002\000\n\020@N\b\000\004"}, {nodeid = 0, 
        family = 0, addr = '\000' <repeats 15 times>}}}, {addr = {{nodeid = 1, 
        family = 2, addr = "\n\020@\233\b\000\002\000\n\020@\233\b\000\004"}, {
        nodeid = 0, family = 0, addr = '\000' <repeats 15 times>}}}, {addr = {{
        nodeid = 3, family = 2, 
        addr = "\n\020@\371\b\000\002\000\n\020@\371\b\000\004"}, {nodeid = 0, 
        family = 0, addr = '\000' <repeats 15 times>}}}, {addr = {{nodeid = 0, 
        family = 0, addr = '\000' <repeats 15 times>}, {nodeid = 0, 
        family = 0, addr = '\000' <repeats 15 times>}}} <repeats 381 times>}
(gdb) print instance->my_proc_list_entries
$2 = 3
(gdb) print instance->my_failed_list_entries
$3 = 3
(gdb) print m_proc_list[0]
No symbol "m_proc_list" in current context.
(gdb) print my_proc_list0]
No symbol "my_proc_list0" in current context.
(gdb) print instance->my_proc_list[0]
$4 = {addr = {{nodeid = 2, family = 2, 
      addr = "\n\020@N\b\000\002\000\n\020@N\b\000\004"}, {nodeid = 0, 
      family = 0, addr = '\000' <repeats 15 times>}}}
(gdb) print instance->my_proc_list[1]
$5 = {addr = {{nodeid = 1, family = 2, 
      addr = "\n\020@\233\b\000\002\000\n\020@\233\b\000\004"}, {nodeid = 0, 
      family = 0, addr = '\000' <repeats 15 times>}}}
(gdb) print instance->my_proc_list[2]
$6 = {addr = {{nodeid = 3, family = 2, 
      addr = "\n\020@\371\b\000\002\000\n\020@\371\b\000\004"}, {nodeid = 0, 
      family = 0, addr = '\000' <repeats 15 times>}}}
(gdb) print instance->my_failed_list[0]
$7 = {addr = {{nodeid = 2, family = 2, 
      addr = "\n\020@N\b\000\002\000\n\020@N\b\000\004"}, {nodeid = 0, 
      family = 0, addr = '\000' <repeats 15 times>}}}
(gdb) print instance->my_failed_list[1]
$8 = {addr = {{nodeid = 1, family = 2, 
      addr = "\n\020@\233\b\000\002\000\n\020@\233\b\000\004"}, {nodeid = 0, 
      family = 0, addr = '\000' <repeats 15 times>}}}
(gdb) print instance->my_failed_list[2]
$9 = {addr = {{nodeid = 3, family = 2, 
      addr = "\n\020@\371\b\000\002\000\n\020@\371\b\000\004"}, {nodeid = 0, 
      family = 0, addr = '\000' <repeats 15 times>}}}
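
The dump above shows my_proc_list and my_failed_list holding the same three nodes (2, 1, 3), with both entry counts equal to 3. A minimal sketch of why that state would trip the assertion at totemsrp.c:1110, assuming token_memb is computed as my_proc_list minus my_failed_list (that is an assumption about the surrounding code, not something the backtrace proves). The names node_addr, set_subtract and token_memb_entries_from_core are hypothetical and only model the nodeid field:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the per-node entries seen in
 * the gdb dump; only the nodeid field is modelled here. */
struct node_addr {
	unsigned int nodeid;
};

/* Copy into out[] every entry of full[] that does not appear in sub[].
 * Returns the number of entries written. */
int set_subtract(struct node_addr *out,
	const struct node_addr *full, int full_entries,
	const struct node_addr *sub, int sub_entries)
{
	int out_entries = 0;
	int i, j;

	assert(out != NULL && full != NULL && sub != NULL);
	for (i = 0; i < full_entries; i++) {
		int found = 0;

		for (j = 0; j < sub_entries; j++) {
			if (full[i].nodeid == sub[j].nodeid) {
				found = 1;
				break;
			}
		}
		if (!found) {
			out[out_entries++] = full[i];
		}
	}
	return out_entries;
}

/* Rebuild the state from the core file: both lists hold nodes 2, 1, 3.
 * The return value plays the role of token_memb_entries. */
int token_memb_entries_from_core(void)
{
	const struct node_addr proc_list[3] = { {2}, {1}, {3} };
	const struct node_addr failed_list[3] = { {2}, {1}, {3} };
	struct node_addr token_memb[3];

	return set_subtract(token_memb, proc_list, 3, failed_list, 3);
}
```

Under that assumption the subtraction leaves no members, so a check of the form assert (token_memb_entries >= 1) fails. Note that the failed list includes every node in the processor list, which can only happen if the local node has been placed in its own failed list.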

Comment 5 Steven Dake 2011-02-07 19:25:21 UTC
Jaroslav, 

How were you getting retransmits?  Were you running with a packet-loss queueing discipline (qdisc)?

Comment 6 Steven Dake 2011-02-07 20:16:44 UTC
Honza,

Please try to duplicate the issue, then try the attached patch to see if it resolves the problem.

Comment 7 Steven Dake 2011-02-07 20:17:14 UTC
Created attachment 477494 [details]
patch which may fix the problem

Comment 9 Jan Friesse 2011-02-09 14:46:14 UTC
Steve,
here is a very reliable method to reproduce this bug (the same works for the corosync version of the problem), using two nodes (A, B):

Node A - run corosync
Node B - run corosync + cpgverify

Node A (iptables rules must be empty and the default policy set to ACCEPT):

iptables -A INPUT -i eth0 -p udp -m limit --limit 10000/s --limit-burst 1 -j ACCEPT 
iptables -A INPUT -i eth0 -p udp -j DROP

The problem appears within at most 5 minutes of testing.

Your patch doesn't help.
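
The two iptables rules above accept UDP only while the rate limit allows it and drop everything else, so bursts of totem traffic lose packets. A rough sketch of that effect using an idealized token bucket with the same parameters (10000/s, burst 1). This is a simplified model and not the kernel's limit match, which keeps time in timer ticks and may therefore behave differently. All names here (token_bucket, bucket_init, bucket_accept, count_accepted) are hypothetical:

```c
#include <assert.h>

/* Idealized token bucket: holds at most `burst` tokens and earns one
 * token every `interval_us` microseconds. */
struct token_bucket {
	long long interval_us;	/* time to earn one token (100 for 10000/s) */
	long long burst;	/* bucket capacity, in tokens */
	long long credit_us;	/* accumulated credit, in microseconds */
	long long last_us;	/* arrival time of the previous packet */
};

void bucket_init(struct token_bucket *b, long long rate_per_sec,
	long long burst)
{
	assert(rate_per_sec > 0 && burst > 0);
	b->interval_us = 1000000LL / rate_per_sec;
	b->burst = burst;
	b->credit_us = b->interval_us * burst;	/* bucket starts full */
	b->last_us = 0;
}

/* Returns 1 if a packet arriving at now_us is accepted, 0 if dropped. */
int bucket_accept(struct token_bucket *b, long long now_us)
{
	long long cap = b->interval_us * b->burst;

	/* Earn credit for the elapsed time, capped at the bucket size. */
	b->credit_us += now_us - b->last_us;
	b->last_us = now_us;
	if (b->credit_us > cap) {
		b->credit_us = cap;
	}
	/* Spend one token if available, otherwise drop the packet. */
	if (b->credit_us >= b->interval_us) {
		b->credit_us -= b->interval_us;
		return 1;
	}
	return 0;
}

/* Count how many of n packets, spaced gap_us apart, are accepted. */
int count_accepted(long long rate_per_sec, long long burst,
	int n, long long gap_us)
{
	struct token_bucket b;
	int i;
	int accepted = 0;

	bucket_init(&b, rate_per_sec, burst);
	for (i = 0; i < n; i++) {
		accepted += bucket_accept(&b, (long long)i * gap_us);
	}
	return accepted;
}
```

In this model, count_accepted(10000, 1, 10, 10) accepts only the first of ten packets sent 10 microseconds apart, while the same ten packets spaced 100 microseconds apart are all accepted. With a burst of 1 there is no headroom, so any back-to-back multicast traffic is mostly dropped.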

Comment 10 Steven Dake 2011-02-09 19:01:16 UTC
Honza,

Great work on generating a test case!  I'll work on sorting out a solution.

Comment 11 Jan Friesse 2011-02-21 08:39:07 UTC
*** Bug 631890 has been marked as a duplicate of this bug. ***

Comment 12 Steven Dake 2011-03-15 17:01:51 UTC
One workaround to this problem for corosync users is to raise seqno_unchanged_const from its default to some larger value (such as 5000).  The default of 30 is too low.

Comment 14 Jan Friesse 2011-05-19 12:00:41 UTC
*** Bug 706050 has been marked as a duplicate of this bug. ***

Comment 20 Steven Dake 2011-09-02 17:04:22 UTC
*** Bug 735287 has been marked as a duplicate of this bug. ***

Comment 23 RHEL Program Management 2012-04-02 10:47:22 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux release.  Product Management has
requested further review of this request by Red Hat Engineering, for
potential inclusion in a Red Hat Enterprise Linux release for currently
deployed products.  This request is not yet committed for inclusion in
a release.

Comment 24 Jan Friesse 2012-04-05 12:24:56 UTC
Created attachment 575387 [details]
Backported flatiron patch to increase failed to recv constant

Comment 34 errata-xmlrpc 2013-01-08 07:03:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-0013.html

