Description of problem:
during

Version-Release number of selected component (if applicable):
openais-0.80.6-28.el5 (RHEL5.6)

How reproducible:
possible running QUICKHIT test (distributed I/O) in cluster

Steps to Reproduce:
1. run QUICKHIT test (d_io)

Actual results:
possible crash of aisexec

Expected results:
no crashing

Additional info:
Jan 21 16:25:41 hp-z200-06 openais[16359]: [TOTEM] Retransmit List: 12fc 12fd 12fe 12ff 1300 1301
Jan 21 16:25:41 hp-z200-06 openais[16359]: [TOTEM] Retransmit List: 1304 1305 1306
Jan 21 16:25:41 hp-z200-06 last message repeated 2 times
Jan 21 16:25:41 hp-z200-06 openais[16359]: [TOTEM] FAILED TO RECEIVE

Core was generated by `aisexec'.
Program terminated with signal 6, Aborted.

Bug similar to already reported 636583 (Fedora)
Created attachment 474700 [details] aisexec core file
#0 0x00002b6f1814b265 in raise () from /lib64/libc.so.6
(gdb) where
#0 0x00002b6f1814b265 in raise () from /lib64/libc.so.6
#1 0x00002b6f1814cd10 in abort () from /lib64/libc.so.6
#2 0x00002b6f181446e6 in __assert_fail () from /lib64/libc.so.6
#3 0x000000000040db18 in memb_consensus_agreed (instance=0x2aaaaaaae010) at totemsrp.c:1110
#4 0x000000000040e0b5 in memb_join_process (instance=0x2aaaaaaae010, memb_join=0x160d5998) at totemsrp.c:3767
#5 0x000000000040e35e in message_handler_memb_join (instance=0x2aaaaaaae010, msg=0x160d5998, msg_len=<value optimized out>, endian_conversion_needed=-1) at totemsrp.c:4004
#6 0x0000000000409f5e in rrp_deliver_fn (context=0x1608a010, msg=0x160d5998, msg_len=332) at totemrrp.c:1319
#7 0x00000000004084eb in net_deliver_fn (handle=<value optimized out>, fd=<value optimized out>, revents=<value optimized out>, data=0x160d52f0) at totemnet.c:695
#8 0x0000000000405d00 in poll_run (handle=0) at aispoll.c:402
#9 0x00000000004188be in main (argc=<value optimized out>, argv=<value optimized out>) at main.c:628
(gdb) up
#1 0x00002b6f1814cd10 in abort () from /lib64/libc.so.6
(gdb) up
#2 0x00002b6f181446e6 in __assert_fail () from /lib64/libc.so.6
(gdb) up
#3 0x000000000040db18 in memb_consensus_agreed (instance=0x2aaaaaaae010) at totemsrp.c:1110
1110 assert (token_memb_entries >= 1);
(gdb) print instance->my_proc_list
$1 = {{addr = {{nodeid = 2, family = 2, addr = "\n\020@N\b\000\002\000\n\020@N\b\000\004"}, {nodeid = 0, family = 0, addr = '\000' <repeats 15 times>}}}, {addr = {{nodeid = 1, family = 2, addr = "\n\020@\233\b\000\002\000\n\020@\233\b\000\004"}, {nodeid = 0, family = 0, addr = '\000' <repeats 15 times>}}}, {addr = {{nodeid = 3, family = 2, addr = "\n\020@\371\b\000\002\000\n\020@\371\b\000\004"}, {nodeid = 0, family = 0, addr = '\000' <repeats 15 times>}}}, {addr = {{nodeid = 0, family = 0, addr = '\000' <repeats 15 times>}, {nodeid = 0, family = 0, addr = '\000' <repeats 15 times>}}} <repeats 381 times>}
(gdb) print instance->my_proc_list_entries
$2 = 3
(gdb) print instance->my_failed_list_entries
$3 = 3
(gdb) print m_proc_list[0]
No symbol "m_proc_list" in current context.
(gdb) print my_proc_list0]
No symbol "my_proc_list0" in current context.
(gdb) print instance->my_proc_list[0]
$4 = {addr = {{nodeid = 2, family = 2, addr = "\n\020@N\b\000\002\000\n\020@N\b\000\004"}, {nodeid = 0, family = 0, addr = '\000' <repeats 15 times>}}}
(gdb) print instance->my_proc_list[1]
$5 = {addr = {{nodeid = 1, family = 2, addr = "\n\020@\233\b\000\002\000\n\020@\233\b\000\004"}, {nodeid = 0, family = 0, addr = '\000' <repeats 15 times>}}}
(gdb) print instance->my_proc_list[2]
$6 = {addr = {{nodeid = 3, family = 2, addr = "\n\020@\371\b\000\002\000\n\020@\371\b\000\004"}, {nodeid = 0, family = 0, addr = '\000' <repeats 15 times>}}}
(gdb) print instance->my_failed_list[0]
$7 = {addr = {{nodeid = 2, family = 2, addr = "\n\020@N\b\000\002\000\n\020@N\b\000\004"}, {nodeid = 0, family = 0, addr = '\000' <repeats 15 times>}}}
(gdb) print instance->my_failed_list[1]
$8 = {addr = {{nodeid = 1, family = 2, addr = "\n\020@\233\b\000\002\000\n\020@\233\b\000\004"}, {nodeid = 0, family = 0, addr = '\000' <repeats 15 times>}}}
(gdb) print instance->my_failed_list[2]
$9 = {addr = {{nodeid = 3, family = 2, addr = "\n\020@\371\b\000\002\000\n\020@\371\b\000\004"}, {nodeid = 0, family = 0, addr = '\000' <repeats 15 times>}}}
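The gdb output shows that my_proc_list and my_failed_list both hold the same three node ids (2, 1, 3). The following is only a sketch of why that state would trip the assert at totemsrp.c:1110. It assumes token_memb_entries is the number of processors left after removing the failed list from the processor list; that is inferred from the variable names in the session, not copied from totemsrp.c. The function names and the use of bare node ids instead of the full address structs are hypothetical simplifications.

```c
#include <assert.h>

/*
 * Hypothetical stand-in for the set subtraction assumed to run before the
 * assert: copy into out[] every node id from full[] that is not in sub[].
 * Returns the number of ids written to out[].
 */
static int set_subtract_sketch(const unsigned int *full, int full_n,
                               const unsigned int *sub, int sub_n,
                               unsigned int *out)
{
    int out_n = 0;

    for (int i = 0; i < full_n; i++) {
        int found = 0;

        for (int j = 0; j < sub_n; j++) {
            if (full[i] == sub[j]) {
                found = 1;
                break;
            }
        }
        if (!found) {
            out[out_n++] = full[i];
        }
    }
    return out_n;
}

/*
 * Returns 1 if the condition checked at totemsrp.c:1110
 * (token_memb_entries >= 1) would hold for the given lists, 0 otherwise.
 * out[] must have room for proc_n entries.
 */
static int consensus_precondition_holds(const unsigned int *proc, int proc_n,
                                        const unsigned int *failed, int failed_n,
                                        unsigned int *out)
{
    int token_memb_entries =
        set_subtract_sketch(proc, proc_n, failed, failed_n, out);

    return token_memb_entries >= 1;
}
```

With the lists from the core file (proc = {2, 1, 3}, failed = {2, 1, 3}) the subtraction leaves nothing, so the condition is false. Under this reading, the node has placed every processor, itself included, on its failed list.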
Jaroslav, how were you getting retransmits? Were you running with a packet-loss queue discipline (qdisc)?
Honza, please try to reproduce the issue, then try the attached patch to see whether it resolves the problem.
Created attachment 477494 [details] patch which may fix the problem
Steve,
here is a very reliable method to reproduce this bug (the same works for the corosync version of the problem):

Two nodes (A, B):
Node A - run corosync
Node B - run corosync + cpgverify

On node A (iptables rules must be empty and the default policy must be ACCEPT):
iptables -A INPUT -i eth0 -p udp -m limit --limit 10000/s --limit-burst 1 -j ACCEPT
iptables -A INPUT -i eth0 -p udp -j DROP

The problem appears within 5 minutes of testing at most.

Your patch doesn't help.
Honza, Great work on generating a test case! I'll work on sorting out a solution.
*** Bug 631890 has been marked as a duplicate of this bug. ***
One workaround to this problem for corosync users is to raise seqno_unchanged_const from its default of 30, which is too low, to some larger value (such as 5000).
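As a rough sketch of that workaround, assuming the option meant here is the totem setting documented in corosync.conf(5) as seqno_unchanged_const, the change would go in the totem section of corosync.conf. The value 5000 is the one suggested in the comment above; everything else in the section is assumed to stay as already configured.

```
totem {
	# existing totem settings stay as they are
	# workaround from this report: raise the value from the default of 30
	seqno_unchanged_const: 5000
}
```

The setting needs to be identical on every node, and corosync must be restarted for it to take effect.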
*** Bug 706050 has been marked as a duplicate of this bug. ***
*** Bug 735287 has been marked as a duplicate of this bug. ***
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux release for currently deployed products. This request is not yet committed for inclusion in a release.
Created attachment 575387 [details] Backported flatiron patch to increase failed to recv constant
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-0013.html