Bug 671575

Summary: [TOTEM] FAILED TO RECEIVE + aisexec crash
Product: Red Hat Enterprise Linux 5
Reporter: Jaroslav Kortus <jkortus>
Component: openais
Assignee: Jan Friesse <jfriesse>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: high
Priority: high
Version: 5.6
CC: amoralej, cluster-maint, djansa, edamato, jfriesse, pematous, redhat-bugzilla, sdake, spam, uwe.knop
Target Milestone: rc
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openais-0.80.6-37.el5
Doc Type: Bug Fix
Last Closed: 2013-01-08 07:03:18 UTC
Bug Blocks: 807971    
Attachments:
aisexec core file (flags: none)
patch which may fix the problem (flags: none)
Backported flatiron patch to increase failed to recv constant (flags: none)

Description Jaroslav Kortus 2011-01-21 21:50:09 UTC
Description of problem:
During the QUICKHIT distributed I/O test in a cluster, openais logged [TOTEM] FAILED TO RECEIVE and aisexec crashed (aborted on a failed assertion in totemsrp.c).

Version-Release number of selected component (if applicable):
openais-0.80.6-28.el5 (RHEL5.6)

How reproducible:
Occasional; seen while running the QUICKHIT test (distributed I/O) in a cluster.

Steps to Reproduce:
1. Run the QUICKHIT test (d_io) in the cluster.

Actual results:
aisexec may crash (SIGABRT from a failed assertion).

Expected results:
aisexec should not crash.

Additional info:
Jan 21 16:25:41 hp-z200-06 openais[16359]: [TOTEM] Retransmit List: 12fc 12fd 12fe 12ff 1300 1301
Jan 21 16:25:41 hp-z200-06 openais[16359]: [TOTEM] Retransmit List: 1304 1305 1306
Jan 21 16:25:41 hp-z200-06 last message repeated 2 times
Jan 21 16:25:41 hp-z200-06 openais[16359]: [TOTEM] FAILED TO RECEIVE

Core was generated by `aisexec'.
Program terminated with signal 6, Aborted.

Similar to bug 636583, already reported against Fedora.

Comment 1 Jaroslav Kortus 2011-01-21 21:51:32 UTC
Created attachment 474700 [details]
aisexec core file

Comment 4 Steven Dake 2011-02-07 19:24:13 UTC
#0  0x00002b6f1814b265 in raise () from /lib64/libc.so.6
(gdb) where
#0  0x00002b6f1814b265 in raise () from /lib64/libc.so.6
#1  0x00002b6f1814cd10 in abort () from /lib64/libc.so.6
#2  0x00002b6f181446e6 in __assert_fail () from /lib64/libc.so.6
#3  0x000000000040db18 in memb_consensus_agreed (instance=0x2aaaaaaae010)
    at totemsrp.c:1110
#4  0x000000000040e0b5 in memb_join_process (instance=0x2aaaaaaae010, 
    memb_join=0x160d5998) at totemsrp.c:3767
#5  0x000000000040e35e in message_handler_memb_join (instance=0x2aaaaaaae010, 
    msg=0x160d5998, msg_len=<value optimized out>, endian_conversion_needed=-1)
    at totemsrp.c:4004
#6  0x0000000000409f5e in rrp_deliver_fn (context=0x1608a010, msg=0x160d5998, 
    msg_len=332) at totemrrp.c:1319
#7  0x00000000004084eb in net_deliver_fn (handle=<value optimized out>, 
    fd=<value optimized out>, revents=<value optimized out>, data=0x160d52f0)
    at totemnet.c:695
#8  0x0000000000405d00 in poll_run (handle=0) at aispoll.c:402
#9  0x00000000004188be in main (argc=<value optimized out>, 
    argv=<value optimized out>) at main.c:628
(gdb) up
#1  0x00002b6f1814cd10 in abort () from /lib64/libc.so.6
(gdb) up
#2  0x00002b6f181446e6 in __assert_fail () from /lib64/libc.so.6
(gdb) up
#3  0x000000000040db18 in memb_consensus_agreed (instance=0x2aaaaaaae010)
    at totemsrp.c:1110
1110		assert (token_memb_entries >= 1);
(gdb) print instance->my_proc_list
$1 = {{addr = {{nodeid = 2, family = 2, 
        addr = "\n\020@N\b\000\002\000\n\020@N\b\000\004"}, {nodeid = 0, 
        family = 0, addr = '\000' <repeats 15 times>}}}, {addr = {{nodeid = 1, 
        family = 2, addr = "\n\020@\233\b\000\002\000\n\020@\233\b\000\004"}, {
        nodeid = 0, family = 0, addr = '\000' <repeats 15 times>}}}, {addr = {{
        nodeid = 3, family = 2, 
        addr = "\n\020@\371\b\000\002\000\n\020@\371\b\000\004"}, {nodeid = 0, 
        family = 0, addr = '\000' <repeats 15 times>}}}, {addr = {{nodeid = 0, 
        family = 0, addr = '\000' <repeats 15 times>}, {nodeid = 0, 
        family = 0, addr = '\000' <repeats 15 times>}}} <repeats 381 times>}
(gdb) print instance->my_proc_list_entries
$2 = 3
(gdb) print instance->my_failed_list_entries
$3 = 3
(gdb) print m_proc_list[0]
No symbol "m_proc_list" in current context.
(gdb) print my_proc_list0]
No symbol "my_proc_list0" in current context.
(gdb) print instance->my_proc_list[0]
$4 = {addr = {{nodeid = 2, family = 2, 
      addr = "\n\020@N\b\000\002\000\n\020@N\b\000\004"}, {nodeid = 0, 
      family = 0, addr = '\000' <repeats 15 times>}}}
(gdb) print instance->my_proc_list[1]
$5 = {addr = {{nodeid = 1, family = 2, 
      addr = "\n\020@\233\b\000\002\000\n\020@\233\b\000\004"}, {nodeid = 0, 
      family = 0, addr = '\000' <repeats 15 times>}}}
(gdb) print instance->my_proc_list[2]
$6 = {addr = {{nodeid = 3, family = 2, 
      addr = "\n\020@\371\b\000\002\000\n\020@\371\b\000\004"}, {nodeid = 0, 
      family = 0, addr = '\000' <repeats 15 times>}}}
(gdb) print instance->my_failed_list[0]
$7 = {addr = {{nodeid = 2, family = 2, 
      addr = "\n\020@N\b\000\002\000\n\020@N\b\000\004"}, {nodeid = 0, 
      family = 0, addr = '\000' <repeats 15 times>}}}
(gdb) print instance->my_failed_list[1]
$8 = {addr = {{nodeid = 1, family = 2, 
      addr = "\n\020@\233\b\000\002\000\n\020@\233\b\000\004"}, {nodeid = 0, 
      family = 0, addr = '\000' <repeats 15 times>}}}
(gdb) print instance->my_failed_list[2]
$9 = {addr = {{nodeid = 3, family = 2, 
      addr = "\n\020@\371\b\000\002\000\n\020@\371\b\000\004"}, {nodeid = 0, 
      family = 0, addr = '\000' <repeats 15 times>}}}
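
The values above explain the abort: my_proc_list and my_failed_list contain the same three nodes, so when the failed set is subtracted from the process set nothing is left and assert (token_memb_entries >= 1) at totemsrp.c:1110 fires. Below is a minimal, self-contained sketch of that subtraction with node addresses reduced to plain node IDs; it models the logic, it is not the actual totemsrp.c code.

/* Simplified model only -- not the actual totemsrp.c code. */
#include <assert.h>
#include <stdio.h>

#define PROCESSOR_COUNT_MAX 384

/* Copy every entry of 'full' that does not appear in 'subtract' into 'out'. */
static void memb_set_subtract (unsigned int *out, int *out_entries,
	const unsigned int *full, int full_entries,
	const unsigned int *subtract, int subtract_entries)
{
	int i, j, found;

	*out_entries = 0;
	for (i = 0; i < full_entries; i++) {
		found = 0;
		for (j = 0; j < subtract_entries; j++) {
			if (full[i] == subtract[j]) {
				found = 1;
				break;
			}
		}
		if (!found) {
			out[(*out_entries)++] = full[i];
		}
	}
}

int main (void)
{
	/* Node IDs taken from the core file: both lists hold nodes 2, 1, 3. */
	unsigned int my_proc_list[]   = { 2, 1, 3 };
	unsigned int my_failed_list[] = { 2, 1, 3 };
	unsigned int token_memb[PROCESSOR_COUNT_MAX];
	int token_memb_entries;

	memb_set_subtract (token_memb, &token_memb_entries,
		my_proc_list, 3, my_failed_list, 3);

	printf ("token_memb_entries = %d\n", token_memb_entries);
	assert (token_memb_entries >= 1);	/* aborts, as at totemsrp.c:1110 */
	return 0;
}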

Comment 5 Steven Dake 2011-02-07 19:25:21 UTC
Jaroslav, 

How were you getting retransmits? Were you running with a packet-loss queue discipline?

Comment 6 Steven Dake 2011-02-07 20:16:44 UTC
Honza,

Please try to reproduce the issue, then apply the attached patch to see if it resolves the problem.

Comment 7 Steven Dake 2011-02-07 20:17:14 UTC
Created attachment 477494 [details]
patch which may fix the problem

Comment 9 Jan Friesse 2011-02-09 14:46:14 UTC
Steve,
A very reliable method to reproduce this bug (the same works for the corosync version of the problem):
Two nodes (A, B)

Node A - run corosync
Node B - run corosync + cpgverify

Node A:
(the iptables rule set must be empty and the default policy must be ACCEPT)

iptables -A INPUT -i eth0 -p udp -m limit --limit 10000/s --limit-burst 1 -j ACCEPT 
iptables -A INPUT -i eth0 -p udp -j DROP

These rules drop inbound UDP packets that exceed the rate limit, simulating packet loss. The problem appears after at most 5 minutes of testing.

Your patch doesn't help.

Comment 10 Steven Dake 2011-02-09 19:01:16 UTC
Honza,

Great work on generating a test case!  I'll work on sorting out a solution.

Comment 11 Jan Friesse 2011-02-21 08:39:07 UTC
*** Bug 631890 has been marked as a duplicate of this bug. ***

Comment 12 Steven Dake 2011-03-15 17:01:51 UTC
One workaround to this problem for corosync users is to set seqno_unchanged_const to a larger value (such as 5000); the default of 30 is too low.
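
For reference, a sketch of how that workaround could look in the totem section of corosync.conf. Only the seqno_unchanged_const line is the workaround itself; every other value below is illustrative and not taken from any node in this report.

totem {
	version: 2
	secauth: off
	# default is 30 token rotations; raised per the workaround above
	seqno_unchanged_const: 5000
	interface {
		ringnumber: 0
		bindnetaddr: 192.168.1.0
		mcastaddr: 239.255.1.1
		mcastport: 5405
	}
}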

Comment 14 Jan Friesse 2011-05-19 12:00:41 UTC
*** Bug 706050 has been marked as a duplicate of this bug. ***

Comment 20 Steven Dake 2011-09-02 17:04:22 UTC
*** Bug 735287 has been marked as a duplicate of this bug. ***

Comment 23 RHEL Program Management 2012-04-02 10:47:22 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux release.  Product Management has
requested further review of this request by Red Hat Engineering, for
potential inclusion in a Red Hat Enterprise Linux release for currently
deployed products.  This request is not yet committed for inclusion in
a release.

Comment 24 Jan Friesse 2012-04-05 12:24:56 UTC
Created attachment 575387 [details]
Backported flatiron patch to increase failed to recv constant
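
The attachment itself is not reproduced here. As a rough illustration only, this kind of change usually amounts to raising a compile-time default for the fail-to-receive constant in the totem configuration code, so that many more token rotations without received messages are tolerated before the FAILED TO RECEIVE path is entered. The macro name and both values below are assumptions, not taken from the backport:

/* Illustration only -- macro name and values are assumptions, not the patch. */
-#define FAIL_TO_RECV_CONST	50
+#define FAIL_TO_RECV_CONST	2500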

Comment 34 errata-xmlrpc 2013-01-08 07:03:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-0013.html