Bug 688691

Summary: assertion failed "range < 16384" during corosync shutdown
Product: Red Hat Enterprise Linux 6
Reporter: Nate Straz <nstraz>
Component: corosync
Assignee: Steven Dake <sdake>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: urgent
Priority: urgent
Version: 6.1
CC: cluster-maint, jwest
Target Milestone: rc
Keywords: ZStream
Target Release: ---
Hardware: All
OS: Linux
Fixed In Version: corosync-1.2.3-32.el6
Doc Type: Bug Fix
Last Closed: 2011-05-19 10:24:34 EDT
Bug Blocks: 696732
Attachments:
Description                          Flags
core dump                            none
patch to resolve problem             none
additional patch to resolve problem  none

Description Nate Straz 2011-03-17 14:03:44 EDT
Created attachment 486080 [details]
core dump

Description of problem:

While running our whiplash test (which starts and stops the cluster in a loop), corosync on one node died with an assertion failure:

(gdb) bt
#0  0x0000003d876329e5 in raise (sig=6)
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003d876341c5 in abort () at abort.c:92
#2  0x0000003d8762b975 in __assert_fail (
    assertion=0x3d89222be2 "range < 16384", file=<value optimized out>,
    line=2488, function=<value optimized out>) at assert.c:81
#3  0x0000003d89212c61 in orf_token_rtr (instance=0x7f0d34171010,
    orf_token=0x7fffdefe4fb0, fcc_allowed=0x7fffdefe49cc) at totemsrp.c:2488
#4  0x0000003d89217b88 in message_handler_orf_token (instance=0x7f0d34171010,
    msg=<value optimized out>, msg_len=<value optimized out>,
    endian_conversion_needed=<value optimized out>) at totemsrp.c:3481
#5  0x0000003d8920e6c3 in rrp_deliver_fn (context=0x23a4600, msg=0x23ee86c,
    msg_len=71) at totemrrp.c:1500
#6  0x0000003d8920b136 in net_deliver_fn (handle=<value optimized out>,
    fd=<value optimized out>, revents=<value optimized out>, data=0x23ee1a0)
    at totemudp.c:1244
#7  0x0000003d8920709a in poll_run (handle=8117261320677490688)
    at coropoll.c:435
#8  0x0000000000406c6e in main (argc=<value optimized out>,
    argv=<value optimized out>, envp=<value optimized out>) at main.c:1816
(gdb) quit


Version-Release number of selected component (if applicable):
corosync-1.2.3-29.el6

How reproducible:
Unknown
 
Comment 2 Steven Dake 2011-03-18 21:23:04 EDT
A highly reproducible test case is as follows:

Use 5 nodes, numbered 1 through 5. Start node 5. Then start nodes 1-4 simultaneously, wait 10 seconds, then stop nodes 1-4 simultaneously. Keep repeating; the failure usually happens within 5-10 runs.
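
The loop described above can be sketched as a small driver script. This is only an illustration, not the whiplash test itself: the hostnames `node1` through `node5` are placeholders, and reaching each node over `ssh` and using `service corosync start|stop` are assumptions about the test environment.

```python
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical hostnames; substitute the real cluster nodes.
STABLE_NODE = "node5"
CYCLED_NODES = ["node1", "node2", "node3", "node4"]


def service_cmd(node, action):
    """Build the ssh command that starts or stops corosync on one node."""
    return ["ssh", node, "service", "corosync", action]


def run_on_all(nodes, action, runner=subprocess.call):
    """Issue the same action on every node at roughly the same time."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        # Leaving the "with" block waits for every command to return.
        return list(pool.map(lambda n: runner(service_cmd(n, action)), nodes))


def whiplash(iterations=10, settle=10, runner=subprocess.call, sleep=time.sleep):
    """Start node 5 once, then repeatedly start and stop nodes 1-4 together."""
    runner(service_cmd(STABLE_NODE, "start"))
    for _ in range(iterations):
        run_on_all(CYCLED_NODES, "start", runner)
        sleep(settle)  # the 10 second wait from the test case
        run_on_all(CYCLED_NODES, "stop", runner)
```

Calling `whiplash()` runs 10 iterations with the 10 second wait. The `runner` and `sleep` parameters exist only so the sequence can be dry-run without touching a cluster.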
Comment 3 Steven Dake 2011-03-18 21:31:22 EDT
The resolution for bug #623176 exposed this problem.

Nodes 3 and 4 are stopped (by random stopping).
Nodes 1, 2, and 5 form a new configuration, and during recovery nodes 1 and 2 are stopped (via service cman stop or service corosync stop). As a result, node 5 never gets a recovery token within the timeout period, which triggers a token loss in recovery. Bug #623176 resolved an assert that happened because the full ring id was being restored. The resolution to bug #623176 was to not restore the full ring id and instead operate, according to the specification, with the new ring id. Unfortunately, this exposes a problem whereby restarting nodes 1-4 generates the same ring id. This ring id reaches node 5, whose recovery failed, and triggers a condition not accounted for in the original totem specification.

It appears later work from Dr. Agarwal's PhD dissertation considers this scenario. I have attached a patch which implements her solution, which is essentially to reject the regular token in these conditions.
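
To illustrate why an unexpected token can trip "range < 16384", here is a minimal sketch of the arithmetic. It assumes the checked value is the 32-bit unsigned difference between the token's sequence number and the node's local all-received-up-to mark; that assumption, and every name below, is illustrative and not taken from this report. The sketch does not reproduce the attached patch. It only shows that a token whose sequence number is behind the local state yields a huge range, which is why such a token has to be rejected before the check is reached.

```python
# Bound taken from the assertion text "range < 16384".
QUEUE_RTR_ITEMS_SIZE_MAX = 16384
UINT32_MASK = 0xFFFFFFFF


def retransmit_range(token_seq, my_aru):
    """Assumed check input: (token_seq - my_aru) as a 32-bit unsigned value."""
    return (token_seq - my_aru) & UINT32_MASK


def token_is_acceptable(token_seq, my_aru):
    """True when the token would satisfy the asserted bound."""
    return retransmit_range(token_seq, my_aru) < QUEUE_RTR_ITEMS_SIZE_MAX


# A token slightly ahead of local state gives a small range.
in_order = retransmit_range(105, 100)

# A token from freshly restarted nodes that reuse the ring id may carry a
# sequence number behind local state; the unsigned subtraction wraps around.
stale = retransmit_range(3, 100)
```

Here `in_order` is 5, while `stale` wraps to a value close to 2^32, far outside the asserted bound.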
Comment 4 Steven Dake 2011-03-19 13:46:06 EDT
Created attachment 486396 [details]
patch to resolve problem
Comment 5 Steven Dake 2011-03-19 13:47:14 EDT
Passed 3000 iterations of the restart scenario, which previously failed in under 5.
Comment 7 Nate Straz 2011-03-22 13:48:47 EDT
I'm still hitting this with the new rpm during the whiplash test.

[root@buzz-05 corosync]# rpm -q corosync
corosync-1.2.3-31.el6.x86_64

(gdb) bt
#0  0x0000003d876329e5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003d876341c5 in abort () at abort.c:92
#2  0x0000003d8762b975 in __assert_fail (assertion=0x7f4a3d4cbb82 "range < 16384", file=<value optimized out>, line=2497, function=<value optimized out>) at assert.c:81
#3  0x00007f4a3d4bbbe1 in orf_token_rtr (instance=0x7f4a3a5ee010, orf_token=0x7fff0d222950, fcc_allowed=0x7fff0d22236c) at totemsrp.c:2497
#4  0x00007f4a3d4c0b47 in message_handler_orf_token (instance=0x7f4a3a5ee010, msg=<value optimized out>, msg_len=<value optimized out>,
    endian_conversion_needed=<value optimized out>) at totemsrp.c:3493
#5  0x00007f4a3d4b7643 in rrp_deliver_fn (context=0x901600, msg=0x94b86c, msg_len=71) at totemrrp.c:1500
#6  0x00007f4a3d4b4136 in net_deliver_fn (handle=<value optimized out>, fd=<value optimized out>, revents=<value optimized out>, data=0x94b1a0) at totemudp.c:1244
#7  0x00007f4a3d4b009a in poll_run (handle=8117261320677490688) at coropoll.c:435
#8  0x0000000000406c6e in main (argc=<value optimized out>, argv=<value optimized out>, envp=<value optimized out>) at main.c:1816
Comment 8 Steven Dake 2011-03-24 11:38:55 EDT
Created attachment 487377 [details]
additional patch to resolve problem
Comment 10 Nate Straz 2011-03-25 07:12:01 EDT
Made it through 500 iterations of whiplash with corosync-1.2.3-32.el6.
Comment 14 errata-xmlrpc 2011-05-19 10:24:34 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0764.html