Bug 688691 - assertion failed "range < 16384" during corosync shutdown
Summary: assertion failed "range < 16384" during corosync shutdown
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: corosync
Version: 6.1
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Assignee: Steven Dake
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks: 696732
 
Reported: 2011-03-17 18:03 UTC by Nate Straz
Modified: 2016-04-26 15:40 UTC
CC List: 2 users

Fixed In Version: corosync-1.2.3-32.el6
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-05-19 14:24:34 UTC
Target Upstream Version:


Attachments
core dump (159.59 KB, application/x-xz), 2011-03-17 18:03 UTC, Nate Straz
patch to resolve problem (4.80 KB, patch), 2011-03-19 17:46 UTC, Steven Dake
additional patch to resolve problem (9.55 KB, patch), 2011-03-24 15:38 UTC, Steven Dake


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2011:0764 0 normal SHIPPED_LIVE corosync bug fix update 2011-05-18 18:08:44 UTC

Description Nate Straz 2011-03-17 18:03:44 UTC
Created attachment 486080 [details]
core dump

Description of problem:

While running our whiplash test (which starts and stops the cluster in a loop), corosync on one node died with an assertion failure:

(gdb) bt
#0  0x0000003d876329e5 in raise (sig=6)
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003d876341c5 in abort () at abort.c:92
#2  0x0000003d8762b975 in __assert_fail (
    assertion=0x3d89222be2 "range < 16384", file=<value optimized out>,
    line=2488, function=<value optimized out>) at assert.c:81
#3  0x0000003d89212c61 in orf_token_rtr (instance=0x7f0d34171010,
    orf_token=0x7fffdefe4fb0, fcc_allowed=0x7fffdefe49cc) at totemsrp.c:2488
#4  0x0000003d89217b88 in message_handler_orf_token (instance=0x7f0d34171010,
    msg=<value optimized out>, msg_len=<value optimized out>,
    endian_conversion_needed=<value optimized out>) at totemsrp.c:3481
#5  0x0000003d8920e6c3 in rrp_deliver_fn (context=0x23a4600, msg=0x23ee86c,
    msg_len=71) at totemrrp.c:1500
#6  0x0000003d8920b136 in net_deliver_fn (handle=<value optimized out>,
    fd=<value optimized out>, revents=<value optimized out>, data=0x23ee1a0)
    at totemudp.c:1244
#7  0x0000003d8920709a in poll_run (handle=8117261320677490688)
    at coropoll.c:435
#8  0x0000000000406c6e in main (argc=<value optimized out>,
    argv=<value optimized out>, envp=<value optimized out>) at main.c:1816
(gdb) quit


Version-Release number of selected component (if applicable):
corosync-1.2.3-29.el6

How reproducible:
Unknown
 
Actual results:


Expected results:


Additional info:
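
For reference, a minimal sketch of the kind of check that fires here; the names are approximations of the code around totemsrp.c:2488, not the exact source:

#include <assert.h>
#include <stdio.h>

#define QUEUE_RTR_ITEMS_SIZE_MAX 16384  /* the "16384" in the assertion text */

int main(void)
{
        /* my_aru: highest sequence number received in order on the current
         * ring.  token_seq: sequence number carried by an incoming regular
         * token.  If the token was generated against a different ring, the
         * two counters are unrelated and their distance can exceed the
         * fixed-size retransmission window, tripping the assert. */
        unsigned int my_aru = 7;
        unsigned int token_seq = 90000;
        unsigned int range = token_seq - my_aru;

        printf("range = %u\n", range);
        assert(range < QUEUE_RTR_ITEMS_SIZE_MAX);  /* aborts, as in the core dump */
        return 0;
}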

Comment 2 Steven Dake 2011-03-19 01:23:04 UTC
A highly reproducible test case is as follows:

Have 5 nodes: 1, 2, 3, 4, 5. Start node 5. Then start nodes 1-4 simultaneously, wait 10 seconds, then stop nodes 1-4 simultaneously. Keep repeating; the failure usually happens within 5-10 runs.
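
A minimal harness sketching this loop, for reference; the node hostnames, passwordless ssh, and driving corosync directly via "service corosync" are assumptions, not part of the actual test:

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

static const char *nodes[] = { "node1", "node2", "node3", "node4" };
#define NNODES (sizeof(nodes) / sizeof(nodes[0]))

/* Run "service corosync <action>" on nodes 1-4 roughly simultaneously. */
static void run_on_all(const char *action)
{
        pid_t pids[NNODES];
        char cmd[64];
        size_t i;

        snprintf(cmd, sizeof(cmd), "service corosync %s", action);
        for (i = 0; i < NNODES; i++) {
                pids[i] = fork();
                if (pids[i] == 0) {
                        execlp("ssh", "ssh", nodes[i], cmd, (char *)NULL);
                        _exit(127);
                }
        }
        for (i = 0; i < NNODES; i++)
                waitpid(pids[i], NULL, 0);
}

int main(void)
{
        int run;

        /* Node 5 is assumed to have been started once, before this loop. */
        for (run = 1; ; run++) {
                printf("run %d: starting nodes 1-4\n", run);
                run_on_all("start");
                sleep(10);
                printf("run %d: stopping nodes 1-4\n", run);
                run_on_all("stop");
        }
        return 0;
}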

Comment 3 Steven Dake 2011-03-19 01:31:22 UTC
The resolution for bug #623176 exposed this problem.

Nodes 3 and 4 are stopped (by the test's random stopping).
Nodes 1, 2, and 5 form a new configuration, and during recovery nodes 1 and 2 are stopped (via service cman or service corosync stop). As a result, node 5 never receives a recovery token within the timeout period, triggering a token loss in recovery. Bug #623176 resolved an assert which happened because the full ring id was being restored. The resolution to bug #623176 was to not restore the full ring id and instead operate (per the specification) on the new ring id. Unfortunately, this exposes a problem whereby the restarted nodes 1-4 generate the same ring id. This ring id reaches node 5, which failed recovery, and triggers a condition not accounted for in the original totem specification.

It appears later work from Dr. Agarwal's PhD dissertation considers this scenario. I have attached a patch which implements her solution, which is essentially to reject the regular token under these conditions.
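
For illustration only, a sketch of that token-rejection idea; the struct and field names are approximations of the totemsrp ones, and this is not the attached patch:

#include <stdint.h>
#include <string.h>

struct ring_id {
        uint32_t rep;                   /* representative node id */
        uint64_t seq;                   /* ring sequence number */
};

struct orf_token_header {
        struct ring_id ring_id;         /* ring the token was generated on */
        uint32_t seq;
};

struct srp_instance {
        struct ring_id my_ring_id;      /* ring this node currently operates in */
};

/* Return 1 if a regular token should be discarded rather than processed:
 * after a token loss in recovery, a token stamped with a ring id that does
 * not match this node's current ring id must not drive retransmission
 * bookkeeping, otherwise the range check in orf_token_rtr() blows up. */
int reject_foreign_token(const struct srp_instance *instance,
                         const struct orf_token_header *token)
{
        return memcmp(&token->ring_id, &instance->my_ring_id,
                      sizeof(struct ring_id)) != 0;
}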

Comment 4 Steven Dake 2011-03-19 17:46:06 UTC
Created attachment 486396 [details]
patch to resolve problem

Comment 5 Steven Dake 2011-03-19 17:47:14 UTC
Passed 3000 runs of the restart scenario, which was previously failing within 5 runs.

Comment 7 Nate Straz 2011-03-22 17:48:47 UTC
I'm still hitting this with the new rpm during the whiplash test.

[root@buzz-05 corosync]# rpm -q corosync
corosync-1.2.3-31.el6.x86_64

(gdb) bt
#0  0x0000003d876329e5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003d876341c5 in abort () at abort.c:92
#2  0x0000003d8762b975 in __assert_fail (assertion=0x7f4a3d4cbb82 "range < 16384", file=<value optimized out>, line=2497, function=<value optimized out>) at assert.c:81
#3  0x00007f4a3d4bbbe1 in orf_token_rtr (instance=0x7f4a3a5ee010, orf_token=0x7fff0d222950, fcc_allowed=0x7fff0d22236c) at totemsrp.c:2497
#4  0x00007f4a3d4c0b47 in message_handler_orf_token (instance=0x7f4a3a5ee010, msg=<value optimized out>, msg_len=<value optimized out>,
    endian_conversion_needed=<value optimized out>) at totemsrp.c:3493
#5  0x00007f4a3d4b7643 in rrp_deliver_fn (context=0x901600, msg=0x94b86c, msg_len=71) at totemrrp.c:1500
#6  0x00007f4a3d4b4136 in net_deliver_fn (handle=<value optimized out>, fd=<value optimized out>, revents=<value optimized out>, data=0x94b1a0) at totemudp.c:1244
#7  0x00007f4a3d4b009a in poll_run (handle=8117261320677490688) at coropoll.c:435
#8  0x0000000000406c6e in main (argc=<value optimized out>, argv=<value optimized out>, envp=<value optimized out>) at main.c:1816

Comment 8 Steven Dake 2011-03-24 15:38:55 UTC
Created attachment 487377 [details]
additional patch to resolve problem

Comment 10 Nate Straz 2011-03-25 11:12:01 UTC
Made it through 500 iterations of whiplash with corosync-1.2.3-32.el6.

Comment 14 errata-xmlrpc 2011-05-19 14:24:34 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0764.html

