Red Hat Bugzilla – Full Text Bug Listing
|Summary:||assertion failed "range < 16384" during corosync shutdown|
|Product:||Red Hat Enterprise Linux 6||Reporter:||Nate Straz <nstraz>|
|Component:||corosync||Assignee:||Steven Dake <sdake>|
|Status:||CLOSED ERRATA||QA Contact:||Cluster QE <mspqa-list>|
|Fixed In Version:||corosync-1.2.3-32.el6||Doc Type:||Bug Fix|
|Doc Text:||Story Points:||---|
|Last Closed:||2011-05-19 10:24:34 EDT||Type:||---|
|oVirt Team:||---||RHEL 7.3 requirements from Atomic Host:|
|Bug Depends On:|
Description Nate Straz 2011-03-17 14:03:44 EDT
Created attachment 486080 [details] core dump

Description of problem:
While running our whiplash test (which starts and stops the cluster in a loop), corosync on one node died with an assertion failure:

(gdb) bt
#0  0x0000003d876329e5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003d876341c5 in abort () at abort.c:92
#2  0x0000003d8762b975 in __assert_fail (assertion=0x3d89222be2 "range < 16384", file=<value optimized out>, line=2488, function=<value optimized out>) at assert.c:81
#3  0x0000003d89212c61 in orf_token_rtr (instance=0x7f0d34171010, orf_token=0x7fffdefe4fb0, fcc_allowed=0x7fffdefe49cc) at totemsrp.c:2488
#4  0x0000003d89217b88 in message_handler_orf_token (instance=0x7f0d34171010, msg=<value optimized out>, msg_len=<value optimized out>, endian_conversion_needed=<value optimized out>) at totemsrp.c:3481
#5  0x0000003d8920e6c3 in rrp_deliver_fn (context=0x23a4600, msg=0x23ee86c, msg_len=71) at totemrrp.c:1500
#6  0x0000003d8920b136 in net_deliver_fn (handle=<value optimized out>, fd=<value optimized out>, revents=<value optimized out>, data=0x23ee1a0) at totemudp.c:1244
#7  0x0000003d8920709a in poll_run (handle=8117261320677490688) at coropoll.c:435
#8  0x0000000000406c6e in main (argc=<value optimized out>, argv=<value optimized out>, envp=<value optimized out>) at main.c:1816
(gdb) quit

Version-Release number of selected component (if applicable):
corosync-1.2.3-29.el6

How reproducible:
Unknown

Actual results:

Expected results:

Additional info:
Comment 2 Steven Dake 2011-03-18 21:23:04 EDT
A highly reproducible test case is as follows: take 5 nodes, numbered 1-5. Start node 5. Then start nodes 1-4 simultaneously, wait 10 seconds, then stop nodes 1-4 simultaneously. Keep repeating; the failure usually occurs within 5-10 runs.
Comment 3 Steven Dake 2011-03-18 21:31:22 EDT
The resolution for bug #623176 exposed this problem. Nodes 3 and 4 are stopped (by the random stopping); nodes 1, 2, and 5 form a new configuration, and during recovery nodes 1 and 2 are stopped (via service cman or service corosync stop). This causes node 5 to never receive a recovery token within the timeout period, triggering a token loss in recovery. Bug #623176 resolved an assert which happened because the full ring id was being restored. The resolution to bug #623176 was to not restore the full ring id, and instead operate (according to the specification) with the new ring id. Unfortunately this exposes a problem whereby the restarting of nodes 1-4 generates the same ring id. This ring id reaches the recovery-failed node 5 and triggers a condition not accounted for in the original totem specification. It appears later work from Dr. Agarwal's PhD dissertation considers this scenario. I have attached a patch which implements her solution, which is essentially to reject the regular token under these conditions.
Comment 4 Steven Dake 2011-03-19 13:46:06 EDT
Created attachment 486396 [details] patch to resolve problem
Comment 5 Steven Dake 2011-03-19 13:47:14 EDT
The patched build passed 3000 restart scenarios; before the patch, the test was failing in under 5 runs.
Comment 7 Nate Straz 2011-03-22 13:48:47 EDT
I'm still hitting this with the new rpm during the whiplash test.

[root@buzz-05 corosync]# rpm -q corosync
corosync-1.2.3-31.el6.x86_64

(gdb) bt
#0  0x0000003d876329e5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003d876341c5 in abort () at abort.c:92
#2  0x0000003d8762b975 in __assert_fail (assertion=0x7f4a3d4cbb82 "range < 16384", file=<value optimized out>, line=2497, function=<value optimized out>) at assert.c:81
#3  0x00007f4a3d4bbbe1 in orf_token_rtr (instance=0x7f4a3a5ee010, orf_token=0x7fff0d222950, fcc_allowed=0x7fff0d22236c) at totemsrp.c:2497
#4  0x00007f4a3d4c0b47 in message_handler_orf_token (instance=0x7f4a3a5ee010, msg=<value optimized out>, msg_len=<value optimized out>, endian_conversion_needed=<value optimized out>) at totemsrp.c:3493
#5  0x00007f4a3d4b7643 in rrp_deliver_fn (context=0x901600, msg=0x94b86c, msg_len=71) at totemrrp.c:1500
#6  0x00007f4a3d4b4136 in net_deliver_fn (handle=<value optimized out>, fd=<value optimized out>, revents=<value optimized out>, data=0x94b1a0) at totemudp.c:1244
#7  0x00007f4a3d4b009a in poll_run (handle=8117261320677490688) at coropoll.c:435
#8  0x0000000000406c6e in main (argc=<value optimized out>, argv=<value optimized out>, envp=<value optimized out>) at main.c:1816
Comment 8 Steven Dake 2011-03-24 11:38:55 EDT
Created attachment 487377 [details] additional patch to resolve problem
Comment 10 Nate Straz 2011-03-25 07:12:01 EDT
Made it through 500 iterations of whiplash with corosync-1.2.3-32.el6.
Comment 14 errata-xmlrpc 2011-05-19 10:24:34 EDT
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2011-0764.html