Bug 688691
| Field | Value |
|---|---|
| Summary | assertion failed "range < 16384" during corosync shutdown |
| Product | Red Hat Enterprise Linux 6 |
| Component | corosync |
| Version | 6.1 |
| Hardware | All |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | urgent |
| Priority | urgent |
| Reporter | Nate Straz <nstraz> |
| Assignee | Steven Dake <sdake> |
| QA Contact | Cluster QE <mspqa-list> |
| CC | cluster-maint, jwest |
| Keywords | ZStream |
| Target Milestone | rc |
| Fixed In Version | corosync-1.2.3-32.el6 |
| Doc Type | Bug Fix |
| Bug Blocks | 696732 |
| Last Closed | 2011-05-19 14:24:34 UTC |
| Attachments | core dump (486080), patch to resolve problem (486396), additional patch to resolve problem (487377) |
A highly reproducible test case is as follows: take 5 nodes, numbered 1-5. Start node 5, then start nodes 1-4 simultaneously, wait 10 seconds, and stop nodes 1-4 simultaneously. Keep repeating; the failure usually occurs within 5-10 runs.

The resolution of bug #623176 exposed this problem. Nodes 3 and 4 are stopped (by the random stopping), nodes 1, 2, and 5 form a new configuration, and during recovery nodes 1 and 2 are stopped (via service cman stop or service corosync stop). As a result, node 5 never receives a recovery token within the timeout period, triggering a token loss in the recovery state.

Bug #623176 resolved an assert that fired because the full ring id was being restored. Its resolution was to not restore the full ring id and instead operate (per the specification) on the new ring id. Unfortunately, this exposes a problem whereby the restarted nodes 1-4 generate the same ring id. That ring id reaches node 5, which failed recovery, and triggers a condition not accounted for in the original totem specification. Later work in Dr. Agarwal's PhD dissertation appears to consider this scenario. I have attached a patch which implements her solution, which is essentially to reject the regular token under these conditions.

Created attachment 486396 [details]
patch to resolve problem
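For context, the shape of that solution can be sketched as follows. This is a minimal illustration, not the attached patch: the struct layouts, field names (memb_state, my_ring_id), and the exact rejection condition are assumptions modeled on the description above and on the totemsrp.c frames in the backtraces.

```c
#include <string.h>

enum memb_state {
	MEMB_STATE_OPERATIONAL = 1,
	MEMB_STATE_GATHER      = 2,
	MEMB_STATE_COMMIT      = 3,
	MEMB_STATE_RECOVERY    = 4
};

struct ring_id {
	unsigned int rep;       /* node id of the ring's representative */
	unsigned long long seq; /* ring sequence number */
};

struct srp_instance {
	enum memb_state memb_state;
	struct ring_id my_ring_id; /* ring id of the configuration being recovered */
};

struct orf_token_hdr {
	struct ring_id ring_id; /* ring this token belongs to */
	/* ... aru, seq, retransmission list, ... */
};

/*
 * Return 1 if a regular token should be discarded: the node is in the
 * recovery state, but the token carries a ring id for a different ring.
 * Restarted nodes can regenerate a ring id colliding with one the
 * surviving node has already seen; processing such a token corrupts the
 * retransmission bookkeeping and eventually trips
 * assert(range < 16384) in orf_token_rtr().
 */
static int reject_regular_token(const struct srp_instance *instance,
                                const struct orf_token_hdr *token)
{
	if (instance->memb_state == MEMB_STATE_RECOVERY &&
	    memcmp(&token->ring_id, &instance->my_ring_id,
	           sizeof(struct ring_id)) != 0) {
		return 1; /* stale or duplicate ring id: drop the token */
	}
	return 0;
}
```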
Passed 3000 iterations of the restart scenario, which previously failed within 5 runs.

I'm still hitting this with the new rpm during the whiplash test.

```
[root@buzz-05 corosync]# rpm -q corosync
corosync-1.2.3-31.el6.x86_64
(gdb) bt
#0  0x0000003d876329e5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003d876341c5 in abort () at abort.c:92
#2  0x0000003d8762b975 in __assert_fail (assertion=0x7f4a3d4cbb82 "range < 16384", file=<value optimized out>, line=2497, function=<value optimized out>) at assert.c:81
#3  0x00007f4a3d4bbbe1 in orf_token_rtr (instance=0x7f4a3a5ee010, orf_token=0x7fff0d222950, fcc_allowed=0x7fff0d22236c) at totemsrp.c:2497
#4  0x00007f4a3d4c0b47 in message_handler_orf_token (instance=0x7f4a3a5ee010, msg=<value optimized out>, msg_len=<value optimized out>, endian_conversion_needed=<value optimized out>) at totemsrp.c:3493
#5  0x00007f4a3d4b7643 in rrp_deliver_fn (context=0x901600, msg=0x94b86c, msg_len=71) at totemrrp.c:1500
#6  0x00007f4a3d4b4136 in net_deliver_fn (handle=<value optimized out>, fd=<value optimized out>, revents=<value optimized out>, data=0x94b1a0) at totemudp.c:1244
#7  0x00007f4a3d4b009a in poll_run (handle=8117261320677490688) at coropoll.c:435
#8  0x0000000000406c6e in main (argc=<value optimized out>, argv=<value optimized out>, envp=<value optimized out>) at main.c:1816
```

Created attachment 487377 [details]
additional patch to resolve problem
Made it through 500 iterations of whiplash with corosync-1.2.3-32.el6.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0764.html
Created attachment 486080 [details]
core dump

Description of problem:
While running our whiplash test (which starts and stops the cluster in a loop), corosync on one node died with an assertion failure:

```
(gdb) bt
#0  0x0000003d876329e5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003d876341c5 in abort () at abort.c:92
#2  0x0000003d8762b975 in __assert_fail (assertion=0x3d89222be2 "range < 16384", file=<value optimized out>, line=2488, function=<value optimized out>) at assert.c:81
#3  0x0000003d89212c61 in orf_token_rtr (instance=0x7f0d34171010, orf_token=0x7fffdefe4fb0, fcc_allowed=0x7fffdefe49cc) at totemsrp.c:2488
#4  0x0000003d89217b88 in message_handler_orf_token (instance=0x7f0d34171010, msg=<value optimized out>, msg_len=<value optimized out>, endian_conversion_needed=<value optimized out>) at totemsrp.c:3481
#5  0x0000003d8920e6c3 in rrp_deliver_fn (context=0x23a4600, msg=0x23ee86c, msg_len=71) at totemrrp.c:1500
#6  0x0000003d8920b136 in net_deliver_fn (handle=<value optimized out>, fd=<value optimized out>, revents=<value optimized out>, data=0x23ee1a0) at totemudp.c:1244
#7  0x0000003d8920709a in poll_run (handle=8117261320677490688) at coropoll.c:435
#8  0x0000000000406c6e in main (argc=<value optimized out>, argv=<value optimized out>, envp=<value optimized out>) at main.c:1816
(gdb) quit
```

Version-Release number of selected component (if applicable):
corosync-1.2.3-29.el6

How reproducible:
Unknown
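As background on the assertion itself: orf_token_rtr() drives retransmission, and the failed check bounds how far the token's sequence number may run ahead of this node's aru (all-received-up-to) marker, since the retransmission window is a fixed-size structure. The sketch below is a simplified model of that invariant, not the totemsrp.c source; names such as RETRANSMIT_WINDOW_MAX and my_aru are assumptions.

```c
#include <assert.h>

#define RETRANSMIT_WINDOW_MAX 16384 /* assumed name for the 16384 bound */

struct srp_state {
	unsigned int my_aru; /* highest seq received contiguously by this node */
};

/*
 * Simplified model of the invariant: on a valid token, the token's
 * high sequence number can never run more than the retransmission
 * window ahead of any member's aru.  A token from a stale or
 * regenerated ring id can carry an arbitrary seq, so the subtraction
 * below produces a huge range and the assertion aborts the daemon,
 * which is the crash captured in the backtrace above.
 */
static void rtr_check_range(const struct srp_state *state,
                            unsigned int token_seq)
{
	unsigned int range = token_seq - state->my_aru;

	assert(range < RETRANSMIT_WINDOW_MAX);

	/* ... walk (my_aru, token_seq] and queue retransmit requests ... */
}
```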