Bug 688691 - assertion failed "range < 16384" during corosync shutdown
Summary: assertion failed "range < 16384" during corosync shutdown
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: corosync
Version: 6.1
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Steven Dake
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks: 696732
 
Reported: 2011-03-17 18:03 UTC by Nate Straz
Modified: 2016-04-26 15:40 UTC (History)
2 users

Fixed In Version: corosync-1.2.3-32.el6
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-05-19 14:24:34 UTC
Target Upstream Version:
Embargoed:


Attachments
core dump (159.59 KB, application/x-xz), 2011-03-17 18:03 UTC, Nate Straz
patch to resolve problem (4.80 KB, patch), 2011-03-19 17:46 UTC, Steven Dake
additional patch to resolve problem (9.55 KB, patch), 2011-03-24 15:38 UTC, Steven Dake


Links
Red Hat Product Errata RHBA-2011:0764, priority normal, status SHIPPED_LIVE: corosync bug fix update (last updated 2011-05-18 18:08:44 UTC)

Description Nate Straz 2011-03-17 18:03:44 UTC
Created attachment 486080 [details]
core dump

Description of problem:

While running our whiplash test (which starts and stops the cluster in a loop), corosync on one node died with an assertion failure:

(gdb) bt
#0  0x0000003d876329e5 in raise (sig=6)
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003d876341c5 in abort () at abort.c:92
#2  0x0000003d8762b975 in __assert_fail (
    assertion=0x3d89222be2 "range < 16384", file=<value optimized out>,
    line=2488, function=<value optimized out>) at assert.c:81
#3  0x0000003d89212c61 in orf_token_rtr (instance=0x7f0d34171010,
    orf_token=0x7fffdefe4fb0, fcc_allowed=0x7fffdefe49cc) at totemsrp.c:2488
#4  0x0000003d89217b88 in message_handler_orf_token (instance=0x7f0d34171010,
    msg=<value optimized out>, msg_len=<value optimized out>,
    endian_conversion_needed=<value optimized out>) at totemsrp.c:3481
#5  0x0000003d8920e6c3 in rrp_deliver_fn (context=0x23a4600, msg=0x23ee86c,
    msg_len=71) at totemrrp.c:1500
#6  0x0000003d8920b136 in net_deliver_fn (handle=<value optimized out>,
    fd=<value optimized out>, revents=<value optimized out>, data=0x23ee1a0)
    at totemudp.c:1244
#7  0x0000003d8920709a in poll_run (handle=8117261320677490688)
    at coropoll.c:435
#8  0x0000000000406c6e in main (argc=<value optimized out>,
    argv=<value optimized out>, envp=<value optimized out>) at main.c:1816
(gdb) quit


Version-Release number of selected component (if applicable):
corosync-1.2.3-29.el6

How reproducible:
Unknown
 
Actual results:


Expected results:


Additional info:

Comment 2 Steven Dake 2011-03-19 01:23:04 UTC
A highly reproducible test case is as follows:

Have 5 nodes: 1, 2, 3, 4, 5. Start node 5, then start nodes 1-4 simultaneously, wait 10 seconds, then stop nodes 1-4 simultaneously. Keep repeating; the failure usually occurs within 5-10 runs.
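For illustration, here is a minimal C driver sketching that loop. The hostnames (node1 through node5), the use of ssh, and "service corosync start/stop" are assumptions made for this sketch, not details of the actual QE harness:

/*
 * Minimal sketch of the restart loop described above. Hostnames, ssh,
 * and "service corosync start/stop" are illustrative assumptions.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void corosync_do(const char *node, const char *action)
{
    char cmd[128];
    /* background each ssh so nodes 1-4 act simultaneously */
    snprintf(cmd, sizeof(cmd), "ssh %s service corosync %s &", node, action);
    system(cmd);
}

int main(void)
{
    corosync_do("node5", "start");            /* node 5 stays up throughout */
    sleep(10);
    for (int run = 1; run <= 20; run++) {     /* usually fails within 5-10 runs */
        for (int i = 1; i <= 4; i++) {
            char node[8];
            snprintf(node, sizeof(node), "node%d", i);
            corosync_do(node, "start");
        }
        sleep(10);
        for (int i = 1; i <= 4; i++) {
            char node[8];
            snprintf(node, sizeof(node), "node%d", i);
            corosync_do(node, "stop");
        }
        sleep(10);
        printf("completed run %d\n", run);
    }
    return 0;
}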

Comment 3 Steven Dake 2011-03-19 01:31:22 UTC
The resolution for bug #623176 exposed this problem.

Nodes 3 and 4 are stopped (at random by the test).
Nodes 1, 2, and 5 form a new configuration, and during recovery nodes 1 and 2 are stopped (via service cman stop or service corosync stop). As a result, node 5 never receives a recovery token within the timeout period, triggering a token loss in recovery. Bug #623176 resolved an assert that occurred because the full ring id was being restored. The fix for bug #623176 was to not restore the full ring id and instead operate (per the specification) on the new ring id. Unfortunately, this exposes a problem whereby restarting nodes 1-4 generates the same ring id. This ring id reaches node 5, whose recovery failed, and triggers a condition not accounted for in the original totem specification.

It appears later work from Dr. Agarwal's PhD dissertation considers this scenario. I have attached a patch which implements her solution, which is essentially to reject the regular token under these conditions.
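To make the failure mode and the fix concrete: the assert in the backtrace fires in orf_token_rtr() when the computed retransmission range exceeds the 16384-entry window. The following is an illustrative sketch only, with hypothetical names and fields; it is not the actual corosync source or the attached patch:

/*
 * Illustrative sketch, not the actual corosync code. It shows why a
 * reused ring id arriving at a node stuck in recovery can push the
 * retransmission range past the 16384 window guarded by the failing
 * assert, and how rejecting the token avoids that path.
 */
#include <assert.h>
#include <stdio.h>

#define RETRANS_WINDOW 16384            /* the bound in "range < 16384" */

enum node_state { OPERATIONAL, RECOVERY };
struct ring_id { unsigned int rep; unsigned long long seq; };

/* Returns 0 if the regular token is accepted, -1 if rejected. */
static int orf_token_check(enum node_state state,
                           const struct ring_id *token_ring,
                           const struct ring_id *old_ring,
                           unsigned int token_seq, unsigned int my_aru)
{
    /* The fix: while still in recovery, reject a regular token that
     * carries a ring id we have already seen. */
    if (state == RECOVERY &&
        token_ring->rep == old_ring->rep &&
        token_ring->seq == old_ring->seq) {
        return -1;
    }
    unsigned int range = token_seq - my_aru;  /* unrelated seq spaces */
    assert(range < RETRANS_WINDOW);           /* would fire otherwise */
    return 0;
}

int main(void)
{
    struct ring_id reused = { 1, 42 };  /* nodes 1-4 regenerate this id */
    /* node 5's aru is from the failed ring; the token's seq is unrelated */
    int rc = orf_token_check(RECOVERY, &reused, &reused, 100000, 4);
    printf("token %s\n", rc ? "rejected, assert avoided" : "accepted");
    return 0;
}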

Comment 4 Steven Dake 2011-03-19 17:46:06 UTC
Created attachment 486396 [details]
patch to resolve problem

Comment 5 Steven Dake 2011-03-19 17:47:14 UTC
The patched build passed 3000 restart scenarios; before the patch, the test was failing within 5 runs.

Comment 7 Nate Straz 2011-03-22 17:48:47 UTC
I'm still hitting this with the new rpm during the whiplash test.

[root@buzz-05 corosync]# rpm -q corosync
corosync-1.2.3-31.el6.x86_64

(gdb) bt
#0  0x0000003d876329e5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003d876341c5 in abort () at abort.c:92
#2  0x0000003d8762b975 in __assert_fail (assertion=0x7f4a3d4cbb82 "range < 16384", file=<value optimized out>, line=2497, function=<value optimized out>) at assert.c:81
#3  0x00007f4a3d4bbbe1 in orf_token_rtr (instance=0x7f4a3a5ee010, orf_token=0x7fff0d222950, fcc_allowed=0x7fff0d22236c) at totemsrp.c:2497
#4  0x00007f4a3d4c0b47 in message_handler_orf_token (instance=0x7f4a3a5ee010, msg=<value optimized out>, msg_len=<value optimized out>,
    endian_conversion_needed=<value optimized out>) at totemsrp.c:3493
#5  0x00007f4a3d4b7643 in rrp_deliver_fn (context=0x901600, msg=0x94b86c, msg_len=71) at totemrrp.c:1500
#6  0x00007f4a3d4b4136 in net_deliver_fn (handle=<value optimized out>, fd=<value optimized out>, revents=<value optimized out>, data=0x94b1a0) at totemudp.c:1244
#7  0x00007f4a3d4b009a in poll_run (handle=8117261320677490688) at coropoll.c:435
#8  0x0000000000406c6e in main (argc=<value optimized out>, argv=<value optimized out>, envp=<value optimized out>) at main.c:1816

Comment 8 Steven Dake 2011-03-24 15:38:55 UTC
Created attachment 487377 [details]
additional patch to resolve problem

Comment 10 Nate Straz 2011-03-25 11:12:01 UTC
Made it through 500 iterations of whiplash with corosync-1.2.3-32.el6.

Comment 14 errata-xmlrpc 2011-05-19 14:24:34 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0764.html

