This bug has been copied from bug #544680 and has been proposed to be backported to 5.3 z-stream (EUS).
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: Previously, the openais protocol would fail to recover when originating more than 206 messages in the recovery state. The size of the recovery queue was increased to a more reasonable value. With this update, the protocol finishes the recovery as expected.
There seems to be a problem with the recovery. How it happened:

1. 4-node cluster, service cman start.
2. nice -19 cpgverify > /dev/null & (3 times on each node)
3. Simulated packet loss of up to 99% via netem on z1.
4. When there was a small number of messages in recovery (<200) it behaved more or less as expected, but when I let it accumulate a few more messages, it resulted in a cluster-wide meltdown of aisexec.

Oct 27 06:24:51 z2 openais[7139]: [TOTEM] Retransmit List: 574c 574d 56bd 56c0 5702 5704 5705 5706 5707 5729 572a
Oct 27 06:24:51 z2 openais[7139]: [TOTEM] Retransmit List: 5471 54b8 569c 572b 574b
Oct 27 06:25:01 z2 openais[7139]: [TOTEM] The token was lost in the OPERATIONAL state.
Oct 27 06:25:01 z2 openais[7139]: [TOTEM] Receive multicast socket recv buffer size (262142 bytes).
Oct 27 06:25:01 z2 openais[7139]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Oct 27 06:25:01 z2 openais[7139]: [TOTEM] entering GATHER state from 2.
Oct 27 06:25:15 z2 openais[7139]: [TOTEM] Saving state aru 5470 high seq received 5896
Oct 27 06:25:15 z2 openais[7139]: [TOTEM] Storing new sequence id for ring 6c
Oct 27 06:25:15 z2 openais[7139]: [TOTEM] entering COMMIT state.
Oct 27 06:25:25 z2 openais[7139]: [TOTEM] entering GATHER state from 13.
Oct 27 06:25:45 z2 openais[7139]: [TOTEM] entering GATHER state from 11.
Oct 27 06:25:45 z2 openais[7139]: [TOTEM] Creating commit token because I am the rep.
Oct 27 06:25:45 z2 openais[7139]: [TOTEM] Storing new sequence id for ring 70
Oct 27 06:25:45 z2 openais[7139]: [TOTEM] entering COMMIT state.
Oct 27 06:25:45 z2 openais[7139]: [TOTEM] entering RECOVERY state.
Oct 27 06:25:45 z2 openais[7139]: [TOTEM] position [0] member 10.15.89.15:
Oct 27 06:25:45 z2 openais[7139]: [TOTEM] previous ring seq 100 rep 10.15.89.14
Oct 27 06:25:45 z2 openais[7139]: [TOTEM] aru 5470 high delivered 5470 received flag 0
Oct 27 06:25:45 z2 openais[7139]: [TOTEM] position [1] member 10.15.89.16:
Oct 27 06:25:45 z2 openais[7139]: [TOTEM] previous ring seq 100 rep 10.15.89.14
Oct 27 06:25:45 z2 openais[7139]: [TOTEM] aru 5470 high delivered 5470 received flag 0
Oct 27 06:25:45 z2 openais[7139]: [TOTEM] position [2] member 10.15.89.17:
Oct 27 06:25:45 z2 openais[7139]: [TOTEM] previous ring seq 100 rep 10.15.89.14
Oct 27 06:25:45 z2 openais[7139]: [TOTEM] aru 5470 high delivered 5470 received flag 0
Oct 27 06:25:45 z2 openais[7139]: [TOTEM] copying all old ring messages from 5471-5896.
Oct 27 06:25:45 z2 dlm_controld[7165]: cluster is down, exiting
Oct 27 06:25:45 z2 gfs_controld[7171]: cluster is down, exiting
Oct 27 06:25:45 z2 fenced[7159]: cluster is down, exiting
Oct 27 06:25:45 z2 kernel: dlm: closing connection to node 4
Oct 27 06:25:45 z2 kernel: dlm: closing connection to node 3
Oct 27 06:25:45 z2 kernel: dlm: closing connection to node 2
Oct 27 06:25:47 z2 kernel: dlm: closing connection to node 1
Oct 27 06:26:08 z2 ccsd[7133]: Unable to connect to cluster infrastructure after 30 seconds.

$ rpm -q openais
openais-0.80.3-22.el5_3.15

I will attach the core file.
Created attachment 455953 [details] corefile of aisexec during/after recovery
Linux z2 2.6.18-128.23.2.el5 #1 SMP Fri Sep 17 06:21:22 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

As this is a mixed cluster, the other nodes (ia64, i686) have their respective versions of the same kernel.
(gdb) where
#0  0x00000038cd630215 in raise () from /lib64/libc.so.6
#1  0x00000038cd631cc0 in abort () from /lib64/libc.so.6
#2  0x00000038cd629696 in __assert_fail () from /lib64/libc.so.6
#3  0x0000000000412047 in message_handler_memb_commit_token (instance=0x2aaaaaaae010, msg=<value optimized out>, msg_len=<value optimized out>, endian_conversion_needed=<value optimized out>) at ../include/queue.h:118
#4  0x0000000000409c6e in rrp_deliver_fn (context=0x2aaab0000a10, msg=0x2aaab0056e08, msg_len=312) at totemrrp.c:1319
#5  0x00000000004080d6 in net_deliver_fn (handle=<value optimized out>, fd=<value optimized out>, revents=<value optimized out>, data=0x2aaab0056760) at totemnet.c:676
#6  0x0000000000405b60 in poll_run (handle=0) at aispoll.c:402
#7  0x0000000000418652 in main (argc=<value optimized out>, argv=<value optimized out>) at main.c:617
(gdb) up
#3  0x0000000000412047 in message_handler_memb_commit_token (instance=0x2aaaaaaae010, msg=<value optimized out>, msg_len=<value optimized out>, endian_conversion_needed=<value optimized out>) at ../include/queue.h:118
118         assert (queue->tail != queue->head);
(gdb) print *memb_commit_token
$1 = {header = {type = 4 '\004', encapsulated = 0 '\0', endian_detector = 65314, nodeid = 4}, token_seq = 4, ring_id = {rep = {nodeid = 2, family = 2, addr = "\n\017Y\017\b\000\002\000\n\017Y\017\b\000\004"}, seq = 112}, retrans_flg = 0, memb_index = 3, addr_entries = 3, end_of_commit_token = 0x2aaab0056e3e "\002"}

This reported core dump is a dup of Bug #631835. Thanks for the test case; I'll give it a go on the parent bug, as it should help in understanding how to fix that existing issue. Changing this bug back to ON_QA, since the bug described in this bugzilla is actually fixed; a separate issue is causing your test case to crash.
(gdb) print instance->retrans_message_queue
$3 = {head = 499, tail = 499, used = 499, usedhw = 499, size = 500, items = 0x2aaab0000b40, size_per_item = 176, iterator = 0,
  mutex = {__data = {__lock = 1, __count = 0, __owner = 7139, __nusers = 1, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}},
    __size = "\001\000\000\000\000\000\000\000�\033\000\000\001", '\0' <repeats 26 times>, __align = 1}}
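The struct dump explains the abort: with size = 500 and used = usedhw = 499, the retransmit queue can hold at most size - 1 items (one slot is sacrificed so that head == tail can mean "empty"), and the next enqueue advanced tail onto head, tripping assert (queue->tail != queue->head) at queue.h:118. A minimal sketch of that ring-buffer discipline, with hypothetical names rather than the actual openais queue.h source:

```c
#include <assert.h>
#include <string.h>

#define QUEUE_SIZE 500   /* matches the 500-item retrans_message_queue */
#define ITEM_SIZE  176   /* matches size_per_item in the dump */

struct queue {
    int head;            /* next slot to consume */
    int tail;            /* next free slot to produce into */
    int used;
    char items[QUEUE_SIZE][ITEM_SIZE];
};

static void queue_init(struct queue *q)
{
    q->head = q->tail = q->used = 0;
}

static int queue_is_full(const struct queue *q)
{
    /* capacity is QUEUE_SIZE - 1: one slot distinguishes full from empty */
    return q->used == QUEUE_SIZE - 1;
}

static void queue_item_add(struct queue *q, const void *item)
{
    memcpy(q->items[q->tail], item, ITEM_SIZE);
    q->tail = (q->tail + 1) % QUEUE_SIZE;
    assert(q->tail != q->head);   /* overflow: the assert seen in the backtrace */
    q->used++;
}
```

Filling the queue to its 499-item capacity succeeds; the 500th add wraps tail around onto head and aborts, which is consistent with the head = tail = 499, used = 499 state in the core.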
This SIGABRT is a dup of bug #588627, which was not copied to the 5.3 z-stream. It is caused when the retransmit list becomes full. The bug described in this bugzilla is fixed. Please ignore the comments about Bug #631835 in comment #8, although the stack trace is valid. If you want bug #588627 backported to the z-stream, please speak with Dean Jansa.
Together with the fix for bug #647290, this now looks much better:

Nov 1 10:08:41 z2 openais[9796]: [TOTEM] The token was lost in the OPERATIONAL state.
Nov 1 10:08:41 z2 openais[9796]: [TOTEM] Receive multicast socket recv buffer size (320000 bytes).
Nov 1 10:08:41 z2 openais[9796]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Nov 1 10:08:41 z2 openais[9796]: [TOTEM] entering GATHER state from 2.
Nov 1 10:09:01 z2 openais[9796]: [TOTEM] entering GATHER state from 11.
Nov 1 10:09:01 z2 openais[9796]: [TOTEM] Creating commit token because I am the rep.
Nov 1 10:09:01 z2 openais[9796]: [TOTEM] Saving state aru c2f7 high seq received c5ba
Nov 1 10:09:01 z2 openais[9796]: [TOTEM] Storing new sequence id for ring 84
Nov 1 10:09:01 z2 openais[9796]: [TOTEM] entering COMMIT state.
Nov 1 10:09:01 z2 openais[9796]: [TOTEM] entering RECOVERY state.
Nov 1 10:09:01 z2 openais[9796]: [TOTEM] position [0] member 10.15.89.15:
Nov 1 10:09:01 z2 openais[9796]: [TOTEM] previous ring seq 128 rep 10.15.89.14
Nov 1 10:09:01 z2 openais[9796]: [TOTEM] aru c2f7 high delivered c2f7 received flag 0
Nov 1 10:09:01 z2 openais[9796]: [TOTEM] position [1] member 10.15.89.16:
Nov 1 10:09:01 z2 openais[9796]: [TOTEM] previous ring seq 128 rep 10.15.89.14
Nov 1 10:09:01 z2 openais[9796]: [TOTEM] aru c2f7 high delivered c2f7 received flag 0
Nov 1 10:09:01 z2 openais[9796]: [TOTEM] position [2] member 10.15.89.17:
Nov 1 10:09:01 z2 openais[9796]: [TOTEM] previous ring seq 128 rep 10.15.89.14
Nov 1 10:09:01 z2 openais[9796]: [TOTEM] aru c2f7 high delivered c2f7 received flag 0
Nov 1 10:09:01 z2 openais[9796]: [TOTEM] copying all old ring messages from c2f8-c5ba.
Nov 1 10:09:01 z2 openais[9796]: [TOTEM] Originated 687 messages in RECOVERY.

openais-0.80.3-22.el5_3.16
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2010-0828.html