Bug 584559
| Summary: | originating 206 or more messages in recovery causes totem to block | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Benjamin Kahn <bkahn> | ||||
| Component: | openais | Assignee: | Steven Dake <sdake> | ||||
| Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> | ||||
| Severity: | urgent | Docs Contact: | |||||
| Priority: | urgent | ||||||
| Version: | 5.4 | CC: | cluster-maint, dejohnso, djansa, edamato, fnadge, jkortus, jwest, pm-eus, sdake | ||||
| Target Milestone: | rc | Keywords: | ZStream | ||||
| Target Release: | --- | ||||||
| Hardware: | All | ||||||
| OS: | Linux | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | openais-0.80.3-22.el5_3.14 | Doc Type: | Bug Fix | ||||
| Doc Text: |
Previously, the openais protocol would fail to recover when originating more then 206 messages in the recovery state. The size of the recovery queue was increased to a more reasonable value. With this update, the protocol finishes the recovery as expected.
|
Story Points: | --- | ||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2010-11-05 15:25:42 UTC | Type: | --- | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Bug Depends On: | 544680 | ||||||
| Bug Blocks: | |||||||
| Attachments: |
|
||||||
|
Description
Benjamin Kahn
2010-04-21 20:53:47 UTC
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.
New Contents:
Previously, the openais protocol would fail to recover when originating more then 206 messages in the recovery state. The size of the recovery queue was increased to a more reasonable value. With this update, the protocol finishes the recovery as expected.
There seems to be a problem with the recovery. How did it happen: 1. 4-node cluster, service cman start. 2. nice -19 cpgverify > /dev/null & (3 times on each node) 3. simulated packet loss of up to 99% via netem on z1 4. when there was small number of messages in recovery (<200) it behaved more or less as expected, but when I let it to cumulate the messages a bit more, then it resulted in cluster-wide meltdown of aisexec. Oct 27 06:24:51 z2 openais[7139]: [TOTEM] Retransmit List: 574c 574d 56bd 56c0 5702 5704 5705 5706 5707 5729 572a Oct 27 06:24:51 z2 openais[7139]: [TOTEM] Retransmit List: 5471 54b8 569c 572b 574b Oct 27 06:25:01 z2 openais[7139]: [TOTEM] The token was lost in the OPERATIONAL state. Oct 27 06:25:01 z2 openais[7139]: [TOTEM] Receive multicast socket recv buffer size (262142 bytes). Oct 27 06:25:01 z2 openais[7139]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes). Oct 27 06:25:01 z2 openais[7139]: [TOTEM] entering GATHER state from 2. Oct 27 06:25:15 z2 openais[7139]: [TOTEM] Saving state aru 5470 high seq received 5896 Oct 27 06:25:15 z2 openais[7139]: [TOTEM] Storing new sequence id for ring 6c Oct 27 06:25:15 z2 openais[7139]: [TOTEM] entering COMMIT state. Oct 27 06:25:25 z2 openais[7139]: [TOTEM] entering GATHER state from 13. Oct 27 06:25:45 z2 openais[7139]: [TOTEM] entering GATHER state from 11. Oct 27 06:25:45 z2 openais[7139]: [TOTEM] Creating commit token because I am the rep. Oct 27 06:25:45 z2 openais[7139]: [TOTEM] Storing new sequence id for ring 70 Oct 27 06:25:45 z2 openais[7139]: [TOTEM] entering COMMIT state. Oct 27 06:25:45 z2 openais[7139]: [TOTEM] entering RECOVERY state. Oct 27 06:25:45 z2 openais[7139]: [TOTEM] position [0] member 10.15.89.15: Oct 27 06:25:45 z2 openais[7139]: [TOTEM] previous ring seq 100 rep 10.15.89.14 Oct 27 06:25:45 z2 openais[7139]: [TOTEM] aru 5470 high delivered 5470 received flag 0 Oct 27 06:25:45 z2 openais[7139]: [TOTEM] position [1] member 10.15.89.16: Oct 27 06:25:45 z2 openais[7139]: [TOTEM] previous ring seq 100 rep 10.15.89.14 Oct 27 06:25:45 z2 openais[7139]: [TOTEM] aru 5470 high delivered 5470 received flag 0 Oct 27 06:25:45 z2 openais[7139]: [TOTEM] position [2] member 10.15.89.17: Oct 27 06:25:45 z2 openais[7139]: [TOTEM] previous ring seq 100 rep 10.15.89.14 Oct 27 06:25:45 z2 openais[7139]: [TOTEM] aru 5470 high delivered 5470 received flag 0 Oct 27 06:25:45 z2 openais[7139]: [TOTEM] copying all old ring messages from 5471-5896. Oct 27 06:25:45 z2 dlm_controld[7165]: cluster is down, exiting Oct 27 06:25:45 z2 gfs_controld[7171]: cluster is down, exiting Oct 27 06:25:45 z2 fenced[7159]: cluster is down, exiting Oct 27 06:25:45 z2 kernel: dlm: closing connection to node 4 Oct 27 06:25:45 z2 kernel: dlm: closing connection to node 3 Oct 27 06:25:45 z2 kernel: dlm: closing connection to node 2 Oct 27 06:25:47 z2 kernel: dlm: closing connection to node 1 Oct 27 06:26:08 z2 ccsd[7133]: Unable to connect to cluster infrastructure after 30 seconds. $ rpm -q openais openais-0.80.3-22.el5_3.15 I will attach the core file. Created attachment 455953 [details]
corefile of aisexec during/after recovery
Linux z2 2.6.18-128.23.2.el5 #1 SMP Fri Sep 17 06:21:22 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux As it is mixed cluster, the other nodes (ia64, i686) have their respective versions of the same kernel. (gdb) where
#0 0x00000038cd630215 in raise () from /lib64/libc.so.6
#1 0x00000038cd631cc0 in abort () from /lib64/libc.so.6
#2 0x00000038cd629696 in __assert_fail () from /lib64/libc.so.6
#3 0x0000000000412047 in message_handler_memb_commit_token (
instance=0x2aaaaaaae010, msg=<value optimized out>,
msg_len=<value optimized out>,
endian_conversion_needed=<value optimized out>) at ../include/queue.h:118
#4 0x0000000000409c6e in rrp_deliver_fn (context=0x2aaab0000a10,
msg=0x2aaab0056e08, msg_len=312) at totemrrp.c:1319
#5 0x00000000004080d6 in net_deliver_fn (handle=<value optimized out>,
fd=<value optimized out>, revents=<value optimized out>,
data=0x2aaab0056760) at totemnet.c:676
#6 0x0000000000405b60 in poll_run (handle=0) at aispoll.c:402
#7 0x0000000000418652 in main (argc=<value optimized out>,
argv=<value optimized out>) at main.c:617
(gdb) up
#3 0x0000000000412047 in message_handler_memb_commit_token (
instance=0x2aaaaaaae010, msg=<value optimized out>,
msg_len=<value optimized out>,
endian_conversion_needed=<value optimized out>) at ../include/queue.h:118
118 assert (queue->tail != queue->head);
(gdb) print *memb_commit_token
$1 = {header = {type = 4 '\004', encapsulated = 0 '\0',
endian_detector = 65314, nodeid = 4}, token_seq = 4, ring_id = {rep = {
nodeid = 2, family = 2,
addr = "\n\017Y\017\b\000\002\000\n\017Y\017\b\000\004"}, seq = 112},
retrans_flg = 0, memb_index = 3, addr_entries = 3,
end_of_commit_token = 0x2aaab0056e3e "\002"}
This reported core dump is a dup of Bug #631835. Thanks for the test case, I'll give it a go on the parent bug - should help understand how to fix this existing issue. Changing this bug back to ON_QA. Since the bug described in this bugzilla is actually fixed, but another issue is causing you test case to segfault.
(gdb) print instance->retrans_message_queue
$3 = {head = 499, tail = 499, used = 499, usedhw = 499, size = 500,
items = 0x2aaab0000b40, size_per_item = 176, iterator = 0, mutex = {
__data = {__lock = 1, __count = 0, __owner = 7139, __nusers = 1,
__kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}},
__size = "\001\000\000\000\000\000\000\000�\033\000\000\001", '\0' <repeats 26 times>, __align = 1}}
This sigabort is a dup of bug #588627 which was not copied to the 5.3 z stream. It is caused when the retrans list becomes full. The bug described in this bugzilla is fixed. Please ignore comments about Bug #631835 in comment #8 although the stack trace is valid. If you want 588627 backported to z stream, please speak with Dean Jansa. together with fix for bug 647290 this now looks much better: Nov 1 10:08:41 z2 openais[9796]: [TOTEM] The token was lost in the OPERATIONAL state. Nov 1 10:08:41 z2 openais[9796]: [TOTEM] Receive multicast socket recv buffer size (320000 bytes). Nov 1 10:08:41 z2 openais[9796]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes). Nov 1 10:08:41 z2 openais[9796]: [TOTEM] entering GATHER state from 2. Nov 1 10:09:01 z2 openais[9796]: [TOTEM] entering GATHER state from 11. Nov 1 10:09:01 z2 openais[9796]: [TOTEM] Creating commit token because I am the rep. Nov 1 10:09:01 z2 openais[9796]: [TOTEM] Saving state aru c2f7 high seq received c5ba Nov 1 10:09:01 z2 openais[9796]: [TOTEM] Storing new sequence id for ring 84 Nov 1 10:09:01 z2 openais[9796]: [TOTEM] entering COMMIT state. Nov 1 10:09:01 z2 openais[9796]: [TOTEM] entering RECOVERY state. Nov 1 10:09:01 z2 openais[9796]: [TOTEM] position [0] member 10.15.89.15: Nov 1 10:09:01 z2 openais[9796]: [TOTEM] previous ring seq 128 rep 10.15.89.14 Nov 1 10:09:01 z2 openais[9796]: [TOTEM] aru c2f7 high delivered c2f7 received flag 0 Nov 1 10:09:01 z2 openais[9796]: [TOTEM] position [1] member 10.15.89.16: Nov 1 10:09:01 z2 openais[9796]: [TOTEM] previous ring seq 128 rep 10.15.89.14 Nov 1 10:09:01 z2 openais[9796]: [TOTEM] aru c2f7 high delivered c2f7 received flag 0 Nov 1 10:09:01 z2 openais[9796]: [TOTEM] position [2] member 10.15.89.17: Nov 1 10:09:01 z2 openais[9796]: [TOTEM] previous ring seq 128 rep 10.15.89.14 Nov 1 10:09:01 z2 openais[9796]: [TOTEM] aru c2f7 high delivered c2f7 received flag 0 Nov 1 10:09:01 z2 openais[9796]: [TOTEM] copying all old ring messages from c2f8-c5ba. Nov 1 10:09:01 z2 openais[9796]: [TOTEM] Originated 687 messages in RECOVERY. openais-0.80.3-22.el5_3.16 An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2010-0828.html |