Hide Forgot
Description of problem: corosync sometimes crashes while doing cman_tool join; cman_tool leave in a loop Version-Release number of selected component (if applicable): corosync-1.4.1-1.el6.x86_64 How reproducible: I've seen 2 crashes in ~15 minutes Steps to Reproduce: 1. while true; do cman_tool join; cman_tool leave; done (on all cluster nodes) 2. 3. Actual results: Most of the time it runs as expected, nodes leaving/joining, but sometimes it crashes (is aborted). Expected results: No crashes/aborts Additional info: Core was generated by `corosync -f'. Program terminated with signal 6, Aborted. #0 0x00000030ad232a45 in raise () from /lib64/libc.so.6 No symbol table info available. #1 0x00000030ad234225 in abort () from /lib64/libc.so.6 No symbol table info available. #2 0x00000030ad22b9d5 in __assert_fail () from /lib64/libc.so.6 No symbol table info available. #3 0x0000003d71013aa1 in orf_token_rtr (instance=0x7fe738486010, orf_token=0x7fff9a061ff0, fcc_allowed=0x7fff9a061a0c) at totemsrp.c:2528 res = <value optimized out> i = 0 j = <value optimized out> found = <value optimized out> total_entries = <value optimized out> sort_queue = 0x7fe7384a7780 rtr_list = 0x7fff9a062036 range = <value optimized out> retransmit_msg = "Retransmit List: \000\272\224Um\330\002p\212&\336,\027\362J\350e\371)\325\371\213\243\242\036\212\270\301\022\255\337zh'\316M\b\177\fM\003q'\365\367\305-\036\251+\341؍Qa\277\233\366^e\202s\356\365.\325\313\026\323G5`ݺ}\362o\322\036\204\002>\212Xw\233\301*\005\023m\333\064?o5e0\306\bS\334\363ч\276\251\325\341\354([\306~\214i\376\017\316k\243\270\241\346G\353p\223Ѡ|\016\231\342\363\246?\242]\331Q\n\026\260\273\005\232\377\177\000\000R\t\f\232\377\177\000\000\001\000\000\000\000\000\000\000\234D_\002\000\000\000\000\234D_\002\000\000\000\000\000\274\005\232\000\000\000\000\300\273\005\232\377\177\000\000\020`H8\347\177\000\000\260\275\005\232\377\177\000\000\234D_\002\000\000\000\000\234D_\002\000\000\000\000\000\274\005\232\377\177\000\000н\005\232\377\177\000\000\071\201\001q=\000\000\000@U\002q=\000\000\000 \274\005\232\377\177\000\000 \001\000\000\000\000\000\000\020\000\000\000\000\000\000\000\254"... value = "\001\000\000\000\000\000\000\000\020`H8\347\177\000\000\020\001Z\002\000\000\000\000\350wJ8\347\177\000\000\020\277\005\232\377\177\000\000\067\n\f\232\377\177\000\000@\277\005\232\377\177\000\000\177>`\256\060\000\000" __FUNCTION__ = "orf_token_rtr" __PRETTY_FUNCTION__ = "orf_token_rtr" #4 0x0000003d71018a17 in message_handler_orf_token (instance=0x7fe738486010, msg=<value optimized out>, msg_len=<value optimized out>, endian_conversion_needed=<value optimized out>) at totemsrp.c:3517 token_storage = "\000\000\"\377\004\000\000\000\000\000\000\000\016", '\000' <repeats 11 times>, "\002\000\000\000\002\000\n\017YH\b\000\002\000\n\017YH\b\000\004\000\250$", '\000' <repeats 24 times>, "\002\000\n\017YG\b\000\002\000\n\017YG\b\000\004", '\000' <repeats 24 times>, "\017YI\b\000\004", '\000' <repeats 23 times>, "\004\000\000\000\002\000\n\017YJ\b\000\002\000\n\017YJ\b\000\004", '\000' <repeats 23 times>, "\005\000\000\000\002\000\n\017YK\b\000\002\000\n\017YK\b\000\004", '\000' <repeats 23 times>, "\002\000\000\000\002\000\n\017YH\b\000\002\000\n\017YH\b\000\004\000\244$\000\000\000\000\000\000\r\000\000\000\r\000\000\000\001\000\000\000"... token_convert = "(\000\000\000\060\000\000\000\360%\006\232\377\177\000\000\060%\006\232\377\177\000\000\314\033\006\232\377\177\000\000\217#\006\232\377\177\000\000\220\033\006\232\377\177\000\000\217#\006\232\377\177", '\000' <repeats 16 times>, "15", '\000' <repeats 24 times>"\377, \377\377\377\004\000\000\000\001\000\000\000\000\000\000\000(\000\000\000\060\000\000\000`&\006\232\377\177\000\000\240%\006\232\377\177", '\000' <repeats 26 times>, "[\000\000\000|\000\000\000w\000\000\000n\000\000\000\377\377\377\377\060\000\000\000\001", '\000' <repeats 15 times>, "p", '\000' <repeats 15 times>, "0\033\006\232\377\177\000\000\300\034\006\232\377\177\000\000 \034\006\232\377\177\000\000\001\000\000\000\000\000\000\000\020%\006\232\377\177\000\000\005\000\000\000\a\000\000\000\023\000\000\000U\000\000\000\300\034\006\232\377\177\000\000\060\334/\255\060\000\000\000\001\200\255\373\000\000\000\000\300"... [...]
Created attachment 519424 [details] crash dump produced by abrt
*** Bug 733049 has been marked as a duplicate of this bug. ***
Created attachment 519926 [details] corosync-blackbox output from buzz nodes I think I'm hitting a similar issue. Sometimes when I server cman it doesn't get membership on all nodes. In this case on buzz 2/5 nodes think we have full membership and 3/5 nodes think we are missing buzz-01.
[root@beast corosync]# git bisect good bfddaef2e4aed60173a03cc6df2f43096c5d35fe is the first bad commit commit bfddaef2e4aed60173a03cc6df2f43096c5d35fe Author: Jan Friesse <jfriesse@redhat.com> Date: Mon Jul 25 15:18:10 2011 +0200 main: let poll really stop before totempg_finalize Signed-off-by: Jan Friesse <jfriesse@redhat.com> Reviewed-by: Steven Dake <sdake@redhat.com> (cherry picked from commit d4fb83e971b6fa9af0447ce0a70345fb20064dc1) :040000 040000 23728192b395bd114ab73ae9cd1a08a3ddec54b5 65bf1e2c159716871664ba2d0a7c6aea9a18de30 M exec
Comment #12 was in error. It is actually b58309c943aef2920492aeb83582689a28710457 Author: Jan Friesse <jfriesse@redhat.com> Date: Tue Jul 26 10:05:34 2011 +0200 Revert "totemsrp: Remove recv_flush code" This reverts commit 1a7b7a39f445be63c697170c1680eeca9834de39. Reversion is needed to remove overflow of receive buffers and dropping messages. Signed-off-by: Jan Friesse <jfriesse@redhat.com> (cherry picked from commit ddb5214c2c57194fe8e12d775398bfc5726743c4) this introduces the regression.
This problem does not reproduce using corosync init scripts (with encryption enabled). It requires cman join/cman leave. RHEL 6.1 cman service engine would segfault in this type of loop. It appears the segfault in RHEL 6.1 cman service engine was resolved, which then after further un-pealing exposes this problem. The fact the cman service engine is unique to the failure scenario helps further narrow the problem.
rec=[824051] Log Message=Originated 7 messages in RECOVERY. rec=[824052] Log Message=token retrans flag is 1 my set retrans flag0 retrans queue empty 1 count 2, aru ffffffff rec=[824053] Log Message=install seq 0 aru 7 high seq received 7 rec=[824054] Log Message=Received ringid(10.16.144.34:62876) seq 1 rec=[824055] Log Message=Received ringid(10.16.144.34:62876) seq 2 rec=[824056] Log Message=Received ringid(10.16.144.34:62876) seq 3 rec=[824057] Log Message=Received ringid(10.16.144.34:62876) seq 4 rec=[824058] Log Message=Received ringid(10.16.144.34:62876) seq 5 rec=[824059] Log Message=Received ringid(10.16.144.34:62876) seq 6 rec=[824060] Log Message=Received ringid(10.16.144.34:62876) seq 7 rec=[824061] Log Message=Received ringid(10.16.144.34:62876) seq 8 rec=[824062] Log Message=Received ringid(10.16.144.34:62876) seq 9 rec=[824063] Log Message=Received ringid(10.16.144.34:62876) seq a rec=[824064] Log Message=Received ringid(10.16.144.34:62876) seq b rec=[824065] Log Message=Received ringid(10.16.144.34:62876) seq c rec=[824066] Log Message=Received ringid(10.16.144.34:62876) seq d rec=[824067] Log Message=Received ringid(10.16.144.34:62876) seq e rec=[824068] Log Message=Received ringid(10.16.144.34:62876) seq f rec=[824069] Log Message=Received ringid(10.16.144.34:62876) seq 10 rec=[824070] Log Message=Received ringid(10.16.144.34:62876) seq 11 rec=[824071] Log Message=Received ringid(10.16.144.34:62876) seq 12 rec=[824072] Log Message=Received ringid(10.16.144.34:62876) seq 13 rec=[824073] Log Message=Received ringid(10.16.144.34:62876) seq 14 rec=[824074] Log Message=Received ringid(10.16.144.34:62876) seq 15 rec=[824075] Log Message=Received ringid(10.16.144.34:62876) seq 16 rec=[824076] Log Message=Received ringid(10.16.144.34:62876) seq 17 rec=[824077] Log Message=Received ringid(10.16.144.34:62876) seq 18 rec=[824078] Log Message=Received ringid(10.16.144.34:62876) seq 19 rec=[824079] Log Message=Received ringid(10.16.144.34:62876) seq 1a rec=[824080] Log Message=Received ringid(10.16.144.34:62876) seq 1b rec=[824081] Log Message=Received ringid(10.16.144.34:62876) seq 1c rec=[824082] Log Message=Received ringid(10.16.144.34:62876) seq 1d rec=[824083] Log Message=Received ringid(10.16.144.34:62876) seq 1e rec=[824084] Log Message=Received ringid(10.16.144.34:62876) seq 1f rec=[824085] Log Message=Received ringid(10.16.144.34:62876) seq 20 rec=[824086] Log Message=Received ringid(10.16.144.34:62876) seq 21 rec=[824087] Log Message=token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru 21 rec=[824088] Log Message=install seq 0 aru 21 high seq received 21 rec=[824089] Log Message=Received ringid(10.16.144.34:62876) seq 22 rec=[824090] Log Message=Received ringid(10.16.144.34:62876) seq 23 rec=[824091] Log Message=token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 21 rec=[824092] Log Message=install seq 23 aru 21 high seq received 21 rec=[824093] Log Message=token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 21 rec=[824094] Log Message=install seq 23 aru 21 high seq received 21 rec=[824095] Log Message=token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 21 rec=[824096] Log Message=install seq 23 aru 21 high seq received 21 rec=[824097] Log Message=token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 4, aru 21 rec=[824098] Log Message=install seq 23 aru 21 high seq received 21 rec=[824099] Log Message=token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 5, aru 21 rec=[824100] Log Message=install seq 23 aru 21 high seq received 21 rec=[824101] Log Message=Received ringid(10.16.144.34:62876) seq 22 rec=[824102] Log Message=Received ringid(10.16.144.34:62876) seq 23 rec=[824103] Log Message=token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 6, aru 21 rec=[824104] Log Message=install seq 23 aru 21 high seq received 21 rec=[824105] Log Message=Received ringid(10.16.144.34:62876) seq 22 rec=[824106] Log Message=Received ringid(10.16.144.34:62876) seq 23 rec=[824107] Log Message=token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 7, aru 21 rec=[824108] Log Message=install seq 23 aru 21 high seq received 21 rec=[824109] Log Message=Received ringid(10.16.144.34:62876) seq 22 rec=[824110] Log Message=Received ringid(10.16.144.34:62876) seq 23 rec=[824111] Log Message=token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 8, aru 21 rec=[824112] Log Message=install seq 23 aru 21 high seq received 21 rec=[824113] Log Message=Received ringid(10.16.144.34:62876) seq 22 rec=[824114] Log Message=Received ringid(10.16.144.34:62876) seq 23 rec=[824115] Log Message=token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 9, aru 21 rec=[824116] Log Message=install seq 23 aru 21 high seq received 21 rec=[824117] Log Message=Received ringid(10.16.144.34:62876) seq 22 rec=[824118] Log Message=Received ringid(10.16.144.34:62876) seq 23 rec=[824119] Log Message=token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 10, aru 21 rec=[824120] Log Message=install seq 23 aru 21 high seq received 21 rec=[824121] Log Message=Received ringid(10.16.144.34:62876) seq 22 rec=[824122] Log Message=Received ringid(10.16.144.34:62876) seq 23 rec=[824123] Log Message=token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 11, aru 21 rec=[824124] Log Message=install seq 23 aru 21 high seq received 21 rec=[824125] Log Message=Received ringid(10.16.144.34:62876) seq 22 rec=[824126] Log Message=Received ringid(10.16.144.34:62876) seq 23 rec=[824127] Log Message=token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 12, aru 21 rec=[824128] Log Message=install seq 23 aru 21 high seq received 21 rec=[824129] Log Message=Received ringid(10.16.144.34:62876) seq 22 rec=[824130] Log Message=Received ringid(10.16.144.34:62876) seq 23 rec=[824131] Log Message=token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 13, aru 21 rec=[824132] Log Message=install seq 23 aru 21 high seq received 21 rec=[824133] Log Message=Received ringid(10.16.144.34:62876) seq 22 rec=[824134] Log Message=Received ringid(10.16.144.34:62876) seq 23 rec=[824135] Log Message=token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 14, aru 21 rec=[824136] Log Message=install seq 23 aru 21 high seq received 21 rec=[824137] Log Message=Received ringid(10.16.144.34:62876) seq 22 rec=[824138] Log Message=Received ringid(10.16.144.34:62876) seq 23 rec=[824139] Log Message=token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 15, aru 21 rec=[824140] Log Message=install seq 23 aru 21 high seq received 21 rec=[824141] Log Message=Received ringid(10.16.144.34:62876) seq 22 rec=[824142] Log Message=Received ringid(10.16.144.34:62876) seq 23 rec=[824143] Log Message=token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 16, aru 21 rec=[824144] Log Message=install seq 23 aru 21 high seq received 21 rec=[824145] Log Message=Received ringid(10.16.144.34:62876) seq 22 rec=[824146] Log Message=Received ringid(10.16.144.34:62876) seq 23 rec=[824147] Log Message=Process pause detected for 8209 ms, flushing membership messages. rec=[824148] Log Message=entering GATHER state from 11. rec=[824149] Log Message=Restoring instance->my_aru 9381 my high seq received 938e rec=[824150] Log Message=entering GATHER state from 5. rec=[824151] Log Message=entering GATHER state from 14. rec=[824152] Log Message=Process pause detected for 8209 ms, flushing membership messages. Finishing replay: records found [34214] aru should eventually reach 23 since there are 24 sequence numbers in this recovery message transfer. We get stuck at 21. This happens via update_aru which calculates based upon the value of instance->my_high_seq_received. This value is only set on receipt of multicast messages that meet the following criteria: if (msg_len > 0 && msg_len <= FRAME_SIZE_MAX && sq_in_range (sort_queue, mcast_header.seq) && sq_item_inuse (sort_queue, mcast_header.seq) == 0) { ... update my_high_seq_received ... }
Added some more information to see why the variable wasn't being set. Got following: rec=[2582244] Log Message=token retrans flag is 1 my set retrans flag0 retrans queue empty 1 count 0, aru ffffffff rec=[2582245] Log Message=install seq 0 aru 1 high seq received 1 rec=[2582246] Log Message=Received ringid(10.16.144.34:11064) seq 1 encap 1 rec=[2582247] Log Message=msg len 264 in range 1 inuse 1 rec=[2582248] Log Message=Received ringid(10.16.144.34:11064) seq 2 encap 1 rec=[2582249] Log Message=msg len 263 in range 1 inuse 0 rec=[2582250] Log Message=Process pause detected for 18786 ms, flushing membership messages. rec=[2582251] Log Message=entering GATHER state from 11. rec=[2582252] Log Message=Restoring instance->my_aru 14 my high seq received 14 rec=[2582253] Log Message=entering GATHER state from 5. rec=[2582254] Log Message=entering GATHER state from 14. rec=[2582255] Log Message=Process pause detected for 18786 ms, flushing membership messages. Finishing replay: records found [35694] This occurs at same time as kernel dmesg output on console: megaraid_sas 0000:03:00.0: vpd r/w failed. This is likely a firmware bug on this device. Contact the card vendor for a firmware update. The only buzz node with a failure contains the following in its dmesg log: mpt2sas 0000:03:00.0: vpd r/w failed. This is likely a firmware bug on this device. Contact the card vendor for a firmware update. Bug #732956 might be related. Thread at http://www.spinics.net/lists/linux-scsi/msg52998.html might be related. Corosync will not function well when the kernel pauses for long periods of time. Jaroslav, can you validate the problem does not accompany a vpd r/w failed error in your system logs? Thanks -steve
memb_join messages are occurring within the recv_flush operation. This results in entering the gather state from recovery at totemsrp.c:4253. The token is accepted in recovery but during token processing the protocol switches to gather. It then operates on the regular sort queue when it should be operating on the recovery sort queue. One possible workaround would be to ignore all messages in the flush operation which are not of type MEMB_TYPE_MCAST. If that simple workaround doesn't work, the flush operation will have to create a list of non MEMB_TYPE_MCAST messages and deliver them in some other way. This second workaround sounds complicated. I'll give first workaround a spin in next few hours. Not sure on the other problems exposed in this bugzilla.
Created attachment 520740 [details] patch which fixes problem
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: Prior to this update, when corosync ran the "cman_tool join" and "cman_tool leave" commands in a loop, corosync sometimes terminated unexpectedly. This bug has been fixed, and corosync no longer crashes in the described scenario.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2011-1515.html