Bug 732698
Summary: corosync sometimes crashes while doing cman_tool join; cman_tool leave in a loop

Product: Red Hat Enterprise Linux 6
Component: corosync
Version: 6.2
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Reporter: Jaroslav Kortus <jkortus>
Assignee: Jan Friesse <jfriesse>
QA Contact: Cluster QE <mspqa-list>
CC: cluster-maint, djansa, jwest, sdake
Target Milestone: rc
Target Release: ---
Keywords: TestBlocker, ZStream
Hardware: Unspecified
OS: Unspecified
Fixed In Version: corosync-1.4.1-2.el6
Doc Type: Bug Fix
Doc Text: Prior to this update, when corosync ran the "cman_tool join" and "cman_tool leave" commands in a loop, corosync sometimes terminated unexpectedly. This bug has been fixed, and corosync no longer crashes in the described scenario.
Clones: 734865 (view as bug list)
Bug Blocks: 734865, 734996, 734997
Last Closed: 2011-12-06 11:51:33 UTC
Description — Jaroslav Kortus, 2011-08-23 10:26:33 UTC

Created attachment 519424 [details]
crash dump produced by abrt
*** Bug 733049 has been marked as a duplicate of this bug. ***

Created attachment 519926 [details]
corosync-blackbox output from buzz nodes

I think I'm hitting a similar issue. Sometimes when I start cman it doesn't get membership on all nodes. In this case, on buzz, 2/5 nodes think we have full membership and 3/5 nodes think we are missing buzz-01.
[root@beast corosync]# git bisect good
bfddaef2e4aed60173a03cc6df2f43096c5d35fe is the first bad commit
commit bfddaef2e4aed60173a03cc6df2f43096c5d35fe
Author: Jan Friesse <jfriesse>
Date:   Mon Jul 25 15:18:10 2011 +0200

    main: let poll really stop before totempg_finalize

    Signed-off-by: Jan Friesse <jfriesse>
    Reviewed-by: Steven Dake <sdake>
    (cherry picked from commit d4fb83e971b6fa9af0447ce0a70345fb20064dc1)

:040000 040000 23728192b395bd114ab73ae9cd1a08a3ddec54b5 65bf1e2c159716871664ba2d0a7c6aea9a18de30 M  exec

Comment #12 was in error. The first bad commit is actually:

commit b58309c943aef2920492aeb83582689a28710457
Author: Jan Friesse <jfriesse>
Date:   Tue Jul 26 10:05:34 2011 +0200

    Revert "totemsrp: Remove recv_flush code"

    This reverts commit 1a7b7a39f445be63c697170c1680eeca9834de39. Reversion
    is needed to remove overflow of receive buffers and dropping messages.

    Signed-off-by: Jan Friesse <jfriesse>
    (cherry picked from commit ddb5214c2c57194fe8e12d775398bfc5726743c4)

This introduces the regression.

This problem does not reproduce using the corosync init scripts (with encryption enabled); it requires a cman join / cman leave cycle. The RHEL 6.1 cman service engine would segfault in this type of loop. It appears the segfault in the RHEL 6.1 cman service engine was resolved, which after further un-peeling exposes this problem. The fact that the cman service engine is unique to the failure scenario helps further narrow the problem.

rec=[824051] Log Message=Originated 7 messages in RECOVERY.
rec=[824052] Log Message=token retrans flag is 1 my set retrans flag0 retrans queue empty 1 count 2, aru ffffffff
rec=[824053] Log Message=install seq 0 aru 7 high seq received 7
rec=[824054] Log Message=Received ringid(10.16.144.34:62876) seq 1
rec=[824055] Log Message=Received ringid(10.16.144.34:62876) seq 2
rec=[824056] Log Message=Received ringid(10.16.144.34:62876) seq 3
rec=[824057] Log Message=Received ringid(10.16.144.34:62876) seq 4
rec=[824058] Log Message=Received ringid(10.16.144.34:62876) seq 5
rec=[824059] Log Message=Received ringid(10.16.144.34:62876) seq 6
rec=[824060] Log Message=Received ringid(10.16.144.34:62876) seq 7
rec=[824061] Log Message=Received ringid(10.16.144.34:62876) seq 8
rec=[824062] Log Message=Received ringid(10.16.144.34:62876) seq 9
rec=[824063] Log Message=Received ringid(10.16.144.34:62876) seq a
rec=[824064] Log Message=Received ringid(10.16.144.34:62876) seq b
rec=[824065] Log Message=Received ringid(10.16.144.34:62876) seq c
rec=[824066] Log Message=Received ringid(10.16.144.34:62876) seq d
rec=[824067] Log Message=Received ringid(10.16.144.34:62876) seq e
rec=[824068] Log Message=Received ringid(10.16.144.34:62876) seq f
rec=[824069] Log Message=Received ringid(10.16.144.34:62876) seq 10
rec=[824070] Log Message=Received ringid(10.16.144.34:62876) seq 11
rec=[824071] Log Message=Received ringid(10.16.144.34:62876) seq 12
rec=[824072] Log Message=Received ringid(10.16.144.34:62876) seq 13
rec=[824073] Log Message=Received ringid(10.16.144.34:62876) seq 14
rec=[824074] Log Message=Received ringid(10.16.144.34:62876) seq 15
rec=[824075] Log Message=Received ringid(10.16.144.34:62876) seq 16
rec=[824076] Log Message=Received ringid(10.16.144.34:62876) seq 17
rec=[824077] Log Message=Received ringid(10.16.144.34:62876) seq 18
rec=[824078] Log Message=Received ringid(10.16.144.34:62876) seq 19
rec=[824079] Log Message=Received ringid(10.16.144.34:62876) seq 1a
rec=[824080] Log Message=Received ringid(10.16.144.34:62876) seq 1b
rec=[824081] Log Message=Received ringid(10.16.144.34:62876) seq 1c
rec=[824082] Log Message=Received ringid(10.16.144.34:62876) seq 1d
rec=[824083] Log Message=Received ringid(10.16.144.34:62876) seq 1e
rec=[824084] Log Message=Received ringid(10.16.144.34:62876) seq 1f
rec=[824085] Log Message=Received ringid(10.16.144.34:62876) seq 20
rec=[824086] Log Message=Received ringid(10.16.144.34:62876) seq 21
rec=[824087] Log Message=token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru 21
rec=[824088] Log Message=install seq 0 aru 21 high seq received 21
rec=[824089] Log Message=Received ringid(10.16.144.34:62876) seq 22
rec=[824090] Log Message=Received ringid(10.16.144.34:62876) seq 23
rec=[824091] Log Message=token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 21
rec=[824092] Log Message=install seq 23 aru 21 high seq received 21
rec=[824093] Log Message=token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 21
rec=[824094] Log Message=install seq 23 aru 21 high seq received 21
rec=[824095] Log Message=token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 21
rec=[824096] Log Message=install seq 23 aru 21 high seq received 21
rec=[824097] Log Message=token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 4, aru 21
rec=[824098] Log Message=install seq 23 aru 21 high seq received 21
rec=[824099] Log Message=token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 5, aru 21
rec=[824100] Log Message=install seq 23 aru 21 high seq received 21
rec=[824101] Log Message=Received ringid(10.16.144.34:62876) seq 22
rec=[824102] Log Message=Received ringid(10.16.144.34:62876) seq 23
rec=[824103] Log Message=token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 6, aru 21
rec=[824104] Log Message=install seq 23 aru 21 high seq received 21
rec=[824105] Log Message=Received ringid(10.16.144.34:62876) seq 22
rec=[824106] Log Message=Received ringid(10.16.144.34:62876) seq 23
rec=[824107] Log Message=token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 7, aru 21
rec=[824108] Log Message=install seq 23 aru 21 high seq received 21
rec=[824109] Log Message=Received ringid(10.16.144.34:62876) seq 22
rec=[824110] Log Message=Received ringid(10.16.144.34:62876) seq 23
rec=[824111] Log Message=token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 8, aru 21
rec=[824112] Log Message=install seq 23 aru 21 high seq received 21
rec=[824113] Log Message=Received ringid(10.16.144.34:62876) seq 22
rec=[824114] Log Message=Received ringid(10.16.144.34:62876) seq 23
rec=[824115] Log Message=token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 9, aru 21
rec=[824116] Log Message=install seq 23 aru 21 high seq received 21
rec=[824117] Log Message=Received ringid(10.16.144.34:62876) seq 22
rec=[824118] Log Message=Received ringid(10.16.144.34:62876) seq 23
rec=[824119] Log Message=token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 10, aru 21
rec=[824120] Log Message=install seq 23 aru 21 high seq received 21
rec=[824121] Log Message=Received ringid(10.16.144.34:62876) seq 22
rec=[824122] Log Message=Received ringid(10.16.144.34:62876) seq 23
rec=[824123] Log Message=token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 11, aru 21
rec=[824124] Log Message=install seq 23 aru 21 high seq received 21
rec=[824125] Log Message=Received ringid(10.16.144.34:62876) seq 22
rec=[824126] Log Message=Received ringid(10.16.144.34:62876) seq 23
rec=[824127] Log Message=token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 12, aru 21
rec=[824128] Log Message=install seq 23 aru 21 high seq received 21
rec=[824129] Log Message=Received ringid(10.16.144.34:62876) seq 22
rec=[824130] Log Message=Received ringid(10.16.144.34:62876) seq 23
rec=[824131] Log Message=token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 13, aru 21
rec=[824132] Log Message=install seq 23 aru 21 high seq received 21
rec=[824133] Log Message=Received ringid(10.16.144.34:62876) seq 22
rec=[824134] Log Message=Received ringid(10.16.144.34:62876) seq 23
rec=[824135] Log Message=token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 14, aru 21
rec=[824136] Log Message=install seq 23 aru 21 high seq received 21
rec=[824137] Log Message=Received ringid(10.16.144.34:62876) seq 22
rec=[824138] Log Message=Received ringid(10.16.144.34:62876) seq 23
rec=[824139] Log Message=token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 15, aru 21
rec=[824140] Log Message=install seq 23 aru 21 high seq received 21
rec=[824141] Log Message=Received ringid(10.16.144.34:62876) seq 22
rec=[824142] Log Message=Received ringid(10.16.144.34:62876) seq 23
rec=[824143] Log Message=token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 16, aru 21
rec=[824144] Log Message=install seq 23 aru 21 high seq received 21
rec=[824145] Log Message=Received ringid(10.16.144.34:62876) seq 22
rec=[824146] Log Message=Received ringid(10.16.144.34:62876) seq 23
rec=[824147] Log Message=Process pause detected for 8209 ms, flushing membership messages.
rec=[824148] Log Message=entering GATHER state from 11.
rec=[824149] Log Message=Restoring instance->my_aru 9381 my high seq received 938e
rec=[824150] Log Message=entering GATHER state from 5.
rec=[824151] Log Message=entering GATHER state from 14.
rec=[824152] Log Message=Process pause detected for 8209 ms, flushing membership messages.
Finishing replay: records found [34214]

aru should eventually reach 23 since there are 24 sequence numbers in this recovery message transfer. We get stuck at 21. This happens via update_aru, which calculates based upon the value of instance->my_high_seq_received.
This value is only set on receipt of multicast messages that meet the following criteria:

    if (msg_len > 0 && msg_len <= FRAME_SIZE_MAX &&
        sq_in_range (sort_queue, mcast_header.seq) &&
        sq_item_inuse (sort_queue, mcast_header.seq) == 0) {
        ... update my_high_seq_received ...
    }

Added some more instrumentation to see why the variable wasn't being set. Got the following:

rec=[2582244] Log Message=token retrans flag is 1 my set retrans flag0 retrans queue empty 1 count 0, aru ffffffff
rec=[2582245] Log Message=install seq 0 aru 1 high seq received 1
rec=[2582246] Log Message=Received ringid(10.16.144.34:11064) seq 1 encap 1
rec=[2582247] Log Message=msg len 264 in range 1 inuse 1
rec=[2582248] Log Message=Received ringid(10.16.144.34:11064) seq 2 encap 1
rec=[2582249] Log Message=msg len 263 in range 1 inuse 0
rec=[2582250] Log Message=Process pause detected for 18786 ms, flushing membership messages.
rec=[2582251] Log Message=entering GATHER state from 11.
rec=[2582252] Log Message=Restoring instance->my_aru 14 my high seq received 14
rec=[2582253] Log Message=entering GATHER state from 5.
rec=[2582254] Log Message=entering GATHER state from 14.
rec=[2582255] Log Message=Process pause detected for 18786 ms, flushing membership messages.
Finishing replay: records found [35694]

This occurs at the same time as kernel dmesg output on the console:

    megaraid_sas 0000:03:00.0: vpd r/w failed. This is likely a firmware bug on this device. Contact the card vendor for a firmware update.

The only buzz node with a failure contains the following in its dmesg log:

    mpt2sas 0000:03:00.0: vpd r/w failed. This is likely a firmware bug on this device. Contact the card vendor for a firmware update.

Bug #732956 might be related. The thread at http://www.spinics.net/lists/linux-scsi/msg52998.html might also be related. Corosync will not function well when the kernel pauses for long periods of time.

Jaroslav, can you validate that the problem is not accompanied by a "vpd r/w failed" error in your system logs?

Thanks,
-steve

memb_join messages are occurring within the recv_flush operation. This results in entering the gather state from recovery at totemsrp.c:4253. The token is accepted in recovery, but during token processing the protocol switches to gather. It then operates on the regular sort queue when it should be operating on the recovery sort queue.

One possible workaround would be to ignore all messages in the flush operation which are not of type MEMB_TYPE_MCAST. If that simple workaround doesn't work, the flush operation will have to create a list of non-MEMB_TYPE_MCAST messages and deliver them in some other way. This second workaround sounds complicated. I'll give the first workaround a spin in the next few hours. Not sure on the other problems exposed in this bugzilla.

Created attachment 520740 [details]
patch which fixes problem
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Prior to this update, when corosync ran the "cman_tool join" and "cman_tool leave" commands in a loop, corosync sometimes terminated unexpectedly. This bug has been fixed, and corosync no longer crashes in the described scenario.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1515.html