Created attachment 378340 [details]
code for 2 test cases

Description of problem:
corosync's shutdown process leaves open a window in which the daemon has started shutting down but still accepts new IPC connections. Because of this wrong behavior, several issues can be seen:

a) corosync crashes on shutdown (always)
b) clients connected to corosync hang (almost always - very racy)
c) corosync does not shut down at all (rare - also racy)

Version-Release number of selected component (if applicable):
1.2.0-1

How reproducible:
Depends on the test case and on which race we hit from test run to test run.

Steps to Reproduce:
1. Build the test cases in the attachment (the defaults trigger #a and #b).
2. Start corosync -f (or even just corosync) in one terminal.
3. Run the test case in another terminal.
4a. kill -TERM $(pidof corosync)
4b. Ctrl+C (if executing corosync -f)

Actual results:

a) corosync will always segfault on exit:

^CDec 14 20:58:34 corosync [SERV ] Unloading all Corosync service engines.
Dec 14 20:58:34 corosync [SERV ] Service engine unloaded: corosync extended virtual synchrony service
Dec 14 20:58:34 corosync [SERV ] Service engine unloaded: corosync configuration service
Dec 14 20:58:34 corosync [SERV ] Service engine unloaded: corosync cluster closed process group service v1.01
Dec 14 20:58:34 corosync [SERV ] Service engine unloaded: corosync cluster config database access v1.01
Dec 14 20:58:34 corosync [SERV ] Service engine unloaded: corosync profile loading service
Dec 14 20:58:34 corosync [SERV ] Service engine unloaded: corosync cluster quorum service v0.1
Segmentation fault

b) the test case will hang forever waiting for a reply from corosync (which is already dead):

0x00a68416 in __kernel_vsyscall ()
Missing separate debuginfos, use: debuginfo-install glibc-2.11-2.i686
(gdb) bt
#0  0x00a68416 in __kernel_vsyscall ()
#1  0x007d0f85 in sem_wait@@GLIBC_2.1 () from /lib/libpthread.so.0
#2  0x00952628 in reply_receive (res_len=<value optimized out>, res_msg=<value optimized out>, ipc_instance=<value optimized out>) at coroipcc.c:465
#3  coroipcc_msg_send_reply_receive (res_len=<value optimized out>, res_msg=<value optimized out>, ipc_instance=<value optimized out>) at coroipcc.c:973
#4  0x00ca924c in confdb_key_increment (handle=<value optimized out>, parent_object_handle=<value optimized out>, key_name=<value optimized out>, key_name_len=<value optimized out>, value=<value optimized out>) at confdb.c:1001
#5  0x0804892d in test_objdb ()
#6  0x0804897b in main ()
(gdb) q

c) I don't have a reduced test case for this yet, but we (Chrissie and I) have seen corosync stop partway through the shutdown process with all threads locked on a mutex in objdb. We reproduce this one using cluster stable3 on top of corosync. Note that this might be unrelated to this same issue (it's difficult to discern the issues); if so, we will file another bug.

d) even more rarely we have seen this backtrace (always in conjunction with a shutdown operation and active clients):

*** buffer overflow detected ***: corosync terminated
======= Backtrace: =========
/lib/libc.so.6(__fortify_fail+0x4d)[0x4bec5d]
/lib/libc.so.6(+0xf1d7a)[0x4bcd7a]
/usr/libexec/lcrso/service_confdb.lcrso(+0x1788)[0x257788]
/usr/lib/libcoroipcs.so.4(+0x2105)[0xdda105]
/lib/libpthread.so.0(+0x5ab5)[0x29dab5]
/lib/libc.so.6(clone+0x5e)[0x4a583e]

Expected results:
- corosync should not crash
- IPC should be locked (to refuse new connections) before the shutdown process starts
- current IPC connections should be notified of the shutdown (so clients can fail gracefully rather than hanging)

Additional info:
As mentioned, there might be multiple bugs involved. Please split as required and we will follow up as necessary. This is a major blocker at the moment.
The reason this is a critical issue is that if the config file is invalid or too old, corosync is told to exit immediately, which can trigger the init lockup.
Created attachment 378976 [details]
Proposed patch

Patch sent to the mailing list.
The patch seems to have made some progress but doesn't fix everything. Here are the test results using the test cases attached to this bug.

test_objdb seems to pass now and the client connection is closed without problems, but this might simply be luck, because the next test shows there is still a race condition.

test_conn still triggers a segfault:

term1: corosync -f
....
Dec 17 14:12:20 corosync [MAIN ] Completed service synchronization, ready to provide service.

term2: ./test

term3: killall -TERM corosync

and term1 segfaults.
(In reply to comment #0)
> c) corosync does not shutdown at all (rare - also racy)
---
> c) i don't have a reduced test case for it yet, but we (chrissie and I) have
> seen that corosync will stop the shutdown process and all threads are locked on
> a mutex in objdb. We reproduce this one using cluster stable3 on top of
> corosync. Note that this bug might be unrelated to this same issue (it's
> difficult to discern the issues) but in case we will file another bug.

I cannot find a way to reproduce this in a small test case. This is one way to do it:

- set up a cman cluster (2/3 nodes)

On one of the nodes:

term1: cman_tool -d join
term2: while [ ! -f stop ]; do ccs_tool query /cluster/@name; done
term1: cman_tool leave

This causes ccs_tool query to hang with the objdb mutex held, and corosync will _never_ shut down.

Probably worth a separate bug; if so, let's just clone this one.

Fabio
The segfaults are caused by the pthread_join in recently added code. Some other tweaks will be needed for that code to work properly. One of them could be a mutex together with a variable: the variable can be a pointer to a pthread_t, or NULL if the thread does not exist. Of course, maybe Steve will find a better and nicer solution, but I'm about to be off on PTO.
Created attachment 382272 [details]
Proposed patch - take 2

Fabio, this is a different patch. It should apply to current corosync trunk. In my testing (test case in the next attachment) it never fails. I didn't have time to test case c).
Created attachment 382274 [details]
Test case

This is the test case I use for testing patch take 2. testconn and testobjdb are compiled from the "code for 2 test cases" attachment.
So the patch looks a lot better. I don't see any more hangs on shutdown, and IPC seems to exit properly. I can still see a segfault in the executive at shutdown when running the test case code. Anyway, this is a super big step forward.

Fabio
Created attachment 394513 [details] proposed patchset - 1/6
Created attachment 394514 [details] proposed patchset - 2/6
Created attachment 394515 [details] proposed patchset - 3/6
Created attachment 394516 [details] proposed patchset - 4/6
Created attachment 394517 [details] proposed patchset - 5/6
Created attachment 394518 [details] proposed patchset - 6/6
I hope that with all the patches (the proposed patchset) the problem is gone; all the patches are already in trunk. Closing this as fixed upstream.