Description of problem:
corosync gets stuck in shutdown.

Version-Release number of selected component (if applicable):
corosync-1.2.1

How reproducible:
openSUSE dependent

Steps to Reproduce:
1.
2.
3.

Actual results:
Locks up.

Expected results:
Doesn't lock up.

Additional info:
The user attached to the process and found this backtrace of all threads:

Thread 3 (Thread 0x7f679067e910 (LWP 19541)):
#0  0x00007f6792c41da6 in logsys_worker_thread (data=<value optimized out>) at logsys.c:766
#1  0x00007f679261865d in start_thread () from /lib64/libpthread.so.0
#2  0x00007f6792183e1d in clone () from /lib64/libc.so.6
#3  0x0000000000000000 in ?? ()

Thread 2 (Thread 0x7f679317dfb0 (LWP 19542)):
#0  0x00007f679261d965 in ?? () from /lib64/libpthread.so.0
#1  0x00000000004091b8 in prioritized_timer_thread (data=<value optimized out>) at timer.c:135
#2  0x00007f679261865d in start_thread () from /lib64/libpthread.so.0
#3  0x00007f6792183e1d in clone () from /lib64/libc.so.6
#4  0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7f67932756f0 (LWP 19540)):
#0  0x00007f679261996d in pthread_join () from /lib64/libpthread.so.0
#1  0x0000000000407595 in _corosync_exit_error (err=AIS_DONE_EXIT, file=<value optimized out>, line=<value optimized out>) at util.c:97
#2  0x0000000000406d3b in unlink_all_completed () at main.c:160
#3  0x0000000000408aa3 in service_exit_schedwrk_handler (data=0x7f679067e9e0) at service.c:614
#4  0x000000000040c64b in schedwrk_do (type=<value optimized out>, context=<value optimized out>) at schedwrk.c:77
#5  0x00007f6792e5b561 in token_callbacks_execute (type=<value optimized out>, instance=<value optimized out>) at totemsrp.c:3209
#6  message_handler_orf_token (type=<value optimized out>, instance=<value optimized out>) at totemsrp.c:3601
#7  0x00007f6792e51cd3 in rrp_deliver_fn (context=0x63e790, msg=0x661cd8, msg_len=70) at totemrrp.c:1393
#8  0x00007f6792e50cf2 in net_deliver_fn (handle=<value optimized out>, fd=<value optimized out>, revents=<value optimized out>, data=<value optimized out>) at totemudp.c:1223
#9  0x00007f6792e4cdda in poll_run (handle=2240235047305084928) at coropoll.c:396
#10 0x0000000000405c44 in main (argc=4, argv=<value optimized out>) at main.c:1556
logsys.c:766 is:

    log_rec_idx = record_read (buf, log_rec_idx, &log_msg);

What if this function is spinning? If it is, the pthread_exit at logsys.c:785 would never be reached. The pthread_join in the main thread would then never collect the worker's exit status, and the process would block indefinitely on exit. (That exit path is a break statement that is taken when no messages are waiting to be flushed.)
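The shutdown handshake described above can be sketched with plain pthreads. This is a minimal illustration only and is not corosync's logsys code. The struct logger type and the logger_worker, logger_post and logger_shutdown functions are hypothetical, and a simple counter stands in for the log ring buffer. The worker leaves its loop only when the exit flag is set and nothing is pending. If that break were never reached, for example because the flush step spun, logger_shutdown would block in pthread_join in the same way Thread 1 does in the backtrace above.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical logger state; names are illustrative, not corosync's. */
struct logger {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int  pending;        /* records waiting to be flushed */
    int  flushed;        /* records the worker has written out */
    bool exit_requested; /* set by the main thread at shutdown */
};

static void *logger_worker(void *arg)
{
    struct logger *lg = arg;

    pthread_mutex_lock(&lg->lock);
    for (;;) {
        /* Sleep until there is work or shutdown has been requested. */
        while (lg->pending == 0 && !lg->exit_requested)
            pthread_cond_wait(&lg->cond, &lg->lock);

        /* Leave only once the queue is drained. If this break is never
         * reached, the thread never returns and pthread_join() blocks. */
        if (lg->pending == 0 && lg->exit_requested)
            break;

        /* "Flush" one record (a real logger would typically release the
         * lock while doing I/O). */
        lg->pending--;
        lg->flushed++;
    }
    pthread_mutex_unlock(&lg->lock);
    return NULL; /* equivalent to pthread_exit(NULL) */
}

static void logger_post(struct logger *lg)
{
    pthread_mutex_lock(&lg->lock);
    lg->pending++;
    pthread_cond_signal(&lg->cond);
    pthread_mutex_unlock(&lg->lock);
}

static void logger_shutdown(struct logger *lg, pthread_t worker)
{
    pthread_mutex_lock(&lg->lock);
    lg->exit_requested = true;
    pthread_cond_signal(&lg->cond);
    pthread_mutex_unlock(&lg->lock);

    /* Returns only after logger_worker() has left its loop. */
    int rc = pthread_join(worker, NULL);
    assert(rc == 0);
    (void)rc;
}
```

A caller creates the worker with pthread_create(&t, NULL, logger_worker, &lg), posts records with logger_post, and calls logger_shutdown(&lg, t) on exit. The worker checks the pending count before each wait, so a signal sent before it starts waiting is not lost.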
Steps to reproduce:
1. Set "debug: on" in the config file.
2. service corosync start
3. test/cpgbench
4. Wait 10 seconds.
5. service corosync stop

This generates exactly the stack trace above.
From my debugging, the problem really is in logsys: it overwrites its own memory. Because of that, and because <sdake> says he has just about got logsys rewritten, I am reassigning this back to Steve.
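The memory overwrite suggests one way record_read could spin. The sketch below is hypothetical and does not reproduce logsys's actual record format. It assumes a ring of 32-bit words in which each record begins with a header word that gives the record length in words, header included. If that header is overwritten with 0, the read index never advances. A drain loop that waits for read_idx == write_idx would then never terminate. The hypothetical ring_next and ring_drain functions reject a zero or oversized length and cap the number of steps. A corrupt ring then makes the reader give up instead of hanging shutdown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_WORDS 16u /* hypothetical ring size, in 32-bit words */

/* Hypothetical layout: words[idx] holds the length (in words, header
 * included) of the record starting at idx. Not corosync's format. */
struct ring {
    uint32_t words[RING_WORDS];
    size_t   read_idx;
    size_t   write_idx;
};

/* Returns the index of the next record, or -1 if the header is corrupt.
 * Without the len == 0 check the index would stay where it is. */
static long ring_next(const struct ring *r, size_t idx)
{
    assert(idx < RING_WORDS);
    uint32_t len = r->words[idx];
    if (len == 0 || len > RING_WORDS)
        return -1;
    return (long)((idx + len) % RING_WORDS);
}

/* Walks from read_idx to write_idx and counts the records. It returns -1
 * on a bad header, or if more steps are taken than the ring could hold
 * (a corrupted length that skips past write_idx and cycles forever). */
static int ring_drain(struct ring *r, size_t *records)
{
    size_t steps = 0;

    *records = 0;
    while (r->read_idx != r->write_idx) {
        long next = ring_next(r, r->read_idx);
        if (next < 0 || ++steps > RING_WORDS)
            return -1; /* corrupt ring: give up rather than spin */
        r->read_idx = (size_t)next;
        (*records)++;
    }
    return 0;
}
```

Each record occupies at least one word, so a healthy ring can never need more than RING_WORDS steps. Exceeding that bound therefore signals corruption rather than a long backlog.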