Bug 590898 - corosync blocks on exit with debug: on enabled
Summary: corosync blocks on exit with debug: on enabled
Alias: None
Product: Fedora
Classification: Fedora
Component: corosync
Version: rawhide
Hardware: All
OS: Linux
Target Milestone: ---
Assignee: Steven Dake
QA Contact: Fedora Extras Quality Assurance
Depends On:
Reported: 2010-05-10 21:58 UTC by Steven Dake
Modified: 2016-04-26 21:50 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Last Closed: 2010-07-07 16:46:31 UTC
Type: ---

Description Steven Dake 2010-05-10 21:58:18 UTC
Description of problem:
corosync gets stuck in shutdown

Version-Release number of selected component (if applicable):

How reproducible:
openSUSE dependent

Steps to Reproduce:
Actual results:
locks up

Expected results:
doesn't lock up

Additional info:

A user attached to the process and captured this backtrace of all threads:

Thread 3 (Thread 0x7f679067e910 (LWP 19541)):
#0  0x00007f6792c41da6 in logsys_worker_thread (data=<value optimized out>) at logsys.c:766
#1  0x00007f679261865d in start_thread () from /lib64/libpthread.so.0
#2  0x00007f6792183e1d in clone () from /lib64/libc.so.6
#3  0x0000000000000000 in ?? ()

Thread 2 (Thread 0x7f679317dfb0 (LWP 19542)):
#0  0x00007f679261d965 in ?? () from /lib64/libpthread.so.0
#1  0x00000000004091b8 in prioritized_timer_thread (data=<value optimized out>) at timer.c:135
#2  0x00007f679261865d in start_thread () from /lib64/libpthread.so.0
#3  0x00007f6792183e1d in clone () from /lib64/libc.so.6
#4  0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7f67932756f0 (LWP 19540)):
#0  0x00007f679261996d in pthread_join () from /lib64/libpthread.so.0
#1  0x0000000000407595 in _corosync_exit_error (err=AIS_DONE_EXIT, file=<value optimized out>, line=<value optimized out>) at util.c:97
#2  0x0000000000406d3b in unlink_all_completed () at main.c:160
#3  0x0000000000408aa3 in service_exit_schedwrk_handler (data=0x7f679067e9e0) at service.c:614
#4  0x000000000040c64b in schedwrk_do (type=<value optimized out>, context=<value optimized out>) at schedwrk.c:77
#5  0x00007f6792e5b561 in token_callbacks_execute (type=<value optimized out>, instance=<value optimized out>) at totemsrp.c:3209
#6  message_handler_orf_token (type=<value optimized out>, instance=<value optimized out>) at totemsrp.c:3601
#7  0x00007f6792e51cd3 in rrp_deliver_fn (context=0x63e790, msg=0x661cd8, msg_len=70) at totemrrp.c:1393
#8  0x00007f6792e50cf2 in net_deliver_fn (handle=<value optimized out>, fd=<value optimized out>, revents=<value optimized out>, data=<value optimized out>) at totemudp.c:1223
#9  0x00007f6792e4cdda in poll_run (handle=2240235047305084928) at coropoll.c:396
#10 0x0000000000405c44 in main (argc=4, argv=<value optimized out>) at main.c:1556

Comment 1 Steven Dake 2010-05-10 22:06:35 UTC
logsys.c:766 is
                        log_rec_idx = record_read (buf, log_rec_idx, &log_msg);

What if this function is spinning?

In that case logsys.c:785 would never call pthread_exit, so the pthread_join in the main thread would never collect the worker thread's exit status and would block indefinitely on exit.

a break statement that occurs when no messages are waiting for flushing
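
To illustrate the reasoning in this comment, here is a minimal sketch of a flush worker that drains pending messages and then breaks out of its loop once shutdown has been requested, so that pthread_join can return. It is not the logsys code: struct flush_worker, flush_worker_thread, and flush_and_join are hypothetical names, and the real worker may be structured quite differently.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical shared state; none of these names come from logsys.c. */
struct flush_worker {
    pthread_mutex_t lock;
    pthread_cond_t  wake;
    int  pending;      /* messages waiting to be flushed */
    int  flushed;      /* messages flushed so far */
    bool should_exit;  /* set by the thread that wants to shut down */
};

static void *flush_worker_thread(void *data)
{
    struct flush_worker *w = data;

    pthread_mutex_lock(&w->lock);
    for (;;) {
        /* Drain everything that is queued before looking at the exit flag. */
        while (w->pending > 0) {
            w->pending--;
            w->flushed++;   /* stand-in for writing one log record */
        }
        /* Nothing left to flush: leave the loop if shutdown was requested. */
        if (w->should_exit)
            break;
        pthread_cond_wait(&w->wake, &w->lock);
    }
    pthread_mutex_unlock(&w->lock);
    return NULL;   /* returning ends the thread, so pthread_join can complete */
}

/*
 * Queue `count` messages, request shutdown, and join the worker.
 * Returns the number of messages flushed, or -1 on error.
 */
static int flush_and_join(int count)
{
    struct flush_worker w = { .pending = 0, .flushed = 0, .should_exit = false };
    pthread_t tid;

    assert(count >= 0);
    if (pthread_mutex_init(&w.lock, NULL) != 0)
        return -1;
    if (pthread_cond_init(&w.wake, NULL) != 0)
        return -1;
    if (pthread_create(&tid, NULL, flush_worker_thread, &w) != 0)
        return -1;

    pthread_mutex_lock(&w.lock);
    w.pending += count;
    w.should_exit = true;
    pthread_cond_signal(&w.wake);
    pthread_mutex_unlock(&w.lock);

    /* This is the call that blocks forever if the worker never terminates. */
    if (pthread_join(tid, NULL) != 0)
        return -1;

    pthread_cond_destroy(&w.wake);
    pthread_mutex_destroy(&w.lock);
    return w.flushed;
}
```

The exit check sits after the drain loop, so queued messages are still flushed before the thread ends. If the drain step itself never finished, the break would never be reached and the join would hang, which matches the symptom in the backtrace.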

Comment 2 Steven Dake 2010-05-10 22:20:59 UTC
Steps to reproduce:
1. Place debug: on in the config file.
2. service corosync start
3. Wait 10 seconds.
4. service corosync stop

This generates the exact stack trace above.

Comment 3 Jan Friesse 2010-05-11 08:46:28 UTC
From my debugging, it really is a problem in logsys (it overwrites its own memory).

Because <sdake> says he has just about got logsys rewritten, reassigning back to Steve.
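
As an illustration only, the sketch below shows one way an overwritten record header could make a reader spin, as Comment 1 speculates, and how a sanity check avoids it. Everything here is an assumption: the layout (each record begins with a word holding its length in words), RING_WORDS, next_record, and count_records are hypothetical and are not the real logsys format or API. This report does not establish the actual cause.

```c
#include <assert.h>

#define RING_WORDS 16u

/*
 * Hypothetical layout: each record starts with one word holding the record's
 * total length in words (header included). This is NOT the real logsys format.
 *
 * Advance from the record at `idx` to the next one. Returns the next index,
 * or RING_WORDS (an out-of-range sentinel) if the header is implausible, so
 * the caller can stop instead of looping forever.
 */
static unsigned int next_record(const unsigned int *ring, unsigned int idx)
{
    unsigned int len;

    assert(idx < RING_WORDS);
    len = ring[idx];

    /* A zero length would leave idx unchanged; an oversized one is corrupt. */
    if (len == 0 || len >= RING_WORDS)
        return RING_WORDS;

    return (idx + len) % RING_WORDS;
}

/*
 * Count the records between `head` and `tail`.
 * Returns the count, or -1 if a corrupt header was detected.
 */
static int count_records(const unsigned int *ring,
                         unsigned int head, unsigned int tail)
{
    int n = 0;
    unsigned int idx = head;

    while (idx != tail) {
        idx = next_record(ring, idx);
        if (idx == RING_WORDS)
            return -1;
        /* More records than words means the reader has lost sync. */
        if (++n > (int)RING_WORDS)
            return -1;
    }
    return n;
}
```

Without the length check, a header overwritten with zero would leave the index unchanged on every pass, and the reading thread would never reach its exit path.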
