Red Hat Bugzilla – Bug 726142
openais deadlocks during rapid start/stop of process
Last modified: 2016-04-26 09:51:19 EDT
Description of problem:
Alan can add more details of the test case.

Version-Release number of selected component (if applicable):
openais-0.80.6-28.el5_6.1

How reproducible:
not sure

Steps to Reproduce:
1. need more detail from Alan Conway

Actual results:

Expected results:

Additional info:
Created attachment 515559 [details]
core file created by sending SIGSEGV to the deadlocked process
(gdb) thread apply all bt

Thread 10 (Thread 18873):
#0  0x00000031374cb696 in poll () from /lib64/libc.so.6
#1  0x000000000041cca5 in prioritized_timer_thread (data=<value optimized out>) at timer.c:125
#2  0x000000313800673d in start_thread () from /lib64/libpthread.so.0
#3  0x00000031374d44bd in clone () from /lib64/libc.so.6

Thread 9 (Thread 18876):
#0  0x000000313800a048 in pthread_mutex_unlock () from /lib64/libpthread.so.0
#1  0x0000003139c0a64a in _Unwind_Find_FDE () from /lib64/libgcc_s.so.1
#2  0x0000003139c07625 in ?? () from /lib64/libgcc_s.so.1
#3  0x0000003139c08a0e in ?? () from /lib64/libgcc_s.so.1
#4  0x0000003139c08e1b in _Unwind_ForcedUnwind () from /lib64/libgcc_s.so.1
#5  0x000000313800d160 in __pthread_unwind () from /lib64/libpthread.so.0
#6  0x0000003138005400 in sigcancel_handler () from /lib64/libpthread.so.0
#7  <signal handler called>
#8  0x000000313800aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#9  0x0000000000417b2f in worker_thread (thread_data_in=<value optimized out>) at wthread.c:73
#10 0x000000313800673d in start_thread () from /lib64/libpthread.so.0
#11 0x00000031374d44bd in clone () from /lib64/libc.so.6

Thread 8 (Thread 28753):
#0  0x00000031374d5887 in semop () from /lib64/libc.so.6
#1  0x000000000041b835 in pthread_ipc_consumer (conn=0x1ed76900) at ipc.c:412
#2  0x000000313800673d in start_thread () from /lib64/libpthread.so.0
#3  0x00000031374d44bd in clone () from /lib64/libc.so.6

Thread 7 (Thread 29052):
#0  0x00000031374d5887 in semop () from /lib64/libc.so.6
#1  0x000000000041b835 in pthread_ipc_consumer (conn=0x1ed76250) at ipc.c:412
#2  0x000000313800673d in start_thread () from /lib64/libpthread.so.0
#3  0x00000031374d44bd in clone () from /lib64/libc.so.6

Thread 6 (Thread 29338):
#0  0x00000031374d5887 in semop () from /lib64/libc.so.6
#1  0x000000000041b835 in pthread_ipc_consumer (conn=0x2aaab00009f0) at ipc.c:412
#2  0x000000313800673d in start_thread () from /lib64/libpthread.so.0
#3  0x00000031374d44bd in clone () from /lib64/libc.so.6

Thread 5 (Thread 32532):
#0  0x000000313800a048 in pthread_mutex_unlock () from /lib64/libpthread.so.0
#1  0x00000000004158fb in hdb_handle_get (handle=<value optimized out>, iovec=0x40e37e00, iov_len=1) at ../include/hdb.h:152
#2  totempg_groups_joined_reserve (handle=<value optimized out>, iovec=0x40e37e00, iov_len=1) at totempg.c:1157
#3  0x000000000041b894 in pthread_ipc_consumer (conn=0x2aaaf1440f20) at ipc.c:427
#4  0x000000313800673d in start_thread () from /lib64/libpthread.so.0
#5  0x00000031374d44bd in clone () from /lib64/libc.so.6

Thread 4 (Thread 627):
#0  0x00000031374d5887 in semop () from /lib64/libc.so.6
#1  0x000000000041b835 in pthread_ipc_consumer (conn=0x2aaaf1649f20) at ipc.c:412
#2  0x000000313800673d in start_thread () from /lib64/libpthread.so.0
#3  0x00000031374d44bd in clone () from /lib64/libc.so.6

Thread 3 (Thread 1013):
#0  0x000000313800d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003138008e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2  0x0000003138008cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x000000000041586b in totempg_groups_joined_reserve (handle=0, iovec=0x80, iov_len=0) at totempg.c:1155
#4  0x000000000041b894 in pthread_ipc_consumer (conn=0x2aaafe9f1a00) at ipc.c:427
#5  0x000000313800673d in start_thread () from /lib64/libpthread.so.0
#6  0x00000031374d44bd in clone () from /lib64/libc.so.6

Thread 2 (Thread 1091):
#0  0x000000313800d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003138008e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2  0x0000003138008cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x000000000041aeed in ipc_thread_active (conn=0x2aaac90ffc10) at ipc.c:187
#4  0x000000000041b81d in pthread_ipc_consumer (conn=0x2aaac90ffc10) at ipc.c:408
#5  0x000000313800673d in start_thread () from /lib64/libpthread.so.0
#6  0x00000031374d44bd in clone () from /lib64/libc.so.6

Thread 1 (Thread 18872):
#0  0x000000313800d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003138008e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2  0x0000003138008cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x000000000041af51 in ipc_disconnect (conn_info=0x2aaac90ffc10) at ipc.c:802
#4  0x000000000041bd30 in msg_send (conn=0x2aaac90ffc10, iov=<value optimized out>, iov_len=1, locked=1) at ipc.c:1078
#5  0x000000000041c5ae in outq_flush (handle=<value optimized out>, fd=<value optimized out>, revent=1, data=0x2aaac90ffc10) at ipc.c:1118
#6  poll_handler_connection (handle=<value optimized out>, fd=<value optimized out>, revent=1, data=0x2aaac90ffc10) at ipc.c:735
#7  0x0000000000405ce0 in poll_run (handle=0) at aispoll.c:402
#8  0x00000000004186be in main (argc=<value optimized out>, argv=<value optimized out>) at main.c:628
Thread 1 is returning an error condition in ipc_disconnect:802. Unfortunately errno is not available. It would be helpful to know what this error condition is - perhaps this code path is not typically exercised.
Created attachment 515571 [details]
Reproducer script

Script to reproduce the failure, run in a loop. The failure was only seen once, after 2 days of running the test continuously.
To run the test:

make
while true; do ./runme.sh; done
Honza, please try to reproduce this and determine the value of errno in thread 1 at ipc_disconnect:802.
Steve, I was not able to reproduce this problem after roughly 3 days of heavy running of the attached test case. But with a simple modification of msg_send in ipc.c to set err to -1 and errno to a random value, the deadlock really was there, so there is for sure a bug. I'm also pretty sure the error is something like ECONNRESET.

My proposed solutions:
- Use a reentrant mutex for conn_info->mutex. This shouldn't change any code logic and it's a very small change.
- Or apply the (res == -1 && errno == EAGAIN) handling to every res == -1. This ensures that disconnection is handled by upper layers and/or later, and that non-critical errors are also handled by retry.

Steve, what's your opinion there?
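For reference, a minimal sketch of the reentrant-mutex idea (illustrative only; the real conn_info structure and its mutex live in ipc.c):

    #include <pthread.h>

    pthread_mutexattr_t attr;

    pthread_mutexattr_init (&attr);
    pthread_mutexattr_settype (&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init (&conn_info->mutex, &attr);   /* instead of the default initializer */
    pthread_mutexattr_destroy (&attr);

With a recursive mutex, the second pthread_mutex_lock() taken by the same thread (msg_send -> ipc_disconnect) succeeds instead of deadlocking.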
Do you mean PTHREAD_MUTEX_RECURSIVE? This seems reasonable - another option would be to create an ipc_disconnect_unlocked() function and call it when locked == 1. As you fix the bug, please also handle EINTR by retrying the send operation in ipc.c, in case that is the cause of the ipc_disconnect call. ECONNRESET is a possibility too (which would still just call ipc_disconnect).
Created attachment 518695 [details]
Proposed patch 1 - Handle EINTR

This patch handles EINTR by retrying the send.
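Only the attached diff is authoritative; as a rough illustration of the retry-on-EINTR pattern (hypothetical helper, not the actual patch):

    #include <errno.h>
    #include <sys/socket.h>

    /* Hypothetical helper: retry a send() that was interrupted by a signal.
     * The caller (msg_send in ipc.c) would then only treat errors other than
     * EAGAIN as reasons to call ipc_disconnect. */
    static ssize_t send_retry_eintr (int fd, const void *buf, size_t len)
    {
            ssize_t res;

            do {
                    res = send (fd, buf, len, MSG_NOSIGNAL);
            } while (res == -1 && errno == EINTR);

            return res;
    }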
Created attachment 518696 [details]
Proposed patch 2 - Don't deadlock on ipc_disconnect

This patch handles the deadlock by preventing the double lock.
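Again, the attached patch is authoritative; the general shape of avoiding the double lock, along the lines of the ipc_disconnect_unlocked() idea suggested earlier, would look roughly like this (hypothetical sketch, not the actual diff):

    /* Hypothetical sketch: callers that already hold conn_info->mutex
     * (locked == 1 in msg_send) call the unlocked variant directly and
     * never re-acquire the mutex. */
    static void ipc_disconnect_unlocked (struct conn_info *conn_info)
    {
            /* ... existing disconnect logic, without lock/unlock ... */
    }

    static void ipc_disconnect (struct conn_info *conn_info)
    {
            pthread_mutex_lock (&conn_info->mutex);
            ipc_disconnect_unlocked (conn_info);
            pthread_mutex_unlock (&conn_info->mutex);
    }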
Alan, because I was not able to reproduce the issue, can you please test the two attached patches and confirm/deny that they solve the problem? Thanks, Honza
re comment 13: I can't readily reproduce this either. I only saw it once and that was after running the test for 2 days.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
In rare cases, rapidly starting and stopping the openais service resulted in a deadlock. This update adapts the underlying source code to prevent such a deadlock, resolving this issue. Deadlocks no longer occur during rapid start/stop cycles.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0180.html